Tuesday, July 13, 2010

Open Database Conventions

I've been using a set of database table and field naming conventions and related standards that have been gradually adopted and refined over the last 10 years. I decided today that it would be a good idea to share these with others, particularly because I decided to bring Jim from UVOG in on a database project I'm starting on. I realized that sharing them with others on an even wider scale might be a good thing on the off-chance that, if someone else out there adopts them, and if our code ever ends up crossing paths, we will gain greater interoperability and understanding on account of using the same techniques.

Feel free to take some or all of these ideas and implement them in your own projects. I am willing to hear feedback and incorporate it back into the standard if you think there's a better way to do something that I've mentioned.

Some of these conventions will apply specifically to MySQL, while others will apply to the greater spectrum of databases. My experience spans MS Access, MS SQL Server, dBase, FoxPro, the Apollo Database Engine, and MySQL, with some very light experience in SQLite and PostgreSQL, but the bulk of my work today happens in MySQL.

Naming the Database

The name of the database itself is the least of my concerns. I would just recommend that in code that operates with or upon the database, this value should be easily configurable along with the hostname, port, and authentication credentials, so that any database could be selected.

Naming Tables

The name of a table should be a succinct name descriptive of what is to be represented by a single record in the table. If each record in the table represents an account, for example, name the table "account" (in the singular, all lowercase.) The reason we use singular forms is so that dot notation naming of fields makes sense when read out loud, for example, account.balance. There are many reasons we use all lowercase. Most production Database servers are case sensitive, and by using all lowercase we can eliminate the possibility of an error relating to case. It also improves code readability when SQL statements are written out with keywords and function names capitalized and with table and field names in lowercase.

There is a special case of naming in the event of a junction table. (A junction table is used to create a many-to-many relationship between two tables, and consists of nothing more than an id field, and foreign keys for the two tables whose records it is tying together.) The name of a junction table should be the name of the other two tables mashed together, with the name of the greater entity first. For example, if I have a company table and an customer table, and customers are potentially associated with multiple companies, I would name the junction table companyemployee, and it would contain id, kcompany, and kemployee, in that order. If there is a toss-up in determining which entity is "greater," place the two table names in alphabetical order.

Naming Fields

Always name your fields as though they are going to be used in dot notation. We want account.balance, not account.accountbalance, so do not redundantly repeat the table name within the field name. Avoid abbreviations that seem like a hack. For example, number is better than "no" or "num." Widely used abbreviations like ipaddress, ssn, or id are ok--be particularly careful to avoid expanding acronyms that are not expanded in common speech.

The Primary Key

In order to be properly maintained, every record needs to be addressable by a unique identifier. This identifier should not be meaningful in any real-world way for reasons that are explained in depth elsewhere, but I will give one brief example. If you were making an employee database, you might be tempted to use a person's SSN as the primary key for the employee table. Three years after the program is written, your company might hire someone who is in the country on a work visa, and therefore does not have an SSN, and now your system is broken. One will follow this line of thinking out and conclude that the key should be completely arbitrary and meaningless so that no such conflicting situation may ever occur. Additionally, for the sake of consistency, the primary key field in each table you make should be of the same name and type. I recommend naming the first field of each table "id", setting it to "auto_increment," and setting it to become the PRIMARY KEY index.

The value of this id field should be used internally for programming purposes and for describing relations between tables, but it should never be shown prominently to the user. If it is shown, it is merely for debugging or reference purposes. The number in this field, once generated, should not be changeable by the user, is not guaranteed to be sequential with the records around it, and is not guaranteed to fall into any range of values more specific than those allowed by an int (11) field.

Foreign Keys

A foreign key can be thought of as a "pointer" referring from the current record out to another record in another table (or possibly another record in the same table.) All foreign keys should be prefixed with the letter "k" (meaning key), and should either be named "parent" if the key is referring back to the same table for the purpose of producing a hierarchical system, or if referring to another table, it should be given the name of the table into which it points. If more than one key in a record must point to the same table for different purposes, the purpose should be appended to the name followed by an underscore. For example, kaddress_from and kaddress_to would be pointers to two records in the table named "address." The relation to one of them is described as "from" and the other as "to." This is the only time an underscore is permitted to be used in a field name, and the purpose of the underscore is to make the table name unambiguous.

When possible, try to put keys to parent records or records which are considered to "own" the current record near the top of the field listing, following just after id.

If these standards are followed for the primary key and foreign key fields, I could look up any reference using the following technique:

Given a field whose name begins with "k": Take everything in the field name following the "k" up to the end of the name, or up to (but not including) the first encountered underscore "_" character, and consider this to be the "lookup table name," if such a table exists. Retrieve the data by issuing a query in the form of:

SELECT * FROM [lookup table name] WHERE id = [value of the field]

Choosing Appropriate Data Types for Fields

Always choose the most restrictive type that can safely store the data, including valid values that you might not anticipate.

KEY FIELDS

For id fields or foreign key fields, use INT (11)

For boolean (yes/no or true/false) fields, use TINYINT (1) A value of zero means false, and any nonzero value means true. You can distinguish between individual nonzero values for record housekeeping if you wish, as long as others reading the data as merely true or false will not be under a misconception from not making such a distinction.

CURRENCY

For currency values less than a million dollars, use DECIMAL (8, 2) --- increase the first size by a sufficient number of digits if you need larger currency values.

NUMERIC DATA

For numeric data involving whole numbers or integers only, use INT (11) --- unless the size is astronomical, in which case, you will need to upgrade to a larger type.

For weights or non-integer quantities less than ten thousand, use DECIMAL (8, 4) --- increase the first size if values beyond ten thousand are needed, or both sizes if more precision than four decimal positions is needed. Four decimal places was chosen so that a number like 12.05% could be fully represented as 0.1205.

DATES AND TIMES

For date or time values or stamps where the "time part" has meaning (particularly in sorting), use the DATETIME type. A value of "0000-00-00 00:00:00" means not entered or unknown.

For date values where the time is irrelevant, use the DATE type. A value of "0000-00-00" means not entered or unknown.

MULTIPLE CHOICE

For multiple choice values: If there are a discrete number of choices presently and in the near future, few choices (about 8 or less), and little or no benefit would be had by allowing the user to customize the set of choices, an enumerated value may be used. Example: alignment ENUM ('left', 'right', 'center') DEFAULT 'left'

For multiple choice values where the choices are numerous or user defined or potentially user managed, a foreign key and a lookup table should be used instead of an enum.

STRINGS OF TEXT (NO MULTI-LINE VALUES)

For any string data: If the options are very well established, choose the smallest size category from the table below that can hold all possible values, or for data that is free-form, choose one category larger than you think is really necessary. I will give some example with each size category:

VARCHAR (4) -- A standard name suffix like Jr., Sr., II., III.
VARCHAR (8) -- Data known for certain to be less than or equal to 8 characters.
VARCHAR (16) -- A zip code. (Minimum to store a US zip code is presently 9 digits plus one hyphen.)
VARCHAR (24) -- A phone number, with decorations. The following number is 23 characters long, including spaces: +1 (541) 375-0448 x8888
VARCHAR (32) -- Product codes, SKU numbers, or generated numbers that are guaranteed to be less than or equal to 32 characters.
VARCHAR (48) -- A first(given) name by itself, or a last name(surname) by itself.
VARCHAR (64) -- A full name field (first and last name together.)
VARCHAR (128) -- An email address.
VARCHAR (248) -- A web address.

PARAGRAPHS OF TEXT, NOTES, MESSAGE CONTENT, CODE SNIPPETS

For anything multi-line, or potentially longer than 248 characters, use the TEXT type. If the data or text will be more than a couple of kilobytes, investigate the LONGTEXT type.

IMAGES OR BINARY FILES

I don't like big files taking up space in my InnoDB tables and slowing down my replication server. Instead, I store the original filename only (and only as a convenience) in a VARCHAR field, and then save the file separately named based upon the table name and record number the file is associated with. If I need to replicate these files, I let rsync take care of that. If the "attached" files MUST be synchronized with the other data in the record at any given moment, then it might be acceptable to resort to BLOB fields.

Order of Fields

When creating tables, the order of the fields really doesn't matter to the database engine. The sequence, therefore, should be chosen for purposes of clarity in documentation.


Well, that's all I have for now. Have at it, tear it up, and send me suggestions.

1 comment:

Unknown said...

I use CHAR for text strings up to 255 characters and VARCHAR(5000) for bigger things like comments.