Tuesday, July 15, 2008

Denormalizing Your Database

I just read an article by Jeff Atwood titled "Maybe Normalizing Isn't Normal". The gist is that a fully normalized database is not necessarily the correct approach for every data model, and that many people normalize their databases by default simply because that is what they have been taught or have been doing their entire careers. The majority of the comments on the article seemed to take issue with this, arguing that there are many alternatives that should be attempted first and that denormalization is a last resort. One comment invoked the classic quote, "Premature optimization is the root of all evil," to describe any design that denormalizes before other methods have been tried.

Jeff is fond of saying that he bashes people over the head in order to get his ideas across. Fine, maybe he doesn't say that and I'm paraphrasing, but I don't think I am too far off there. My take on the article is that we should not normalize our databases simply because we were told to do so and/or have always done so, but that we should think about the needs of our applications while we are designing them. Avoiding premature optimization does not mean we stop thinking and blindly follow guidelines. It is our job and responsibility to think about how our application uses the data it stores while we are writing the application. Does your application really need to allow a user to add any number of phone numbers? Can you get by with a column in the user table for each of work, home, fax, and pager? Your application will perform faster with the second approach, even though you may end up with duplicate phone numbers in your system. "What kind of application are you writing?" is a very important question.
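To make the phone number example concrete, here is a rough sketch of the two designs (the table and column names are made up for illustration, not taken from any real schema). The normalized version can hold any number of phone numbers per user but needs a join or a second query to read them; the flattened version reads everything in one row but caps you at four fixed slots.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Normalized: phone numbers live in their own table, one row per number.
    # Reading a user's numbers requires a join (or a second query).
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE phone_numbers (
            user_id INTEGER REFERENCES users(id),
            kind    TEXT,   -- 'work', 'home', 'fax', 'pager', ...
            number  TEXT
        );
    """)
    rows = conn.execute("""
        SELECT u.name, p.kind, p.number
        FROM users u
        LEFT JOIN phone_numbers p ON p.user_id = u.id
        WHERE u.id = ?
    """, (1,)).fetchall()

    # Denormalized: one column per kind of number, all on the user row.
    # One table, one row, no join -- but a fixed set of numbers per user.
    conn.executescript("""
        CREATE TABLE users_flat (
            id INTEGER PRIMARY KEY,
            name TEXT,
            work_phone TEXT, home_phone TEXT, fax TEXT, pager TEXT
        );
    """)
    row = conn.execute(
        "SELECT name, work_phone, home_phone, fax, pager FROM users_flat WHERE id = ?",
        (1,)).fetchone()

If every user realistically has at most a handful of well-known numbers, the flat version is simpler and faster; if "any quantity" is a real requirement, the separate table earns its join.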

When I started out in this internet-fangled industry, applications were much simpler. OK, that made me sound really old, and I've only been doing this professionally for 8ish years now. My point is that a fully normalized database used to be a good match for the UI and user experience of most applications. You edited a user, and the user form showed up. You added an address, and another form popped up. One form, one table, fast edits. That was in large part the kind of application most people were writing one or two decades ago. Most typical business applications still seem to work this way today.

Some people suggest that only companies with scaling issues should be considering denormalization. You can have a fairly complex query with twenty joins that runs fast if there are only limited numbers of records in your tables. Maybe only the Googles and eBays and Amazons need to worry about denormalizing their databases. Many common guidelines tend to get ignored at that scale. eBay even goes to the extreme of disabling foreign keys because constraint checking just takes too darn long at their volume.

But maybe it's that consumer applications are starting to look more like a blend of traditional data entry and reporting applications. The fastest reporting systems have extremely flattened database models; the star schema behind OLAP data cubes is the obvious example. I don't want to maintain two databases and the ETL between them for the user-facing side of every application I write. Not to mention that some real-time data requirements preclude the notion of running an ETL package on a schedule. But maybe that is the answer, and you can forget about your real-time data. Or maybe there's a hybrid data model that makes sense. The question is: is anyone smart enough to identify these problems before they become problems, and does it even make sense to try?
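For anyone who hasn't run into one, a star schema boils down to one wide fact table surrounded by small dimension tables, so most reporting queries are a single join away from everything they need. A tiny, hypothetical sketch (all names invented for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Dimension tables: small and descriptive, one row per thing.
        CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

        -- Fact table: one wide row per event, keyed by the dimensions,
        -- with the measures (quantity, amount) stored right on it.
        CREATE TABLE fact_sales (
            date_id    INTEGER REFERENCES dim_date(date_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            quantity   INTEGER,
            amount     REAL
        );
    """)

    # A typical report: one join per dimension, no chain of lookups.
    report = conn.execute("""
        SELECT d.year, p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d    ON d.date_id = f.date_id
        JOIN dim_product p ON p.product_id = f.product_id
        GROUP BY d.year, p.category
    """).fetchall()

The catch, as mentioned above, is that this shape is usually fed by an ETL job from the normalized transactional database, which is exactly the second database I don't want to maintain.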

Or maybe we are just starting to learn what is possible with software as the internet and software evolve. With a focus on connections between people and all the different entities they interact with on a daily basis, it would seem that for some of the more recent internet applications to respond quickly, you would want to remove as many steps between those concepts as possible. Every join you can get rid of is a performance gain. Granted, with today's technology it doesn't make sense to include all of a user's office information in the user table if you want to easily find coworkers. A nice index seek on a companyId column is still much faster. But what if the companyId column were also the name? (Crazy? It is not unheard of to store a three-letter country code directly in the address table and use the foreign key only as a value constraint. Why not the company name?) Is it really necessary to use a many-to-many relationship if a one-to-many will suffice? Is it necessary to find coworkers at all? It really does depend on the purpose of the application.
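Here's a rough sketch of what I mean, with made-up table names: the company name sits right on the user row (indexed, so finding coworkers is still a single index seek), and the companies table survives only as a value constraint rather than something you have to join through.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- The companies table exists mostly to constrain the set of valid names.
        CREATE TABLE companies (name TEXT PRIMARY KEY);

        -- The user row carries the company name directly: readable on its own,
        -- and coworker lookups never need to touch the companies table.
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            name TEXT,
            company_name TEXT REFERENCES companies(name)
        );
        CREATE INDEX idx_users_company ON users(company_name);
    """)

    # Finding coworkers is a single indexed lookup, no join required.
    coworkers = conn.execute(
        "SELECT name FROM users WHERE company_name = ?",
        ("Initech",)).fetchall()

The obvious cost is that a company rename now touches every user row instead of one lookup row, which is exactly the kind of trade-off you have to weigh against how your application actually uses the data.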

Concepts from applications on the public internet are even starting to affect business intranets. Have you googled "enterprise 2.0" recently? The last company I worked for created a document management application geared towards sales and marketing departments. The application used tags, ratings, comments, and other user actions to help identify and customize popular, successful, and/or high-quality business documents for various selling situations, so the marketing team knew what was working and what wasn't. There were plenty of times where the conventional wisdom of normalization was set aside because of the needs of the application.

I am not saying that normalizing your database is bad and denormalization is good. I am just saying that, as always, you need to think about the application you are writing and apply design principles accordingly. The problems that software solves are changing all the time. "If all the software we had was good enough, no one would need to write anything new." And as technologies go, the internet is still in its infancy. What we learn tomorrow may make that one-table database design a good idea. I really hope not, but who knows.