What Is Database Normalization? What Are the Normal Forms?
Database normalization is a structured set of steps for optimally designing a database model. Through database normalization, database administrators, data engineers and data architects can model and design a framework for storing an application's data in such a way that the database layer of the application functions at maximum efficiency.
In this context, the data model refers to the set of relationships and conceptual entities specific to the business use case that the data represents. For example, in the case of retail banking data, this may include financial transactions and customer savings account information.
Since most websites and software applications rely on the storage of data to operate efficiently, database normalization remains a crucial step in data management.
It's worth noting there is also a sixth normal form (6NF). That said, 6NF is not yet standardized so I haven't included it in this definition.
Related Reading From Built In's Tech DictionaryWhat Is Data Architecture
The database normalization process is a critical step in our database management for many reasons. Database normalization helps us:
Database anomalies are errors that occur when modifying data, such as inserting, updating or deleting specific data values or attributes. These changes can lead to incorrect or missing data. Normalizing the data can prevent the database from suffering from these problems during daily operations.
Functional dependencies are relationships between database attributes, which are the conceptual and physical entities the data represents. A customer's attribute could be their phone number or home address.
Here's how it works: One attribute (A) is functionally dependent on another attribute (B) when, by knowing the value of B, you can uniquely determine the value of A. This process does not work in reverse. In other words, functional dependencies represent relationships between data. These are best organized or managed through normalization, a process that I’ll demonstrate below.
In database design, we want to avoid having many parts of the database (and its data) depend on several other parts of the database. In other words, we want to avoid a tightly coupled system.
In a tightly coupled system, one part of the system can break or stop functioning, thereby leading to many other parts breaking, which ultimately negatively affects database performance.
Normalization helps database administrators achieve a loosely coupled system that's more modular and less inter-dependent. This loose coupling helps with operational diagnostics and root-cause analysis of problems when they inevitably occur.
More Database 101 From Our ExpertsWhat Is a Relational Database?
In database management, redundancies mostly refer to having duplicate data in the system, which we can avoid through normalization. Avoiding duplicate data is important since it can save us the additional money and storage resources required to maintain it, in addition to the risk of having to update/maintain data consistency in multiple places instead of simplifying the data model.
A positive side effect of all the above is the avoidance of bugs, system issues and ultimately even potential downtime of the software application. System downtime can happen due to failures at the database layer if the data is not properly normalized. This aspect directly connects the proper management and storage of application data in the back-end systems with the customer experience on the front end — and ultimately the market success of the company that owns the application.
A normalized database also helps achieve efficiency gains. For example, your system may need to execute a smaller set of operations to model, serve and process the data for customers. A smaller set of operations due to well-normalized data also requires less computing power, thereby leading to cost savings on the software infrastructure (or stack) on which the database application runs. For this reason, database normalization is also directly tied to savings in operational costs.
Now that we understand the benefits of normalization, let's go through the steps for normalizing our databases. We refer to these as the normal forms.
At this step, our data is not yet normalized, so it's full of redundancies and still in its raw form without any structure or modeling logic applied to it.
Moving from 0NF to 1NF involves:
Let's look at an example of what it takes to move data from 0NF to 1NF using books as our sample data set.
Below you can see that the table represents data in zero normal form because each book stores multiple author information across different attributes (Author 1 and Author 2), which are repeating columns of the same data. In addition, there are columns such as Author Name that store multiple values, and this needs to be corrected.
We eliminate redundancies by splitting the data into a Books Table (and creating a Book ID identifier) and a Book-Author table. The Books Table now has a Book ID identifier, while the Book-Author Table stores all author ID information in one single column (the Author ID), plus additional author attributes (author's name, gender and geographical location).
We only have one book in this example but the structure holds as we increase the number of books we store.
Learn From Built In's ExpertsWhat Is a Non-Relational Database?
Moving from 1NF to 2NF involves:
This data is not fully functionally dependent since, per the functional dependency definition we discussed above, we can't uniquely determine the author's name based on certain other attributes, such as Book ID.
This is because BookID has two separate authors, which means we need to correctly take the data to its second normal form.
To uniquely determine the value of one attribute from the value of its key, we split the data further and create the Authors table. With the data in this form, we are now able to:
Moving from 2NF to 3NF involves:
Once we re-evaluate the data under the third normal form rules, we notice there's a non-ID field (Province) that uniquely determines the value of the Country attribute. Therefore, the Country is functionally dependent on Province, a non-ID column.
By eliminating this dependency, we can get the data in the third normal form. To do this, we create a Province table with the Province values set as the Province ID and keep the Province ID in the Authors table. Finally, we link the data and avoid the transitive dependency we saw in the second normal form.
Moving from 3NF to Boyce-Codd normal form involves:
The difference between this form and the third normal form is subtle. As in the third normal form, we’re primarily concerned with removing functional dependencies on non-key attributes, whereas here we are concerned with removing functional dependencies on potential key attributes.
At this point, our data is already extremely normalized so going beyond this point gets us into the realm of over-normalization, meaning normalizing to the point of introducing performance issues on the database since the data is overly separated and siloed.
In practice, these last two forms are not used. In theory, they are set to reduce the number of database anomalies even further. If you want to dig deeper into the theoretical details, here are a couple of helpful resources:
In common practice the first normal form through the Boyce-Codd normal form defines the set of steps most commonly used to achieve normalization benefits. The fourth and fifth forms are more rarely found in practice.
0NF — Zero Normal Form 1NF — First Normal Form 2NF — Second Normal Form 3NF — Third Normal Form Boyce-Codd Normal Form 4NF & 5NF — Fourth and Fifth Normal Forms How to Go From 0NF to 1NF How to Go From 1NF to 2NF How to Go From 2NF to 3NF