Last updated on September 4th, 2023
As I strive to expand my skill set, along with algorithms, I am focusing on honing my expertise in databases. In this post, I am going to find out the concept of data normalization, exploring its process through an example.
In simple terms, data normalization is a set of rules and practices for dealing with the design of relational databases and the arrangement of data in database tables. There are several normal forms, each with a specific objective. Let’s go through each of them to understand the normalization process.
Suppose I have a simple customer_order containing sales records, as below:
| customer_name | customer_email | product_name | quantity | supplier_name | supplier_phone |
|---|---|---|---|---|---|
| John Doe | john.doe@email.com | Product A, Product B | 2 | Supplier 1 | (555)123-4567, (555)987-6543 |
| John Doe | john.doe@email.com | Product C | 1 | Supplier 2 | (555)222-3333 |
| Alan Smith | alan.smith@email.com | Product A | 3 | Supplier 1 | (555)123-4567, (555)987-6543 |
Here is a demonstration of the normalization process:
1NF – First Normal Form
Objective: Ensure that each column contains atomic values (no repeating groups or multi-valued attributes) and has a primary key.
So based on the above objective, the productName should be separated multiple into different rows and add customer_id:
| customer_id | customer_name | customer_email | product_name | quantity | supplier_name | supplier_phone |
|---|---|---|---|---|---|---|
| 1 | John Doe | john.doe@email.com | Product A | 1 | Supplier 1 | (555)123-4567, (555)987-6543 |
| 1 | John Doe | john.doe@email.com | Product B | 1 | Supplier 1 | (555)123-4567, (555)987-6543 |
| 1 | John Doe | john.doe@email.com | Product C | 1 | Supplier 2 | (555)222-3333 |
| 2 | Alan Smith | alan.smith@email.com | Product A | 3 | Supplier 1 | (555)123-4567, (555)987-6543 |
Our table is now in 1NF because each column has atomic values, and there’s a primary key (customerId). However, the table has redundant customer, product, and supplier information.
2NF – Second Normal Form
Objective: Ensure that tables meet 1NF requirements and every non-primary key attribute is fully functionally dependent on the primary key. There are no partial dependencies.
So based on the above objective, the customer information, product information, and supplier information should be separated into different tables to reduce partial dependencies:
customer table
| customer_id | customer_name | customer_email |
|---|---|---|
| 1 | John Doe | john.doe@email.com |
| 2 | Alan Smith | alan.smith@email.com |
product table
| product_id | product_name | supplier_name | supplier_phone |
|---|---|---|---|
| 1 | Product A | Supplier 1 | (555)123-4567, (555)987-6543 |
| 2 | Product B | Supplier 1 | (555)123-4567, (555)987-6543 |
| 3 | Product C | Supplier 2 | (555)222-3333 |
order table
| order_id | customer_id | product_id | quantity |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 1 |
| 2 | 1 | 3 | 1 |
| 3 | 2 | 1 | 3 |
The tables are now in 2NF, and there are no partial dependencies. In the customer table, non-key columns such as customer_name, customer_email are functionally dependent on customer_id. customer_idand product_id are foreign keys connecting the three tables – they depend on the primary keys of their respective tables (customer and product).
3NF – Third Normal Form
Objective: Tables meet 2NF requirements, and no non-key column is transitively dependent on another non-key column.
To achieve 3NF, remove any transitive dependencies that may exist in non-key columns. product and suppliers tables still have a transitive dependency because the supplier_name is a non-key column that depends on the product_id. I’ll separate the supplier information into a new supplier table:
After applying 3NF:
product table
| product_id | product_name | supplier_id |
|---|---|---|
| 1 | Product A | 1 |
| 2 | Product B | 1 |
| 3 | Product C | 2 |
supplier table
| supplier_id | supplier_name | supplier_phone |
|---|---|---|
| 1 | Supplier 1 | (555)123-4567, (555)987-6543 |
| 2 | Supplier 2 | (555)222-3333 |
Now all the tables satisfy the requirements for the 3NF:
ordertable: Quantity non-key column is functionally dependent on the primary key, OrderID, with no transitive dependencies.customertable:customer_name,customer_emailare functionally dependent oncustomer_idand do not depend on other non-key columns.producttable:product_nameandsupplier_idcolumns are functionally dependent onproduct_idand don’t depend on other non-key columns
4NF – Fourth Normal Form
Objective: Tables meet 3NF requirements, and eliminate multi-valued dependencies.
Since each supplier can have multiple phone numbers, there is a multi-valued dependency in the supplier table. To achieve 4NF, we need to eliminate multi-valued dependencies by dividing the supplier table into two separate tables – Suppliers and SupplierPhoneNumbers:
Eliminate multi-valued dependencies by further dividing the supplier table into supplier and supplier_info tables:
supplier table
| SupplierID | SupplierName |
|---|---|
| 1 | Supplier 1 |
| 2 | Supplier 2 |
supplier_info table
| SupplierID | SupplierPhone |
|---|---|
| 1 | (555)123-4567 |
| 1 | (555)987-6543 |
| 2 | (555)222-3333 |
Our final normalized design contains the following tables, adhering to the 4NF:
customer table
| CustomerID | CustomerName | CustomerEmail |
|---|---|---|
| 1 | John Doe | john.doe@email.com |
| 2 | Alan Smith | alan.smith@email.com |
product table
| ProductID | ProductName | SupplierID |
|---|---|---|
| 1 | Product A | 1 |
| 2 | Product B | 1 |
| 3 | Product C | 2 |
supplier table
| SupplierID | SupplierName |
|---|---|
| 1 | Supplier 1 |
| 2 | Supplier 2 |
supplier_info table
| SupplierID | SupplierPhone |
|---|---|
| 1 | (555)123-4567 |
| 1 | (555)987-6543 |
| 2 | (555)222-3333 |
order table
| OrderID | CustomerID | ProductID | Quantity |
|---|---|---|---|
| 1 | 1 | 1 | 2 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 3 | 1 |
| 4 | 2 | 1 | 3 |
With these tables, we achieve a 4NF normalized design that eliminates most data anomalies and redundancies while allowing data to be efficiently stored and queried for a customer order management system. This design ensures better data integrity, maintainability, and scalability for the entire database.
