Last updated on September 4th, 2023
As I strive to expand my skill set, along with algorithms, I am focusing on honing my expertise in databases. In this post, I am going to find out the concept of data normalization, exploring its process through an example.
In simple terms, data normalization is a set of rules and practices for dealing with the design of relational databases and the arrangement of data in database tables. There are several normal forms, each with a specific objective. Let’s go through each of them to understand the normalization process.
Suppose I have a simple customer_order
containing sales records, as below:
customer_name | customer_email | product_name | quantity | supplier_name | supplier_phone |
---|---|---|---|---|---|
John Doe | john.doe@email.com | Product A, Product B | 2 | Supplier 1 | (555)123-4567, (555)987-6543 |
John Doe | john.doe@email.com | Product C | 1 | Supplier 2 | (555)222-3333 |
Alan Smith | alan.smith@email.com | Product A | 3 | Supplier 1 | (555)123-4567, (555)987-6543 |
Here is a demonstration of the normalization process:
1NF – First Normal Form
Objective: Ensure that each column contains atomic values (no repeating groups or multi-valued attributes) and has a primary key.
So based on the above objective, the productName
should be separated multiple into different rows and add customer_id
:
customer_id | customer_name | customer_email | product_name | quantity | supplier_name | supplier_phone |
---|---|---|---|---|---|---|
1 | John Doe | john.doe@email.com | Product A | 1 | Supplier 1 | (555)123-4567, (555)987-6543 |
1 | John Doe | john.doe@email.com | Product B | 1 | Supplier 1 | (555)123-4567, (555)987-6543 |
1 | John Doe | john.doe@email.com | Product C | 1 | Supplier 2 | (555)222-3333 |
2 | Alan Smith | alan.smith@email.com | Product A | 3 | Supplier 1 | (555)123-4567, (555)987-6543 |
Our table is now in 1NF because each column has atomic values, and there’s a primary key (customerId). However, the table has redundant customer, product, and supplier information.
2NF – Second Normal Form
Objective: Ensure that tables meet 1NF requirements and every non-primary key attribute is fully functionally dependent on the primary key. There are no partial dependencies.
So based on the above objective, the customer information, product information, and supplier information should be separated into different tables to reduce partial dependencies:
customer
table
customer_id | customer_name | customer_email |
---|---|---|
1 | John Doe | john.doe@email.com |
2 | Alan Smith | alan.smith@email.com |
product
table
product_id | product_name | supplier_name | supplier_phone |
---|---|---|---|
1 | Product A | Supplier 1 | (555)123-4567, (555)987-6543 |
2 | Product B | Supplier 1 | (555)123-4567, (555)987-6543 |
3 | Product C | Supplier 2 | (555)222-3333 |
order
table
order_id | customer_id | product_id | quantity |
---|---|---|---|
1 | 1 | 1 | 1 |
1 | 1 | 2 | 1 |
2 | 1 | 3 | 1 |
3 | 2 | 1 | 3 |
The tables are now in 2NF, and there are no partial dependencies. In the customer
table, non-key columns such as customer_name
, customer_email
are functionally dependent on customer_id
. customer_id
and product_id
are foreign keys connecting the three tables – they depend on the primary keys of their respective tables (customer
and product
).
3NF – Third Normal Form
Objective: Tables meet 2NF requirements, and no non-key column is transitively dependent on another non-key column.
To achieve 3NF, remove any transitive dependencies that may exist in non-key columns. product
and suppliers
tables still have a transitive dependency because the supplier_name
is a non-key column that depends on the product_id
. I’ll separate the supplier information into a new supplier
table:
After applying 3NF:
product
table
product_id | product_name | supplier_id |
---|---|---|
1 | Product A | 1 |
2 | Product B | 1 |
3 | Product C | 2 |
supplier
table
supplier_id | supplier_name | supplier_phone |
---|---|---|
1 | Supplier 1 | (555)123-4567, (555)987-6543 |
2 | Supplier 2 | (555)222-3333 |
Now all the tables satisfy the requirements for the 3NF:
order
table: Quantity non-key column is functionally dependent on the primary key, OrderID, with no transitive dependencies.customer
table:customer_name
,customer_email
are functionally dependent oncustomer_id
and do not depend on other non-key columns.product
table:product_name
andsupplier_id
columns are functionally dependent onproduct_id
and don’t depend on other non-key columns
4NF – Fourth Normal Form
Objective: Tables meet 3NF requirements, and eliminate multi-valued dependencies.
Since each supplier can have multiple phone numbers, there is a multi-valued dependency in the supplier
table. To achieve 4NF, we need to eliminate multi-valued dependencies by dividing the supplier
table into two separate tables – Suppliers and SupplierPhoneNumbers:
Eliminate multi-valued dependencies by further dividing the supplier
table into supplier
and supplier_info
tables:
supplier
table
SupplierID | SupplierName |
---|---|
1 | Supplier 1 |
2 | Supplier 2 |
supplier_info
table
SupplierID | SupplierPhone |
---|---|
1 | (555)123-4567 |
1 | (555)987-6543 |
2 | (555)222-3333 |
Our final normalized design contains the following tables, adhering to the 4NF:
customer
table
CustomerID | CustomerName | CustomerEmail |
---|---|---|
1 | John Doe | john.doe@email.com |
2 | Alan Smith | alan.smith@email.com |
product
table
ProductID | ProductName | SupplierID |
---|---|---|
1 | Product A | 1 |
2 | Product B | 1 |
3 | Product C | 2 |
supplier
table
SupplierID | SupplierName |
---|---|
1 | Supplier 1 |
2 | Supplier 2 |
supplier_info
table
SupplierID | SupplierPhone |
---|---|
1 | (555)123-4567 |
1 | (555)987-6543 |
2 | (555)222-3333 |
order
table
OrderID | CustomerID | ProductID | Quantity |
---|---|---|---|
1 | 1 | 1 | 2 |
2 | 1 | 2 | 2 |
3 | 1 | 3 | 1 |
4 | 2 | 1 | 3 |
With these tables, we achieve a 4NF normalized design that eliminates most data anomalies and redundancies while allowing data to be efficiently stored and queried for a customer order management system. This design ensures better data integrity, maintainability, and scalability for the entire database.