Database Normalization

The origin of database normalization, for the purposes of this technical explanation, is E. F. Codd’s 1970 and 1972 papers. Codd (1970) stated, “A further advantage of the relational view is that it forms a sound basis for treating derivability, redundancy, and consistency of relations….” In this paper, Codd introduced first normal form (1NF) of a relational data model. A relation can be said to be in 1NF if the intersection of each attribute and tuple contains one and only one value rather than multiple values regardless of whether multiple values share some common attribute. An example of a relation in unnormalized form (UNF) is shown in Figure 1, and an example of that same relation in 1NF is shown in Figure 2. Assume that these are clients of a financial institution.

Figure 1

Example of the financial institution’s accounts relation in UNF

accounts
clientNoclientNameaccountNoaccountTypeaccountOpenDate
C1John Doe0123456789 9876543210Checking Savings1900-01-01 1902-03-01
C2Jane Doe135792468 246813579Checking Savings1901-05-01 1903-08-01

Figure 2

Example of the accounts relation in 1NF

accounts
clientNoclientNameaccountNoaccountTypeaccountOpenDate
C1John Doe0123456789Checking1900-01-01
C1John Doe9876543210Savings1902-03-01
C2Jane Doe135792468Checking1901-05-01
C2Jane Doe246813579Savings1903-08-01

In Figure 1, both the account numbers and the account open dates for each client were in the same attribute-tuple intersection. Splitting those two tuples such that each tuple only contains one value in each attribute-tuple intersection satisfies the criteria for 1NF.

Now that the relation is in 1NF, the next step is to convert the relation to 2NF. Codd (1972) stated, “A relation R is in second normal form if it is in first normal form and every non-prime attribute of R is fully dependent on each candidate key of R.” Insertion, update, and deletion dependency anomalies can occur if a relation is not normalized. The main idea of 2NF is to ensure that no essential information is lost from the relation should it ever be necessary to insert new data into a relation or to update or delete existing data from a relation.

For example, in the subject relation, John Doe has two accounts. What if John Doe were to inform the financial institution that he would like to close both accounts but that he plans to open a new account in the near future? As the accounts relation currently exists, deleting John Doe’s accounts would result in a loss of John Doe’s data. The financial institution would like to retain John Doe’s information, however, because John Doe plans to open a new account in the near future. The financial institution still considers John Doe a client because there is an understanding that John Doe still intends to work with the financial institution. So, to not lose John Doe’s data during the time between his account closures and his planned new account opening, the financial institution can convert the accounts relation from 1NF to 2NF by creating a new clientInformation relation for client information that can relate to the accounts relation on the clientNo attribute. That way, deleting John Doe from the accounts relation does not result in the financial institution losing John Doe’s essential information. In the future, the financial institution will be able to insert John Doe’s clientNo attribute, and all associated attributes and tuples, back into the accounts relation which will then link John Doe’s clientInformation data to the newly inserted data into the accounts relation. Also, due to the criteria that there be no partial dependencies on just a part of the candidate keys (there exists a composite key of clientNo and accountNo), the financial institution will have to further decompose the relation because accountType and accountOpenDate depend solely on accountNo and not also on clientNo. To achieve this, a new relation called accountInformation must be created. Figure 3 below shows the conversion of the original relation from 1NF to 2NF.

Figure 3

Conversion of the accounts Relation to 2NF by Creating the clientInformation Relation

clientInformation
clientNoclientName
C1John Doe
C2Jane Doe
accountInformation
accountNoaccountTypeaccountOpenDate
0123456789Checking1900-01-01
9876543210Savings1902-03-01
135792468Checking1901-05-01
246813579Savings1903-08-01
clientAccount
clientNoaccountNo
C10123456789
C19876543210
C2135792468
C2246813579

Thus, returning to Codd’s definition of second normal form, it can be seen that the original relation, prior to normalizing it into 2NF, can be recovered, or restored, by joining the clientInformation relation and the clientAccount relation on the clientNo attribute and the accountsInformation relation and the clientAccount relation on the accountNo attribute.

Typically, the final step, at minimum, is to bring the data into third normal form. Codd (1972) stated, “A relation R is in third normal form if it is in second normal form and every non-prime attribute of R is non-transitively dependent on each candidate key of R.” The first criterion for 3NF is that a relation R be in 2NF, which is now the case. The second criterion for 3NF is that every non-prime attribute of R be non-transitively dependent on each candidate key of R. There are no transitive dependencies present in any of the relations; clientName fullydepends on clientNo in the clientInformation relation, accountType and accountOpenDate fully depend on accountNo in the accountInformation relation, and no non-key attributes exist for there to be any dependency in the clientAccount relation. Thus, the second criterion for 3NF is satisfied. Therefore, the data is also in 3NF. Sometimes, converting data to 2NF also yields 3NF if the criteria for 3NF are satisfied by virtue of the nature of the data, which happens to be the case with this example.

References

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.

Codd, E. F. (1972). Further normalization of the data base relational model. In R. Rustin (Ed.), Data base systems (Courant Computer Science Symposium Series No. 6, pp. 33–64). Prentice-Hall.

Additional Reading

Codd, E. F. (1974). Recent investigations in relational data base systems. Information Processing 74, 1017-1021.

Connolly, T. M., & Begg, C. E. (2015). Database systems: A practical approach to design, implementation, and management (6th ed.). Pearson.