Tuesday, October 9, 2012

RDBMS Concepts and Normalization of Databases


Introduction:
The term 'database' has many interpretations; one definition is a 'collection of persistent data'.  A relational database is one in which the data consists of a 'collection of tables related to each other through common values'.

What is Relational Database?
Definition:
A relational database is a collection of data where:
Ø  Logically related data is grouped into tables within the database
Ø  Tables within a relational -database hold information that is of value to an organization.
Ø  Each table contains individual data items

Characteristics:
A relational database has the following characteristics:
Ø  Its components are conceptually simple
Ø  It has a limited number of components or objects
Ø  It has well-defined data operations that are based on a sound theoretical foundation
Ø  Its data design and definition processes are simple
Ø  Installation and operation of a relational database implementation are simple
Ø  A relationship is a logical link between two tables. 
Ø  A relational database management system (RDBMS) uses matching values in multiple tables to relate the information in one table with the information in the other table.
Ø  The presentation of data as tables is a logical construct, it is independent of the way the data is physically stored on disk.

The two most prominent characteristics of a relational database are:

1)    data stored in tables and
2)   relationships between tables

Basic Terms:
There are three fundamental relational terms:
TABLE-is the same familiar and easy-to-use data structure you see every day in newspapers, telephone books, schedules, etc.

A table:
Ø  Is a list of related data (an entity or a relation)
Ø  Is comprised of one or more column(s)
Ø  Is organized into rows
Ø  Cannot have duplicate rows (per relational theory)
Ø  COLUMN-describes a particular aspect of some entity such as a student or a course.

A column (also called as a field or attribute):
Ø  Characteristics of an item
Ø  Identifies a specific data type that dictates -what you, can and cannot do with the-column
Ø  Gives the data a name
Ø  Determines the relative position of the data in the table
Ø  Is analogous to a data item, element or field (in a non- relational database system)
Ø  Is accessed by its name, which is unique within the table
Ø  ROW-is a set of columns and their values in a particular table.

A row (also called as a record or tuple):
Ø  Information about a separate item
Ø  Contains only one occurrence of each column in the table
Ø  Is analogous to a single record in a file of many records
Ø  Is the smallest unit of insertion or deletion in a table
Ø  Has no inherent order in a table
Ø  Must be unique within a table


Summary
In a relational database, references are to:
Ø  Tables, rather than files
Ø  Colunms, rather than data elements or fields
Ø  Rows, rather than record occurrences
Ø  Data that is logically related

Biggest difference:
Ø  We select only the columns we want, not the whole row
Ø  We select just the rows we want, not all the records
Ø  We update just the column(s) we need to, not the whole record

More Terms
There are several additional important terms used in the relational database environment:
1.     Domain
2.    Result Set
3.    View
4.    Null

Domain
A domain is the set of all possible values for a given column. A domain:
Ø  Is not a physical part of a table or database
Ø  Is a conceptual object
Ø  Includes all possible values that may ever occur, even if the value is not currently in a table
Ø  May be shared by multiple columns
Ø  May be restricted based upon the data, type associated with the column (numeric versus character).

Result Set
A result set is:
Ø  The group of values returned from one query against the database


View
A view is a logical (or virtual) table that is derived from one or more tables.  It does not physically exist.

A view can:
Ø  Be made up of selected columns from a table
Ø  Be made up of selected rows from a table
Ø  Include data from more than one table
Ø  Include calculated fields (that is, derived data fields that are created as a result of a request and not part of an actual table)
Ø  Be created from other views
Ø  Be thought of as a subset of a table
Ø  Be used as a table
Ø  Be all of the above

Views are used to:
Ø  Simplify data retrieval commands by accessing a smaller and simpler version of a table
Ø  Limit access to certain data for a particular user or application program

NULLs
A null means that no value is currently assigned to a particular column in a specific row in a table.

Ø  May exist because the value of the column may not be known at the time the row is created
Ø  May exist because a value just does not apply for this row

Basic Table Rules
Tables contain data that is organized into rows. The rows have one value for each column.

A relational table can be compared to a file.

Relational tables follow these basic rules:
Ø  All data is represented by values, not pointers or physical connectors.
Ø  Each column in a given row has only one value.
Ø  No duplicate rows occur in a table.
Ø  No duplicate column names exist in a table.
Ø  The order of the rows in the table is not significant.
Ø  The order of the colunms in the table is not significant.
Ø  All relationships exist through values.

Relational Operations
Relational operations can modify an existing table or create a new version of the table when the result set of values is returned.

This result set can then be further manipulated as a table itself.

The relational operations that are integral in the relational database model are:
1.     Selection
2.    Projection
3.    Join
4.    Insertion
5.    Deletion
6.    Update
7.    Ordering
8.    Set Operations

Selection
Selection is the process of obtaining a subset of rows from a given table.

The effect of a selection is called a results table or result set.

Projection
A projection is a subset of columns from a given table.

A selection and projection results in a subset of both rows and columns.

Join
The join operation takes input from two or more tables to create a new results table.

This operation matches a value in common columns in the joined tables.

The results table contains rows and columns from both tables.

It is the concatenation of the projected columns from the two matching rows.

Joins fall into a variety of categories:
Ø  A natural join or equi-join is a join on an equal condition. Joins on other conditions are permitted but are "unnatural."
Ø  A join of a table to itself (self-join) is permitted.
Ø  An outer join is a join that accounts for NULLs or rows missing column matching values.
Ø  A full join provides all rows from each table. When there is an equal join between the tables, the output will consist of rows with matching values. Also selected will be rows that do not have matching values from either joined table.

Insertion
The insertion operation adds a new row to a table.

The position of the new row has no significance.

The new row may or may not contain columns with NULLS, but should always contain at least the primary key (which is not null).



Deletion
The deletion operation removes zero, one or more rows from a table.

The number of rows deleted depends upon the search criteria specified as part of the DELETE statement.  It is not necessary to SELECT the rows before deletion.

Update
The update operation allows individual column values in zero, one or many rows to be modified.  The number of rows to be UPDATED will depend upon the search criteria provided.

The update operation, just as the other table operations, applies to a specified set of rows.  The update operation does not require a row to be SELECTed before a column or columns can be updated.

Explicit search criteria can be used to designate a single target row.

Ordering
The ordering operation enables users to see the results of a query presented in a particular sequence.  The sequence can be ascending or descending by column.

Table rows in a relational database model do not have a significant order.

Set Operations
A set is the group of values returned from one access of the database.
Three kinds of operations can be performed on a set:
1.     Intersection
2.    Difference
3.    Union

Intersection
The intersection of two tables yields a results table containing those rows which appear in both tables.

Difference
The difference of two tables is the set of rows from the first table that do not occur in the second table.  This is also known as mutual exclusivity.  The order of the tables is significant.

Union
A union of two or more tables yields a results table containing all the rows selected from both tables.  Duplicate rows may optionally be eliminated in the results table.

Integrity Rules
The relational model defines several integrity rules that, while not part of the definition of the Normal Forms are nonetheless a necessary part of any relational database. There are two types of integrity rules: general and database-specific.

General Integrity Rules
The relational model specifies two general integrity rules. They are referred to as general rules, because they apply to all databases.

They are:
1.     entity integrity and
2.    referential integrity.

Entity Integrity
The entity integrity rule is very simple. It says that primary keys cannot contain null (missing) data. The reason for this rule should be obvious. You can't uniquely identify or reference a row in a table, if the primary key of that table can be null. It's important to note that this rule applies to both simple and composite keys. For composite keys, none of the individual columns can be null.

Referential Integrity
In a relational database, we often need to get columns out of more than one table.  The operation we use to do this is called a join. The vast majority of the time, a join is done on an equal condition of the primary key of one table to the foreign key of another.

The referential integrity rule says that the database must not contain any unmatched foreign key values. This implies that:

A row may not be added to a table with a foreign key unless the referenced value exists in the referenced table.
If the value in a table that's referenced by a foreign key is changed (or the entire row is deleted), the rows in the table with the foreign key must not be "orphaned."

In general, there are three options available when a referenced primary key value changes or a row is deleted.

The options are:
1.     Disallow
2.    Cascade
3.    Nullify

Disallow
The change is completely disallowed.

Cascade
For updates, the change is cascaded to all dependent tables.

For deletions, the rows in all dependent tables are deleted.

Nullify
For deletions, the dependent foreign key values are set to Null.



Database-Specific Integrity Rules
All integrity constraints that do not fall under entity integrity or referential integrity are termed database-specific rules or business rules. These type of rules are specific to each database and come from the rules of the business being modeled by the database. It is important to note that the enforcement of business rules is as important as the enforcement of the general integrity rules.

Keys
A key is a column or a combination of columns that distinguishes a row from all other rows.

Keys can be used for:
Ø  Accessing data
Ø  Establishing uniqueness of data
Ø  Sequencing the data

Keys are somewhat transparent to the user.  You do not need to know the key of a given row to access that row.  You will need to know the primary key if you want a unique occurrence.

In database design, there are four types of keys:
1.     Candidate keys
2.    Primary keys
3.    Alternate keys
4.    Foreign keys

Candidate keys
One or more columns in a table that may possibly serve as a primary key.

Primary keys
Stable, unique values. One of the candidate keys is chosen as the primary key.



Alternate keys
The candidate keys that are not chosen as the primary key. Alternate keys represent other ways that a user might need to search for data.

Foreign keys
When a primary key is also a column in another table, it is called a foreign key column in that other table.

What Is Referential Integrity?
There is a set of 'Business' rules that must be applied to each relationship which determines the status of the foreign key columns for rows when inserted or updated.

It is not sufficient to identify primary and foreign keys that make up relationships between tables.  The integrity and usability of a relationship are greatly affected by the values contained within the foreign key column(s).

In general, a foreign key must always contain either a value within the domain of foreign key values (values currently in the primary key column), or nulls.  Guaranteeing that an acceptable value is always in the foreign key column is known as referential integrity.

When using system-managed referential integrity, the primary key must be unique and have a unique index or the table is flagged as incomplete.


FEATURES OF A RDBMS

The following are the some of the features that can be expected from a RDBMS:
Ø  The ability to create multiple relations (tables) and enter data into them.
Ø  The ability to logically (index) or physically (sort) reorder the data in the tables.
Ø  An interactive query language.
Ø  The ability to answer queries that require data stored in more than one table.
Ø  A built-in report generator.

Additionally some packages may provide:
Ø  A text editor for use in writing an application program.
Ø  An application generator.
Ø  An interpreter to run the application program.
Ø  A screen generator to create custom screen form.

Popular Relational Database Products:
Ø  Although IBM did most of the research, Oracle delivered the first commercial relational database in 1979.
Ø  IBM delivered their first product,' SQL/Data System,' in 1982
Ø  Microsoft originally worked in partnership with Sybase to deliver SQL Server 4.2 in 1992.
Ø  In 1993 the partnership broke up with Sybase going after the Unix market and Microsoft pursuing the Windows NT market
Ø  Microsoft SQL Server 6.0 was released in 1995; version 6.5 shipped in 1996.
Ø  Microsoft SQL Server 7.0 shipped in November of 1998.  

The following is a list of vendors and their RDBMS products:
Ø  Microsoft's SQL Server 7.0
Ø  Oracle's Oracle 8
Ø  Sybase's Sybase SQL Server
Ø  IBM's DB2
Ø  Informix Software's Informix-SQL
Ø  Computer Associates' (formerly ASK Group) Ingres
Ø  Centura Software's (formerly Gupta) SQLBase
Ø  Borland's Database Engine
Ø  Interbase's (formerly Borland) Interbase
Ø  T.c.X's mySQL
E. F. CODD’S RULE AND
RELATIONAL DATABASE THEORY

A relational database is more than just data organized into related tables.

The relational database model is based firmly in the mathematical theory of relational algebra and calculus.

The original concept for the model was proposed by Dr. E.F. Codd in a 1970 paper entitled ‘A Relational Model of Data for Large Shared Data Banks’ .

Later Dr. Codd clarified his model by defining twelve rules (Codd’s Rules) that a database management system (DBMS) must meet inn order to be considered a relational database.

In practice, many database products are considered 'relational' even if they do not strictly adhere to all 12 rules.


A summary of Dr. E. F. Codd’s 12 rules is presented below:

Rule #1 :      Data is Presented in Tables (Information Rule)

Explanation :          A set of related tables forms a database and all data is represented as tables; the data can be viewed in no other way.
A table (relation or entity) is a logical grouping of related data in tabular form (rows and columns).
Each row (record or tuple) describes an item (person, place or thing) and each row contains information about a single item in the table.
Each column (field or attribute) describes a single characteristic about an item.
Each value (datum) is defined by the intersection of a row and column.
Data is atomic; there is no more than one value associated with the intersection of a row and column.
There is no hierarchical ranking of tables.
The relationships among tables are logical; there are no physical relationships among tables.

Summary :    All information in a relational database including table names, column names are represented explicitly by values in tables.

Benefits :    This simple view of data speeds design and learning.  User productivity is improved since knowledge of only one language is necessary to access all data such as description of the table and attribute definitions, integrity constraints, action to be taken when constraints are violated and security information.  These information’s are stored in the form of tables containing rows and columns.

Rule #2 :     Data is Logically Accessible (Guaranteed access rule)

Explanation :A relational database does not reference data by  physical location; there is no such  thing as the ‘fifth row in the customers table'.
Each piece of data must be logically accessible by referencing 1) a table; 2) a primary or unique key value; and 3) a column.
EXAMPLE: A specific employee in the 'Employee' table (e.g., Kapil Dev) and related information (last name, first name, ID, phone number, salary, etc) constitute a row.  An employee’s last name is a column.  The name ‘Dev’ is a data value.  Kapil Dev’s last name can be precisely located by referencing the ‘employee’ table, the appropriate column (last_name) and a unique key value (employee_id)  

Summary :    Every piece of data in a relational database can be accessed by using combination of a table name, a primary key value that identifies the row and a column name that identifies the cell.

Benefits :    User productivity is improved since there is no need to resort to using physical pointers or addresses.  Provides data independence.  Possible to retrieve each individual piece of data stored in a relational database by specifying the name of the table in which it is stored, the column key which identifies the row in which it is stored.

Rule #3 :     Nulls are Treated Uniformly As Unknown (Systematic treatment of nulls rule)

Explanation :          Null must always be interpreted as an unknown value.
Null means no value has been entered; the value is not known 'Unknown' is not the same thing as an empty string ("") or zero.
EXAMPLE:  If you pick up an item in a store and the price is not marked, the price is unknown (NULL); it is not free.  If not handled properly, nulls can cause confusion in your database.
EXAMPLE: If you search for all of the authors whose home state is not Tamil Nadu, the results will not include any authors with NULL in the  'state' column.  SQL is very literal.  You asked for authors where the state was NOT Tamil Nadu and NULL means 'unknown'.  A NULL value for 'state' may mean that the state is Tamil Nadu and it may mean that it is not Tamil Nadu; you just don't know.  Because the database engine can't tell for sure whether the state is not Tamil Nadu, a record with NULL will not be returned.
Nulls propagate through arithmetic expressions (e.g., 2 + NULL = NULL).  Comparing a null to any value, including itself, returns NULL.  

Summary :    The RDBMS handles records that have unknown or inapplicable values in a predefined fashion.  Also the RDBMS distinguishes between zeros, blanks and nulls in records and handles such values in a consistent manner that produces correct answers, comparisons and calculation.

Benefits :    Through the set of rules for handling nulls, users can distinguish results of queries that involves nulls, zeros and blanks.  It may not be apparent why th handling of null is important.

Rule #4 :     Active on line catalog used on relational model (Database is Self-Describing)

Explanation :          In addition to user data, a relational database contains data about itself. 
There are two types of tables in a RDBMS: user tables that contain the 'working' data and system tables contain data about the database structure.
Metadata is data that describes the structure of the database itself and includes object definitions (tables, indexes, stored procedures, etc.) and how they relate to each other.
The collection of system tables is also referred to as the  system catalog or data dictionary.
System tables can be accessed in the same manner as user tables.  

Summary :    The description of a database and the contents are database tables and therefore can be queries on-line via the data language.

Benefits :    The DBA’s productivity is improved since changes and additions to the catalog can be done with the same commands that are used to access any other table.  All queries and reports can be done as with other tables.

Rule #5 :     A Single Language is Used to Communicate with the Database Management System (Comprehensive data sub-language rule)

Explanation :          There must be a single language that handles all communication with the database management system.
The language must support relational operations with respect to: data modification (i.e., SELECT, INSERT, UPDATE, DELETE), data definition (i.e., CREATE, ALTER, DROP) and administration (i.e., GRANT, REVOKE, DENY, BACKUP, RESTORE).
Structured Query Language (SQL) is the default standard for a relational database language.
SQL is a ‘nonprocedural’ or ‘declarative’ language; it allows users to express what they want from the RDBMS without specifying the details about where it's located or how to get it.

Summary :    A RDBMS may support several languages, but atleast one of them allows the user to do all of the following:
Defines table/view, query and update data, set integrity constraints, set authorization and define transactions.

Benefits :    User productivity is improved since there is just one approach that can be used for all database operations.  In a multi-user environment actions against a database performed by an application program is contained within a transaction.  The DBMS query language supports statements that indicate where a transaction begins and where it should end.

Rule #6 :     Provides Alternatives for Viewing Data (View updating rule)

Explanation :          A relational database must not be limited to source tables when presenting data to the user.
Views are virtual tables or abstractions of the source tables.
A view is an alternative way of looking at data from one or more tables.
A view definition does not duplicate data; a view is not a copy of the data in the source tables.
Once created, a view can be manipulated in the same way as a source table.
If you change data in a view, you are changing the underlying data in the source table (although there are limits on how data can be modified from a view).
Views allow the creation of ‘custom tables’ that are tailored to special needs.
EXAMPLE:  By not including the columns with sensitive information in a view definition, a view can be used to restrict a user’s access to the data.
Views can be used to simplify data access by ‘predefining’ complex joins; the concept is similar to that of a 'saved query'. 

Summary :    Any view that is theoretically updateable can be updated using the RDBMS.  A view is theoretically updateable if changes can be made to the tables that effect the desired changes in the view.

Benefits :    Data consistency is ensured since changes in the underlying tables are transmitted to the view they support.  Also logical data independence reduces maintenance costs.

Rule #7 :     Supports Set-Based or Relational Operations (High level insert, update and delete)

Explanation :          Rows are treated as sets for data manipulation operations  (SELECT, INSERT, UPDATE, DELETE).
A relational database must support basic relational algebra operations (selection, projection & join ) and set operations (union, intersection, division and difference).
Set operations and relational algebra are used to operate on 'relations' (tables) to produce other relations.
A database that supports only row-at-a-time (navigational) operations does not meet this requirement and is not considered 'relational'.

Summary :    The RDBMS supports insertion, updation and deletion at a table level.  For example in SQL, delete can be performed on a set of select records.

Benefits :    By using high-level insert, update and delete the RDBMS can improve performance by optimizing the path to be taken to execute the action.  This is especially important in distributed databases.  Additionally ease of use is improved since the commands act on a set of records.

Rule #8 :     Physical data independence

Explanation :          Applications that access data in a relational database must be unaffected by changes in the way the data is physically stored (i.e., the physical structure).
EXAMPLE: The code in an application that accesses data in a file-based database typically depends on the file format (e.g., the code references a  'phone number' field that is 10 characters wide, is preceded by the 'zip code' field, followed by the 'fax number' field...).  If the layout of the data in the file is changed, the application must also be changed.  In contrast, the storage and  access methods (physical) used in a relational database can change without affecting the user or application’s ability to work with the data.  The user still only sees tables (logical structure). 
An application that accesses data in a relational database contains only a basic definition of the data (data type and length); it does not need to know how the data is physically stored or accessed.

Summary :    The execution of adhoc requests and application programs is not affected by changes in the physical data access and storage methods.

Benefits :    DBA’s can make the changes to physical access and storage methods, which improve performance but do not require changes in application programs or adhoc requests.  This reduces maintenance costs.

Rule #9 :     Logical data independence

Explanation :          Logical independence means the relationships among tables can change without impairing the function of applications and adhoc queries.
The database schema or structure of tables and relationships (logical) can change without having to recreate the database or the applications that use it  

Summary :    Logical changes in tables and views such as adding/deleting columns or changing field lengths do not necessitate modifications in application programs or in the format of adhoc requests.

Benefits :    The databases can change and grow to reflect changes in reality without requiring user intervention or changes in applications.  This reduces maintenance costs.  For example adding an attribute to a base table should not affect the programs or the interactive commands that have no use for the new attribute.

Rule #10 :    Data Integrity is a Function of the DBMS (Integrity Independence)

Explanation :          In order to be considered relational, data integrity must be an internal function of the DBMS; not the application program.
Data integrity means the consistency and accuracy of the data in the database (i.e., keeping the garbage out of the database).
There are three types of data integrity: entity, domain, and referential.
Within the database, data integrity can be enforced procedurally or declaratively.
Declarative data integrity involves placing or ‘declaring’ constraints on columns.
Procedural data integrity is maintained through code (i.e., through stored procedures or triggers).

Summary :    Like table/view definitions, integrity constraints are stored in the on-line catalog and therefore can be changed without necessitating changes in application programs or in the format of adhoc requests.  Additionally integrity constraints cannot be bypassed.

Benefits :    Integrity constraints specific to a particular relational database, must be definable in the relational data sub-language and storable in the catalog.  Atleast entity integrity and referential integrity constraints must be supported.

Rule #11 :    Supports Distributed Operations (Distribution Independence)

Explanation :          Data in a relational database can be stored centrally or distributed.
Users can join data from tables on different servers (distributed queries) and from other relational databases (heterogeneous queries).
Data integrity must be maintained regardless of the number of copies of data and where it resides.

Summary :    Application programs and adhoc are not affected by changes in the distribution of the physical data.

Benefits :    Improved system reliability since application programs will work even if the programs and data are moved in different sites.

Rule #12 :    Data Integrity Cannot be Subverted (Non subversion rule)

Explanation :          There cannot be other paths into the database that subvert data integrity; in other words, you can't get in the 'back door' and change the data in such a manner as data integrity is violated.
The DBMS must prevent data from being modified by machine language intervention.

Summary :    If the RDBMS has language that accesses the information of a record at a time, this language cannot be used to by-pass the integrity constraints.  In order to adhere to his rule RDBMS must have an active catalog that contains the constraints must have logical data independence.

Benefits :    This is necessary for data integrity.



DATABASE NORMALIZATION
AND DESIGN TECHNIQUES

INTRODUCTION
The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date.

The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

First Normal Form eliminates repeating groups by putting each into a separate table and connecting them with a one-to-many relationship.

Second Normal Form eliminates functional dependencies on a partial key by putting the fields in a separate table from those that are dependent on the whole key.

Third Normal Form eliminates functional dependencies on non-key fields by putting them in a separate table.  At this stage, all non-key fields are dependent on the key, the whole key and nothing but the key.

Fourth Normal Form separates independent multi-valued facts stored in one table into separate tables.

Fifth Normal Form breaks out data redundancy that is not covered by any of the previous normal forms.

The first three rules of normalization were outlined by Dr. E.F. Codd in his 1972 paper, "Further Normalization of the Data Base Relational Model".  Other rules have since been theorized by later Set Theory and Relational Algebra mathematicians.

FIRST NORMAL FORM

Ø  Eliminate repeating groups in individual tables.
Ø  Create a separate table for each set of related data.
Ø  Identify each set of related data with a primary key.

First normal form [1] deals with the "shape" of a record type.

Under first normal form, all occurrences of a record type must contain the same number of fields.

First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.

SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.

Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key.  In addition, the record must satisfy first normal form.

We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee.  Thus the phrase "Y is a fact about X" signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.

Second Normal Form

Ø  Create separate tables for sets of values that apply to multiple records.
Ø  Relate these tables with a foreign key.

Second normal form is violated when a non-key field is a fact about a subset of a key.  It is only relevant when the key is composite, i.e., consists of several fields.  Consider the following inventory record:

PART
WAREHOUSE
QUANTITY
WAREHOUSE-ADDRESS





The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone.  The basic problems with this design are:

The warehouse address is repeated in every record that refers to a part stored in that warehouse.

If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.
Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.

If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.

To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:

PART
WAREHOUSE
QUANTITY

WAREHOUSE
WAREHOUSE-ADDRESS







When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization.  The term "normalization" is sometimes used relative to a particular normal form.  Thus a set of records may be normalized with respect to second normal form but not with respect to third.

The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications.  Consider an application that wants the addresses of all warehouses stocking a certain part.  In the unnormalized form, the application searches one record type.  With the normalized design, the application has to search two record types, and connect the appropriate pairs.

Third Normal Form

Ø  Eliminate fields that do not depend on the key.

Third normal form is violated when a non-key field is a fact about another non-key field, as in
EMPLOYEE
DEPARTMENT
LOCATION




The EMPLOYEE field is the key.  If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:

The department's location is repeated in the record of every employee assigned to that department.

If the location of the department changes, every such record must be updated.

Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.

If a department has no employees, there may be no record in which to keep the department's location.

To satisfy third normal form, the record shown above should be decomposed into the two records:

EMPLOYEE
DEPARTMENT

DEPARTMENT
LOCATION






To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.

Functional Dependencies

Minimal Sets of Functional Dependencies:
Ø  All right hand sides have single attributes
Ø  If a functional dependency is removed the resulting set is not the same.
Ø  If the left hand side is reduced the resulting set is not the same.

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts.  A field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values.  That is, a given X-value must always occur with the same Y-value.  When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can't be two records having the same X value.

There is a slight technical difference between functional dependencies and single-valued facts as we have presented them.  Functional dependencies only exist when the things involved have unique and singular identifiers (representations).   For example, suppose a person's address is a single-valued fact, i.e., a person has only one address. If we don't provide unique identifiers for people, then there will not be a functional dependency in the data:

PERSON
ADDRESS
John Smith
123 Main St., New York
John Smith
321 Center St., San Francisco

Although each person has a unique address, a given name can appear with several different addresses.  Hence we do not have a functional dependency corresponding to our single-valued fact.

Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency.  In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.



PERSON
ADDRESS
John Smith
123 Main St., New York
John Smith
123 Main Street, NYC

We are not defending the use of non-unique or non-singular representations.  Such practices often lead to data maintenance problems of their own.  We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers.  Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.

For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.

EMPLOYEE
FATHER
FATHER’S ADDRESS
Art Smith
John Smith
123 Main St., New York
Bob Smith
John Smith
123 Main Street, NYC
Cal Smith
John Smith
321 Center St., San Francisco

However, in formal terms, there is no functional dependency here between FATHER'S-ADDRESS and FATHER, and hence no violation of third normal form.

FOURTH AND FIFTH NORMAL FORMS

Fourth [5] and fifth [6] normal forms deal with multi-valued facts.  The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee).  By "many-to-many" we mean that an employee may have several skills, and a skill may belong to several employees.

Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.

In a sense, fourth and fifth normal forms are also about composite keys.  These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.

Fourth Normal Form

Ø  Independent entities can not be stored in the same table

Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity.  In addition, the record must satisfy third normal form.

The term "independent" will be discussed after considering an example.

Consider employees, skills, and languages, where an employee may have several skills and several languages.  We have here two many-to-many relationships, one between employees and skills, and one between employees and languages.  Under fourth normal form, these two relationships should not be represented in a single record such as

EMPLOYEE
SKILL
LANGUAGE




Instead, they should be represented in the two records

EMPLOYEE
SKILL

EMPLOYEE
LANGUAGE






Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.

The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies.  Several policies are possible for maintaining two independent multi-valued facts in one record:

(1) A disjoint format, in which a record contains either a skill or a language, but not both:

EMPLOYEE
SKILL
LANGUAGE
Smith
Cook

Smith
Type

Smith

French
Smith

Greek
Smith

German

This is not much different from maintaining two separate record types.  (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.)

(2) A random mix, with three variations:

(a) Minimal number of records, with repetitions:

EMPLOYEE
SKILL
LANGUAGE
Smith
Cook
French
Smith
Type
Greek
Smith
Type
German

(b) Minimal number of records, with null values:
EMPLOYEE
SKILL
LANGUAGE
Smith
Cook
French
Smith
Type
Greek
Smith

German

 (c) Unrestricted:
EMPLOYEE
SKILL
LANGUAGE
Smith
Cook
French
Smith
Type

Smith

German
Smith
Type
Greek

(3) A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages:

EMPLOYEE
SKILL
LANGUAGE
Smith
Cook
French
Smith
Cook
Greek
Smith
Cook
German
Smith
Type
French
Smith
Type
Greek
Smith
Type
German

Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form.  They take different variations depending on the chosen maintenance policy:

If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent. 

Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.
Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted.

Fourth normal form minimizes such update problems.

Independence
We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example.  The two many-to-many relationships, employee:skill and employee:language, are "independent" in that there is no direct connection between skills and languages.  There is only an indirect connection because they belong to some common employee.  That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information.  That's precisely why all the maintenance policies mentioned earlier can be allowed.

In contrast, suppose that an employee could only exercise certain skills in certain languages.  Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek.  Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies.  In the present case, only the following form is correct:

EMPLOYEE
SKILL
LANGUAGE
Smith
Cook
French
Smith
Type
French
Smith
Type
Greek
Smith
Type
German

Thus the employee:skill and employee:language relationships are no longer independent.  These records do not violate fourth normal form.  When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.

Multivalued Dependencies
For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts.  Multivalued dependencies, in turn, are defined essentially as relationships which accept the "cross-product" maintenance policy mentioned above.  That is, for our example, every one of an employee's skills must appear paired with every one of his languages.  It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings.  Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language.  If all pairings are always present, then the relationships are really independent.

We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields.  For example, suppose we extend the earlier example to include projects, in the following sense:

An employee uses certain skills on certain projects.

An employee uses certain languages on certain projects.

If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project.  A record including employee, project, skill, and language would violate fourth normal form.  Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.



Fifth Normal Form

Ø  The original table must be reconstructed from the tables into which it has been broken down.
Ø  The benefit of applying this rule ensures you have not created any extraneous columns in your tables, and that all of the table structures you have created are only as large as they need to be.

Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy.  Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.

We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products.  If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company.  This information could be kept in one record type with three fields:

AGENT
COMPANY
PRODUCT
Smith
Ford
Car
Smith
GM
Truck

This form is necessary in the general case.  For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars.  Thus we need the combination of three fields to know which combinations are valid and which are not.

But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.
AGENT
COMPANY
PRODUCT
Smith
Ford
Car
Smith
Ford
Truck
Smith
GM
Car
Smith
GM
Truck
Jones
Ford
Car

In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:

AGENT
COMPANY

COMPANY
PRODUCT

AGENT
PRODUCT
Smith
Ford

Ford
Car

Smith
Car
Smith
GM

Ford
Truck

Smith
Truck
Jones
Ford

GM
Car

Jones
Car



GM
Truck




These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.

Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record.  The case where all the smaller records have the same key is excluded.  If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition.  A record type in fifth normal form is also in fourth, third, second, and first normal forms.

Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products.  In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.

One advantage of fifth normal form is that certain redundancies can be eliminated.  In the normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized form it may be repeated many times.

It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences.  This is not apparent when there are only a few facts to record, as in the example shown above.  The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the unnormalized file increases in a multiplicative fashion.  For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the unnormalized form.

It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks.  But we can't determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.

The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form.  Any two of the record types taken alone will imply something untrue.

AGENT
COMPANY
PRODUCT
Smith
Ford
Car
Smith
Ford
Truck
Smith
GM
Car
Smith
GM
Truck
Jones
Ford
Car
Jones
Ford
Truck
Brown
Ford
Car
Brown
GM
Car
Brown
Toyota
Car
Brown
Toyota
Bus

AGENT
COMPANY

COMPANY
PRODUCT

AGENT
PRODUCT
Smith
Ford

Ford
Car

Smith
Car
Smith
GM

Ford
Truck

Smith
Truck
Jones
Ford

GM
Car

Jones
Car
Brown
Ford

GM
Truck

Jones
Truck
Brown
GM

Toyota
Car

Brown
Car
Brown
Toyota

Toyota
Bus

Brown
Bus

Observe that:
·  Jones sells cars and GM makes cars, but Jones does not represent GM.
·  Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
·  Brown represents Ford and Brown sells buses, but Ford does not make buses.

Fourth and fifth normal forms both deal with combinations of multivalued facts.  One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pairwise groups.  We can best explain this in terms of the normalization process implied by fourth normal form.  If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of these violating fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form.  At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.

In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record.  All three of the smaller records are needed.  Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form.  Fifth normal form is needed in order to deal with the redundancies in this case.

UNAVOIDABLE REDUNDANCIES
Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent.  In the example, it seems unavoidable that we record the fact that "Smith can type" several times.  Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that "Smith sells cars" several times.

INTER-RECORD REDUNDANCY
The normal forms discussed here deal only with redundancies occurring within a single record type.  Fifth normal form is considered to be the "ultimate" normal form with respect to such redundancies.

Other redundancies can occur across multiple record types.  For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy:

EMPLOYEE
DEPARTMENT

DEPARTMENT
LOCATION






EMPLOYEE
LOCATION



In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy.

Inter-record redundancy has been recognized for some time [1], and has recently been addressed in terms of normal forms and normalization [8].


DE-NORMALIZATION

          When de-normalizing, the size of row has to be taken into account.  If the size of the row is very large, very few rows fit into a single block on disk.  Thus retrieval will be slow as large number of blocks are to be retrieved to get the required row.

Breaking the Rules:
When to Denormalize, Sometimes it's necessary to break the rules of normalization and create a database that is deliberately less normal than it otherwise could be. You'll usually do this for performance reasons or because the users of the database demand it. While this won't get you any points with database design purists, ultimately you have to deliver a solution that satisfies your users.  If you do break the rules, however, and decide to denormalize you database, it's important that you follow these guidelines:

Break the rules deliberately; have a good reason for denormalizing.

Be fully aware of the tradeoffs this decision entails.
Thoroughly document this decision.

Create the necessary application adjustments to avoid anomalies.

This last point is worth elaborating on.  In most cases, when you denormalize, you will be required to create additional application code to avoid insertion, update, and deletion anomalies that a more normalized design would avoid.  For example, if you decide to store a calculation in a table, you'll need to create extra event procedure code and attach it to the appropriate event properties of forms that are used to update the data on which the calculation is based.

If you're considering denormalizing for performance reasons, don't always assume that the denormalized approach is the best. Instead, first fully normalize the database (to Third Normal Form or higher) and then denormalize only if it becomes necessary for reasons of performance.

If you're considering denormalizing because your users think they need it, investigate why.  Often they will be concerned about simplifying data entry, which you can usually accomplish by basing forms on queries while keeping your base tables fully normalized.

REFERENCE
The first three rules of normalization were outlined by Dr. E.F. Codd in his 1972 paper, "Further Normalization of the Data Base Relational Model". Other rules have since been theorized by later Set Theory and Relational Algebra mathematicians.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.