Introduction:
The term 'database' has many
interpretations; one definition is a 'collection of persistent data'. A relational database is one in which the
data consists of a 'collection of tables related to each other through common
values'.
What is Relational Database?
Definition:
A relational database is a collection of data where:
Ø Logically
related data is grouped into tables within the database
Ø Tables
within a relational -database hold information that is of value to an
organization.
Ø Each
table contains individual data items
Characteristics:
A relational database has the following characteristics:
Ø Its
components are conceptually simple
Ø It
has a limited number of components or objects
Ø It
has well-defined data operations that are based on a sound theoretical
foundation
Ø Its
data design and definition processes are simple
Ø Installation
and operation of a relational database implementation are simple
Ø A
relationship is a logical link between two tables.
Ø A
relational database management system (RDBMS) uses matching values in multiple
tables to relate the information in one table with the information in the other
table.
Ø The
presentation of data as tables is a logical construct, it is independent of the
way the data is physically stored on disk.
The two most prominent
characteristics of a relational database are:
1) data
stored in tables and
2) relationships
between tables
Basic Terms:
There are three fundamental relational terms:
TABLE-is the same familiar and
easy-to-use data structure you see every day in newspapers, telephone books,
schedules, etc.
A table:
Ø Is
a list of related data (an entity or a relation)
Ø Is
comprised of one or more column(s)
Ø Is
organized into rows
Ø Cannot
have duplicate rows (per relational theory)
Ø COLUMN-describes
a particular aspect of some entity such as a student or a course.
A column (also called as a field or attribute):
Ø Characteristics
of an item
Ø Identifies
a specific data type that dictates -what you, can and cannot do with the-column
Ø Gives
the data a name
Ø Determines
the relative position of the data in the table
Ø Is
analogous to a data item, element or field (in a non- relational database
system)
Ø Is
accessed by its name, which is unique within the table
Ø ROW-is
a set of columns and their values in a particular table.
A row (also called as a record or tuple):
Ø Information
about a separate item
Ø Contains
only one occurrence of each column in the table
Ø Is
analogous to a single record in a file of many records
Ø Is
the smallest unit of insertion or deletion in a table
Ø Has
no inherent order in a table
Ø Must
be unique within a table
Summary
In a relational database, references are to:
Ø Tables,
rather than files
Ø Colunms,
rather than data elements or fields
Ø Rows,
rather than record occurrences
Ø Data
that is logically related
Biggest difference:
Ø We
select only the columns we want, not the whole row
Ø We
select just the rows we want, not all the records
Ø We
update just the column(s) we need to, not the whole record
More Terms
There are several additional
important terms used in the relational database environment:
1. Domain
2. Result
Set
3. View
4. Null
Domain
A domain is the set of all possible values for a given
column. A domain:
Ø Is
not a physical part of a table or database
Ø Is
a conceptual object
Ø Includes
all possible values that may ever occur, even if the value is not currently in
a table
Ø May
be shared by multiple columns
Ø May
be restricted based upon the data, type associated with the column (numeric
versus character).
Result Set
A result set is:
Ø The
group of values returned from one query against the database
View
A view is a logical (or virtual)
table that is derived from one or more tables.
It does not physically exist.
A view can:
Ø Be
made up of selected columns from a table
Ø Be
made up of selected rows from a table
Ø Include
data from more than one table
Ø Include
calculated fields (that is, derived data fields that are created as a result of
a request and not part of an actual table)
Ø Be
created from other views
Ø Be
thought of as a subset of a table
Ø Be
used as a table
Ø Be
all of the above
Views are used to:
Ø Simplify
data retrieval commands by accessing a smaller and simpler version of a table
Ø Limit
access to certain data for a particular user or application program
NULLs
A null means that no value is
currently assigned to a particular column in a specific row in a table.
Ø May
exist because the value of the column may not be known at the time the row is
created
Ø May
exist because a value just does not apply for this row
Basic Table Rules
Tables contain data that is
organized into rows. The rows have one value for each column.
A relational table can be
compared to a file.
Relational tables follow these basic rules:
Ø All
data is represented by values, not pointers or physical connectors.
Ø Each
column in a given row has only one value.
Ø No
duplicate rows occur in a table.
Ø No
duplicate column names exist in a table.
Ø The
order of the rows in the table is not significant.
Ø The
order of the colunms in the table is not significant.
Ø All
relationships exist through values.
Relational Operations
Relational operations can modify
an existing table or create a new version of the table when the result set of
values is returned.
This result set can then be
further manipulated as a table itself.
The relational operations that
are integral in the relational database model are:
1. Selection
2. Projection
3. Join
4. Insertion
5. Deletion
6. Update
7. Ordering
8. Set
Operations
Selection
Selection is the process of
obtaining a subset of rows from a given table.
The effect of a selection is
called a results table or result set.
Projection
A projection is a subset of
columns from a given table.
A selection and projection
results in a subset of both rows and columns.
Join
The join operation takes input
from two or more tables to create a new results table.
This operation matches a value in
common columns in the joined tables.
The results table contains rows
and columns from both tables.
It is the concatenation of the
projected columns from the two matching rows.
Joins fall into a variety of categories:
Ø A
natural join or equi-join is a join on an equal condition. Joins on other
conditions are permitted but are "unnatural."
Ø A
join of a table to itself (self-join) is permitted.
Ø An
outer join is a join that accounts for NULLs or rows missing column matching
values.
Ø A
full join provides all rows from each table. When there is an equal join
between the tables, the output will consist of rows with matching values. Also
selected will be rows that do not have matching values from either joined
table.
Insertion
The insertion operation adds a
new row to a table.
The position of the new row has
no significance.
The new row may or may not
contain columns with NULLS, but should always contain at least the primary key
(which is not null).
Deletion
The deletion operation removes
zero, one or more rows from a table.
The number of rows deleted
depends upon the search criteria specified as part of the DELETE
statement. It is not necessary to SELECT
the rows before deletion.
Update
The update operation allows
individual column values in zero, one or many rows to be modified. The number of rows to be UPDATED will depend
upon the search criteria provided.
The update operation, just as the
other table operations, applies to a specified set of rows. The update operation does not require a row
to be SELECTed before a column or columns can be updated.
Explicit search criteria can be
used to designate a single target row.
Ordering
The ordering operation enables
users to see the results of a query presented in a particular sequence. The sequence can be ascending or descending
by column.
Table rows in a relational
database model do not have a significant order.
Set Operations
A set is the group of values
returned from one access of the database.
Three kinds of operations can be performed on a set:
1. Intersection
2. Difference
3. Union
Intersection
The intersection of two tables
yields a results table containing those rows which appear in both tables.
Difference
The difference of two tables is
the set of rows from the first table that do not occur in the second
table. This is also known as mutual
exclusivity. The order of the tables is
significant.
Union
A union of two or more tables
yields a results table containing all the rows selected from both tables. Duplicate rows may optionally be eliminated
in the results table.
Integrity Rules
The relational model defines
several integrity rules that, while not part of the definition of the Normal
Forms are nonetheless a necessary part of any relational database. There are
two types of integrity rules: general and database-specific.
General Integrity Rules
The relational model specifies
two general integrity rules. They are referred to as general rules, because
they apply to all databases.
They are:
1. entity
integrity and
2. referential
integrity.
Entity Integrity
The entity integrity rule is very
simple. It says that primary keys cannot contain null (missing) data. The
reason for this rule should be obvious. You can't uniquely identify or
reference a row in a table, if the primary key of that table can be null. It's
important to note that this rule applies to both simple and composite keys. For
composite keys, none of the individual columns can be null.
Referential Integrity
In a relational database, we
often need to get columns out of more than one table. The operation we use to do this is called a
join. The vast majority of the time, a join is done on an equal condition of
the primary key of one table to the foreign key of another.
The referential integrity rule
says that the database must not contain any unmatched foreign key values. This
implies that:
A row may not be added to a table
with a foreign key unless the referenced value exists in the referenced table.
If the value in a table that's
referenced by a foreign key is changed (or the entire row is deleted), the rows
in the table with the foreign key must not be "orphaned."
In general, there are three
options available when a referenced primary key value changes or a row is
deleted.
The options are:
1. Disallow
2. Cascade
3. Nullify
Disallow
The change is completely
disallowed.
Cascade
For updates, the change is
cascaded to all dependent tables.
For deletions, the rows in all
dependent tables are deleted.
Nullify
For deletions, the dependent
foreign key values are set to Null.
Database-Specific Integrity Rules
All integrity constraints that do
not fall under entity integrity or referential integrity are termed
database-specific rules or business rules. These type of rules are specific to
each database and come from the rules of the business being modeled by the
database. It is important to note that the enforcement of business rules is as
important as the enforcement of the general integrity rules.
Keys
A key is a column or a
combination of columns that distinguishes a row from all other rows.
Keys can be used for:
Ø Accessing
data
Ø Establishing
uniqueness of data
Ø Sequencing
the data
Keys are somewhat transparent to
the user. You do not need to know the
key of a given row to access that row.
You will need to know the primary key if you want a unique occurrence.
In database design, there are four types of keys:
1. Candidate
keys
2. Primary
keys
3. Alternate
keys
4. Foreign
keys
Candidate keys
One or more columns in a table
that may possibly serve as a primary key.
Primary keys
Stable, unique values. One of the
candidate keys is chosen as the primary key.
Alternate keys
The candidate keys that are not
chosen as the primary key. Alternate keys represent other ways that a user
might need to search for data.
Foreign keys
When a primary key is also a
column in another table, it is called a foreign key column in that other table.
What Is Referential Integrity?
There is a set of 'Business'
rules that must be applied to each relationship which determines the status of
the foreign key columns for rows when inserted or updated.
It is not sufficient to identify
primary and foreign keys that make up relationships between tables. The integrity and usability of a relationship
are greatly affected by the values contained within the foreign key column(s).
In general, a foreign key must
always contain either a value within the domain of foreign key values (values
currently in the primary key column), or nulls.
Guaranteeing that an acceptable value is always in the foreign key
column is known as referential integrity.
When using system-managed
referential integrity, the primary key must be unique and have a unique index
or the table is flagged as incomplete.
FEATURES OF A RDBMS
The following are the some of the features that can be
expected from a RDBMS:
Ø The
ability to create multiple relations (tables) and enter data into them.
Ø The
ability to logically (index) or physically (sort) reorder the data in the
tables.
Ø An
interactive query language.
Ø The
ability to answer queries that require data stored in more than one table.
Ø A
built-in report generator.
Additionally some packages may provide:
Ø A
text editor for use in writing an application program.
Ø An
application generator.
Ø An
interpreter to run the application program.
Ø A
screen generator to create custom screen form.
Popular Relational Database Products:
Ø Although
IBM did most of the research, Oracle delivered the first commercial relational
database in 1979.
Ø IBM
delivered their first product,' SQL/Data System,' in 1982
Ø Microsoft
originally worked in partnership with Sybase to deliver SQL Server 4.2 in 1992.
Ø In
1993 the partnership broke up with Sybase going after the Unix market and
Microsoft pursuing the Windows NT market
Ø Microsoft
SQL Server 6.0 was released in 1995; version 6.5 shipped in 1996.
Ø Microsoft
SQL Server 7.0 shipped in November of 1998.
The following is a list of vendors and their RDBMS
products:
Ø Microsoft's
SQL Server 7.0
Ø Oracle's
Oracle 8
Ø Sybase's
Sybase SQL Server
Ø IBM's
DB2
Ø Informix
Software's Informix-SQL
Ø Computer
Associates' (formerly ASK Group) Ingres
Ø Centura
Software's (formerly Gupta) SQLBase
Ø Borland's
Database Engine
Ø Interbase's
(formerly Borland) Interbase
Ø T.c.X's
mySQL
E. F. CODD’S RULE AND
RELATIONAL DATABASE THEORY
A relational database is more
than just data organized into related tables.
The relational database model is
based firmly in the mathematical theory of relational algebra and calculus.
The original concept for the
model was proposed by Dr. E.F. Codd in a 1970 paper entitled ‘A Relational
Model of Data for Large Shared Data Banks’ .
Later Dr. Codd clarified his
model by defining twelve rules (Codd’s Rules) that a database management system
(DBMS) must meet inn order to be considered a relational database.
In practice, many database
products are considered 'relational' even if they do not strictly adhere to all
12 rules.
A summary of Dr. E. F. Codd’s 12 rules is presented below:
Rule #1 : Data
is Presented in Tables (Information Rule)
Explanation : A set of related tables forms a
database and all data is represented as tables; the data can be viewed in no
other way.
A table (relation or entity) is
a logical grouping of related data in tabular form (rows and columns).
Each row (record or tuple)
describes an item (person, place or thing) and each row contains information
about a single item in the table.
Each column (field or attribute)
describes a single characteristic about an item.
Each value (datum) is defined by
the intersection of a row and column.
Data is atomic; there is no more
than one value associated with the intersection of a row and column.
There is no hierarchical ranking
of tables.
The relationships among tables
are logical; there are no physical relationships among tables.
Summary : All information in a relational database
including table names, column names are represented explicitly by values in
tables.
Benefits : This simple view of data speeds design and
learning. User productivity is improved
since knowledge of only one language is necessary to access all data such as
description of the table and attribute definitions, integrity constraints,
action to be taken when constraints are violated and security information. These information’s are stored in the form of
tables containing rows and columns.
Rule #2 : Data
is Logically Accessible (Guaranteed access rule)
Explanation
:A relational database does not reference data by physical location; there is no such thing as the ‘fifth row in the customers
table'.
Each piece of data must be
logically accessible by referencing 1) a table; 2) a primary or unique key
value; and 3) a column.
EXAMPLE: A specific employee in
the 'Employee' table (e.g., Kapil Dev) and related information (last name,
first name, ID, phone number, salary, etc) constitute a row. An employee’s last name is a column. The name ‘Dev’ is a data value. Kapil Dev’s last name can be precisely
located by referencing the ‘employee’ table, the appropriate column (last_name)
and a unique key value (employee_id)
Summary : Every piece of data in a relational database
can be accessed by using combination of a table name, a primary key value that identifies
the row and a column name that identifies the cell.
Benefits : User productivity is improved since there is
no need to resort to using physical pointers or addresses. Provides data independence. Possible to retrieve each individual piece of
data stored in a relational database by specifying the name of the table in
which it is stored, the column key which identifies the row in which it is
stored.
Rule #3
: Nulls are Treated Uniformly As
Unknown (Systematic treatment of nulls rule)
Explanation : Null must always be interpreted as an
unknown value.
Null means no value has been
entered; the value is not known 'Unknown' is not the same thing as an empty
string ("") or zero.
EXAMPLE: If you pick up an item in a store and the
price is not marked, the price is unknown (NULL); it is not free. If not handled properly, nulls can cause
confusion in your database.
EXAMPLE: If you search for all
of the authors whose home state is not Tamil Nadu, the results will not include
any authors with NULL in the 'state'
column. SQL is very literal. You asked for authors where the state was NOT
Tamil Nadu and NULL means 'unknown'. A
NULL value for 'state' may mean that the state is Tamil Nadu and it may mean
that it is not Tamil Nadu; you just don't know.
Because the database engine can't tell for sure whether the state is not
Tamil Nadu, a record with NULL will not be returned.
Nulls propagate through
arithmetic expressions (e.g., 2 + NULL = NULL).
Comparing a null to any value, including itself, returns NULL.
Summary : The RDBMS handles records that have unknown
or inapplicable values in a predefined fashion.
Also the RDBMS distinguishes between zeros, blanks and nulls in records
and handles such values in a consistent manner that produces correct answers,
comparisons and calculation.
Benefits : Through the set of rules for handling nulls,
users can distinguish results of queries that involves nulls, zeros and
blanks. It may not be apparent why th
handling of null is important.
Rule #4 : Active on line catalog used on relational
model (Database is Self-Describing)
Explanation : In addition to user data, a relational
database contains data about itself.
There are two types of tables in
a RDBMS: user tables that contain the 'working' data and system tables contain
data about the database structure.
Metadata is data that describes
the structure of the database itself and includes object definitions (tables,
indexes, stored procedures, etc.) and how they relate to each other.
The collection of system tables
is also referred to as the system
catalog or data dictionary.
System tables can be accessed in
the same manner as user tables.
Summary : The description of a database and the
contents are database tables and therefore can be queries on-line via the data
language.
Benefits : The DBA’s productivity is improved since
changes and additions to the catalog can be done with the same commands that
are used to access any other table. All
queries and reports can be done as with other tables.
Rule #5
: A Single Language is Used to
Communicate with the Database Management System (Comprehensive data
sub-language rule)
Explanation : There must be a single language that
handles all communication with the database management system.
The language must support
relational operations with respect to: data modification (i.e., SELECT, INSERT,
UPDATE, DELETE), data definition (i.e., CREATE, ALTER, DROP) and administration
(i.e., GRANT, REVOKE, DENY, BACKUP, RESTORE).
Structured Query Language (SQL)
is the default standard for a relational database language.
SQL is a ‘nonprocedural’ or
‘declarative’ language; it allows users to express what they want from the
RDBMS without specifying the details about where it's located or how to get it.
Summary : A RDBMS may support several languages, but
atleast one of them allows the user to do all of the following:
Defines table/view, query and
update data, set integrity constraints, set authorization and define
transactions.
Benefits : User productivity is improved since there is
just one approach that can be used for all database operations. In a multi-user environment actions against a
database performed by an application program is contained within a transaction. The DBMS query language supports statements
that indicate where a transaction begins and where it should end.
Rule #6
: Provides Alternatives for Viewing
Data (View updating rule)
Explanation : A relational database must not be
limited to source tables when presenting data to the user.
Views are virtual tables or
abstractions of the source tables.
A view is an alternative way of
looking at data from one or more tables.
A view definition does not
duplicate data; a view is not a copy of the data in the source tables.
Once created, a view can be
manipulated in the same way as a source table.
If you change data in a view,
you are changing the underlying data in the source table (although there are
limits on how data can be modified from a view).
Views allow the creation of
‘custom tables’ that are tailored to special needs.
EXAMPLE: By not including the columns with sensitive
information in a view definition, a view can be used to restrict a user’s
access to the data.
Views can be used to simplify
data access by ‘predefining’ complex joins; the concept is similar to that of a
'saved query'.
Summary : Any view that is theoretically updateable
can be updated using the RDBMS. A view
is theoretically updateable if changes can be made to the tables that effect
the desired changes in the view.
Benefits : Data consistency is ensured since changes in
the underlying tables are transmitted to the view they support. Also logical data independence reduces
maintenance costs.
Rule #7 : Supports Set-Based or Relational Operations
(High level insert, update and delete)
Explanation : Rows are treated as sets for data
manipulation operations (SELECT, INSERT,
UPDATE, DELETE).
A relational database must
support basic relational algebra operations (selection, projection & join )
and set operations (union, intersection, division and difference).
Set operations and relational
algebra are used to operate on 'relations' (tables) to produce other relations.
A database that supports only
row-at-a-time (navigational) operations does not meet this requirement and is
not considered 'relational'.
Summary : The RDBMS supports insertion, updation and
deletion at a table level. For example
in SQL, delete can be performed on a set of select records.
Benefits : By using high-level insert, update and
delete the RDBMS can improve performance by optimizing the path to be taken to
execute the action. This is especially
important in distributed databases.
Additionally ease of use is improved since the commands act on a set of
records.
Rule #8 : Physical
data independence
Explanation : Applications that access data in a
relational database must be unaffected by changes in the way the data is
physically stored (i.e., the physical structure).
EXAMPLE: The code in an
application that accesses data in a file-based database typically depends on
the file format (e.g., the code references a
'phone number' field that is 10 characters wide, is preceded by the 'zip
code' field, followed by the 'fax number' field...). If the layout of the data in the file is
changed, the application must also be changed.
In contrast, the storage and
access methods (physical) used in a relational database can change
without affecting the user or application’s ability to work with the data. The user still only sees tables (logical
structure).
An application that accesses
data in a relational database contains only a basic definition of the data
(data type and length); it does not need to know how the data is physically
stored or accessed.
Summary : The execution of adhoc requests and
application programs is not affected by changes in the physical data access and
storage methods.
Benefits : DBA’s can make the changes to physical
access and storage methods, which improve performance but do not require
changes in application programs or adhoc requests. This reduces maintenance costs.
Rule #9 : Logical
data independence
Explanation : Logical independence means the
relationships among tables can change without impairing the function of
applications and adhoc queries.
The database schema or structure
of tables and relationships (logical) can change without having to recreate the
database or the applications that use it
Summary : Logical changes in tables and views such as
adding/deleting columns or changing field lengths do not necessitate
modifications in application programs or in the format of adhoc requests.
Benefits : The databases can change and grow to reflect
changes in reality without requiring user intervention or changes in
applications. This reduces maintenance
costs. For example adding an attribute
to a base table should not affect the programs or the interactive commands that
have no use for the new attribute.
Rule #10 : Data Integrity is a Function of the DBMS
(Integrity Independence)
Explanation : In order to be considered relational,
data integrity must be an internal function of the DBMS; not the application
program.
Data integrity means the
consistency and accuracy of the data in the database (i.e., keeping the garbage
out of the database).
There are three types of data
integrity: entity, domain, and referential.
Within the database, data
integrity can be enforced procedurally or declaratively.
Declarative data integrity
involves placing or ‘declaring’ constraints on columns.
Procedural data integrity is
maintained through code (i.e., through stored procedures or triggers).
Summary : Like table/view definitions, integrity
constraints are stored in the on-line catalog and therefore can be changed
without necessitating changes in application programs or in the format of adhoc
requests. Additionally integrity
constraints cannot be bypassed.
Benefits : Integrity constraints specific to a
particular relational database, must be definable in the relational data
sub-language and storable in the catalog.
Atleast entity integrity and referential integrity constraints must be
supported.
Rule #11 : Supports Distributed Operations
(Distribution Independence)
Explanation : Data in a relational database can be
stored centrally or distributed.
Users can join data from tables
on different servers (distributed queries) and from other relational databases
(heterogeneous queries).
Data integrity must be
maintained regardless of the number of copies of data and where it resides.
Summary : Application programs and adhoc are not
affected by changes in the distribution of the physical data.
Benefits : Improved system reliability since
application programs will work even if the programs and data are moved in
different sites.
Rule #12 : Data Integrity Cannot be Subverted (Non
subversion rule)
Explanation : There cannot be other paths into the
database that subvert data integrity; in other words, you can't get in the
'back door' and change the data in such a manner as data integrity is violated.
The DBMS must prevent data from
being modified by machine language intervention.
Summary : If the RDBMS has language that accesses the
information of a record at a time, this language cannot be used to by-pass the
integrity constraints. In order to
adhere to his rule RDBMS must have an active catalog that contains the
constraints must have logical data independence.
Benefits : This is necessary for data integrity.
DATABASE NORMALIZATION
AND DESIGN TECHNIQUES
INTRODUCTION
The normal forms defined in
relational database theory represent guidelines for record design. The
guidelines corresponding to first through fifth normal forms are presented
here, in terms that do not require an understanding of relational theory. The
design guidelines are meaningful even if one is not using a relational database
system. We present the guidelines without referring to the concepts of the
relational model in order to emphasize their generality, and also to make them
easier to understand. Our presentation conveys an intuitive sense of the
intended constraints on record design, although in its informality it may be
imprecise in some technical details. A comprehensive treatment of the subject
is provided by Date.
The normalization rules are
designed to prevent update anomalies and data inconsistencies. With respect to
performance tradeoffs, these guidelines are biased toward the assumption that
all non-key fields will be updated frequently. They tend to penalize retrieval,
since data which may have been retrievable from one record in an unnormalized
design may have to be retrieved from several records in the normalized form.
There is no obligation to fully normalize all records when actual performance
requirements are taken into account.
First Normal Form
eliminates repeating groups by putting each into a separate table and connecting
them with a one-to-many relationship.
Second Normal Form
eliminates functional dependencies on a partial key by putting the fields in a
separate table from those that are dependent on the whole key.
Third Normal Form
eliminates functional dependencies on non-key fields by putting them in a
separate table. At this stage, all
non-key fields are dependent on the key, the whole key and nothing but the key.
Fourth Normal Form
separates independent multi-valued facts stored in one table into separate
tables.
Fifth Normal Form breaks
out data redundancy that is not covered by any of the previous normal forms.
The first three rules of
normalization were outlined by Dr. E.F. Codd in his 1972 paper, "Further
Normalization of the Data Base Relational Model". Other rules have since been theorized by
later Set Theory and Relational Algebra mathematicians.
FIRST NORMAL FORM
Ø Eliminate
repeating groups in individual tables.
Ø Create
a separate table for each set of related data.
Ø Identify
each set of related data with a primary key.
First normal form [1] deals with
the "shape" of a record type.
Under first normal form, all
occurrences of a record type must contain the same number of fields.
First normal form excludes
variable repeating fields and groups. This is not so much a design guideline as
a matter of definition. Relational database theory doesn't deal with records
having a variable number of fields.
SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2,
3, 7] deal with the relationship between non-key and key fields.
Under second and third normal
forms, a non-key field must provide a fact about the key, us the whole key, and
nothing but the key. In addition, the
record must satisfy first normal form.
We deal now only with "single-valued"
facts. The fact could be a one-to-many relationship, such as the department of
an employee, or a one-to-one relationship, such as the spouse of an
employee. Thus the phrase "Y is a
fact about X" signifies a one-to-one or one-to-many relationship between Y
and X. In the general case, Y might consist of one or more fields, and so might
X. In the following example, QUANTITY is a fact about the combination of PART
and WAREHOUSE.
Second Normal Form
Ø Create
separate tables for sets of values that apply to multiple records.
Ø Relate
these tables with a foreign key.
Second normal form is violated
when a non-key field is a fact about a subset of a key. It is only relevant when the key is
composite, i.e., consists of several fields.
Consider the following inventory record:
PART
|
WAREHOUSE
|
QUANTITY
|
WAREHOUSE-ADDRESS
|
The key here consists of the PART
and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the
WAREHOUSE alone. The basic problems with
this design are:
The warehouse address is repeated
in every record that refers to a part stored in that warehouse.
If the address of the warehouse
changes, every record referring to a part stored in that warehouse must be
updated.
Because of the redundancy, the data might become inconsistent,
with different records showing different addresses for the same warehouse.
If at some point in time there
are no parts stored in the warehouse, there may be no record in which to keep
the warehouse's address.
To satisfy second normal form, the
record shown above should be decomposed into (replaced by) the two records:
PART
|
WAREHOUSE
|
QUANTITY
|
WAREHOUSE
|
WAREHOUSE-ADDRESS
|
|
When a data design is changed in
this way, replacing unnormalized records with normalized records, the process is
referred to as normalization. The term
"normalization" is sometimes used relative to a particular normal
form. Thus a set of records may be
normalized with respect to second normal form but not with respect to third.
The normalized design enhances the
integrity of the data, by minimizing redundancy and inconsistency, but at some
possible performance cost for certain retrieval applications. Consider an application that wants the
addresses of all warehouses stocking a certain part. In the unnormalized form, the application
searches one record type. With the
normalized design, the application has to search two record types, and connect
the appropriate pairs.
Third Normal Form
Ø Eliminate
fields that do not depend on the key.
Third normal form is violated
when a non-key field is a fact about another non-key field, as in
EMPLOYEE
|
DEPARTMENT
|
LOCATION
|
The EMPLOYEE field is the
key. If each department is located in
one place, then the LOCATION field is a fact about the DEPARTMENT -- in
addition to being a fact about the EMPLOYEE. The problems with this design are
the same as those caused by violations of second normal form:
The department's location is
repeated in the record of every employee assigned to that department.
If the location of the department
changes, every such record must be updated.
Because of the redundancy, the
data might become inconsistent, with different records showing different
locations for the same department.
If a department has no employees,
there may be no record in which to keep the department's location.
To satisfy third normal form, the
record shown above should be decomposed into the two records:
EMPLOYEE
|
DEPARTMENT
|
DEPARTMENT
|
LOCATION
|
|
To summarize, a record is in
second and third normal forms if every field is either part of the key or
provides a (single-valued) fact about exactly the whole key and nothing else.
Functional Dependencies
Minimal Sets of Functional Dependencies:
Ø All
right hand sides have single attributes
Ø If
a functional dependency is removed the resulting set is not the same.
Ø If
the left hand side is reduced the resulting set is not the same.
In relational database theory,
second and third normal forms are defined in terms of functional dependencies,
which correspond approximately to our single-valued facts. A field Y is "functionally
dependent" on a field (or fields) X if it is invalid to have two records
with the same X-value but different Y-values.
That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by
definition functionally dependent on X in a trivial way, since there can't be
two records having the same X value.
There is a slight technical
difference between functional dependencies and single-valued facts as we have presented
them. Functional dependencies only exist
when the things involved have unique and singular identifiers
(representations). For example, suppose
a person's address is a single-valued fact, i.e., a person has only one
address. If we don't provide unique identifiers for people, then there will not
be a functional dependency in the data:
PERSON
|
ADDRESS
|
John Smith
|
123 Main St., New York
|
John Smith
|
321 Center St., San Francisco
|
Although each person has a unique
address, a given name can appear with several different addresses. Hence we do not have a functional dependency
corresponding to our single-valued fact.
Similarly, the address has to be
spelled identically in each occurrence in order to have a functional
dependency. In the following case the
same person appears to be living at two different addresses, again precluding a
functional dependency.
PERSON
|
ADDRESS
|
John Smith
|
123 Main St., New York
|
John Smith
|
123 Main Street, NYC
|
We are not defending the use of
non-unique or non-singular representations.
Such practices often lead to data maintenance problems of their
own. We do wish to point out, however,
that functional dependencies and the various normal forms are really only
defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them
are a bit stronger than those implied by the formal definitions of the normal
forms.
For instance, we as designers
know that in the following example there is a single-valued fact about a non-key
field, and hence the design is susceptible to all the update anomalies
mentioned earlier.
EMPLOYEE
|
FATHER
|
FATHER’S ADDRESS
|
Art Smith
|
John Smith
|
123 Main St., New York
|
Bob Smith
|
John Smith
|
123 Main Street, NYC
|
Cal Smith
|
John Smith
|
321 Center St., San Francisco
|
However, in formal terms, there
is no functional dependency here between FATHER'S-ADDRESS and FATHER, and hence
no violation of third normal form.
FOURTH AND FIFTH NORMAL FORMS
Fourth [5] and fifth [6] normal
forms deal with multi-valued facts. The
multi-valued fact may correspond to a many-to-many relationship, as with
employees and skills, or to a many-to-one relationship, as with the children of
an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee
may have several skills, and a skill may belong to several employees.
Note that we look at the
many-to-one relationship between children and fathers as a single-valued fact
about a child but a multi-valued fact about a father.
In a sense, fourth and fifth
normal forms are also about composite keys.
These normal forms attempt to minimize the number of fields involved in
a composite key, as suggested by the examples to follow.
Fourth Normal Form
Ø Independent
entities can not be stored in the same table
Under fourth normal form, a
record type should not contain two or more independent multi-valued facts about
an entity. In addition, the record must
satisfy third normal form.
The term "independent"
will be discussed after considering an example.
Consider employees, skills, and
languages, where an employee may have several skills and several
languages. We have here two many-to-many
relationships, one between employees and skills, and one between employees and
languages. Under fourth normal form,
these two relationships should not be represented in a single record such as
EMPLOYEE
|
SKILL
|
LANGUAGE
|
Instead, they should be
represented in the two records
EMPLOYEE
|
SKILL
|
EMPLOYEE
|
LANGUAGE
|
|
Note that other fields, not
involving multi-valued facts, are permitted to occur in the record, as in the
case of the QUANTITY field in the earlier PART/WAREHOUSE example.
The main problem with violating
fourth normal form is that it leads to uncertainties in the maintenance
policies. Several policies are possible
for maintaining two independent multi-valued facts in one record:
(1) A disjoint format, in which a
record contains either a skill or a language, but not both:
EMPLOYEE
|
SKILL
|
LANGUAGE
|
Smith
|
Cook
|
|
Smith
|
Type
|
|
Smith
|
French
|
|
Smith
|
Greek
|
|
Smith
|
German
|
This is not much different from
maintaining two separate record types.
(We note in passing that such a format also leads to ambiguities
regarding the meanings of blank fields. A blank SKILL could mean the person has
no skill, or the field is not applicable to this employee, or the data is
unknown, or, as in this case, the data may be found in another record.)
(2) A random mix, with three
variations:
(a) Minimal number of records,
with repetitions:
EMPLOYEE
|
SKILL
|
LANGUAGE
|
Smith
|
Cook
|
French
|
Smith
|
Type
|
Greek
|
Smith
|
Type
|
German
|
(b) Minimal number of records,
with null values:
EMPLOYEE
|
SKILL
|
LANGUAGE
|
Smith
|
Cook
|
French
|
Smith
|
Type
|
Greek
|
Smith
|
German
|
(c) Unrestricted:
EMPLOYEE
|
SKILL
|
LANGUAGE
|
Smith
|
Cook
|
French
|
Smith
|
Type
|
|
Smith
|
German
|
|
Smith
|
Type
|
Greek
|
(3) A "cross-product"
form, where for each employee, there must be a record for every possible
pairing of one of his skills with one of his languages:
EMPLOYEE
|
SKILL
|
LANGUAGE
|
Smith
|
Cook
|
French
|
Smith
|
Cook
|
Greek
|
Smith
|
Cook
|
German
|
Smith
|
Type
|
French
|
Smith
|
Type
|
Greek
|
Smith
|
Type
|
German
|
Other problems caused by
violating fourth normal form are similar in spirit to those mentioned earlier
for violations of second or third normal form.
They take different variations depending on the chosen maintenance
policy:
If there are repetitions, then
updates have to be done in multiple records, and they could become
inconsistent.
Insertion of a new skill may
involve looking for a record with a blank skill, or inserting a new record with
a possibly blank language, or inserting multiple records pairing the new skill
with some or all of the languages.
Deletion of a skill may involve
blanking out the skill field in one or more records (perhaps with a check that
this doesn't leave two records with the same language and a blank skill), or
deleting one or more records, coupled with a check that the last mention of
some language hasn't also been deleted.
Fourth normal form minimizes such
update problems.
Independence
We mentioned independent
multi-valued facts earlier, and we now illustrate what we mean in terms of the
example. The two many-to-many
relationships, employee:skill and employee:language, are
"independent" in that there is no direct connection between skills
and languages. There is only an indirect
connection because they belong to some common employee. That is, it does not matter which skill is
paired with which language in a record; the pairing does not convey any
information. That's precisely why all
the maintenance policies mentioned earlier can be allowed.
In contrast, suppose that an
employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only,
but can type in French, German, and Greek.
Then the pairings of skills and languages becomes meaningful, and there
is no longer an ambiguity of maintenance policies. In the present case, only the following form
is correct:
EMPLOYEE
|
SKILL
|
LANGUAGE
|
Smith
|
Cook
|
French
|
Smith
|
Type
|
French
|
Smith
|
Type
|
Greek
|
Smith
|
Type
|
German
|
Thus the employee:skill and
employee:language relationships are no longer independent. These records do not violate fourth normal
form. When there is an interdependence
among the relationships, then it is acceptable to represent them in a single record.
Multivalued Dependencies
For readers interested in
pursuing the technical background of fourth normal form a bit further, we
mention that fourth normal form is defined in terms of multivalued
dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are
defined essentially as relationships which accept the "cross-product"
maintenance policy mentioned above. That
is, for our example, every one of an employee's skills must appear paired with
every one of his languages. It may or
may not be obvious to the reader that this is equivalent to our notion of
independence: since every possible pairing must be present, there is no
"information" in the pairings.
Such pairings convey information only if some of them can be absent,
that is, only if it is possible that some employee cannot perform some skill in
some language. If all pairings are
always present, then the relationships are really independent.
We should also point out that
multivalued dependencies and fourth normal form apply as well to relationships
involving more than two fields. For
example, suppose we extend the earlier example to include projects, in the
following sense:
An employee uses certain skills
on certain projects.
An employee uses certain
languages on certain projects.
If there is no direct connection
between the skills and languages that an employee uses on a project, then we
could treat this as two independent many-to-many relationships of the form EP:S
and EP:L, where "EP" represents a combination of an employee with a
project. A record including employee,
project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and
E,P,L, respectively, would satisfy fourth normal form.
Fifth Normal Form
Ø The
original table must be reconstructed from the tables into which it has been
broken down.
Ø The
benefit of applying this rule ensures you have not created any extraneous
columns in your tables, and that all of the table structures you have created
are only as large as they need to be.
Fifth normal form deals with
cases where information can be reconstructed from smaller pieces of information
that can be maintained with less redundancy.
Second, third, and fourth normal forms also serve this purpose, but
fifth normal form generalizes to cases not covered by the others.
We will not attempt a
comprehensive exposition of fifth normal form, but illustrate the central
concept with a commonly used example, namely one involving agents, companies,
and products. If agents represent
companies, companies make products, and agents sell products, then we might
want to keep a record of which agent sells which product for which
company. This information could be kept
in one record type with three fields:
AGENT
|
COMPANY
|
PRODUCT
|
Smith
|
Ford
|
Car
|
Smith
|
GM
|
Truck
|
This form is necessary in the
general case. For example, although
agent Smith sells cars made by Ford and trucks made by GM, he does not sell
Ford trucks or GM cars. Thus we need the
combination of three fields to know which combinations are valid and which are
not.
But suppose that a certain rule
was in effect: if an agent sells a certain product, and he represents a company
making that product, then he sells that product for that company.
AGENT
|
COMPANY
|
PRODUCT
|
Smith
|
Ford
|
Car
|
Smith
|
Ford
|
Truck
|
Smith
|
GM
|
Car
|
Smith
|
GM
|
Truck
|
Jones
|
Ford
|
Car
|
In this case, it turns out that
we can reconstruct all the true facts from a normalized form consisting of
three separate record types, each containing two fields:
AGENT
|
COMPANY
|
COMPANY
|
PRODUCT
|
AGENT
|
PRODUCT
|
||
Smith
|
Ford
|
Ford
|
Car
|
Smith
|
Car
|
||
Smith
|
GM
|
Ford
|
Truck
|
Smith
|
Truck
|
||
Jones
|
Ford
|
GM
|
Car
|
Jones
|
Car
|
||
GM
|
Truck
|
These three record types are in
fifth normal form, whereas the corresponding three-field record shown
previously is not.
Roughly speaking, we may say that
a record type is in fifth normal form when its information content cannot be
reconstructed from several smaller record types, i.e., from record types each
having fewer fields than the original record.
The case where all the smaller records have the same key is
excluded. If a record type can only be
decomposed into smaller records which all have the same key, then the record
type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in
fourth, third, second, and first normal forms.
Fifth normal form does not differ
from fourth normal form unless there exists a symmetric constraint such as the
rule about agents, companies, and products.
In the absence of such a constraint, a record type in fourth normal form
is always in fifth normal form.
One advantage of fifth normal
form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith
sells cars is recorded only once; in the unnormalized form it may be repeated
many times.
It should be observed that
although the normalized form involves more record types, there may be fewer
total record occurrences. This is not
apparent when there are only a few facts to record, as in the example shown
above. The advantage is realized as more
facts are recorded, since the size of the normalized files increases in an
additive fashion, while the size of the unnormalized file increases in a
multiplicative fashion. For example, if
we add a new agent who sells x products for y companies, where each of these
companies makes each of these products, we have to add x+y new records to the
normalized form, but xy new records to the unnormalized form.
It should be noted that all three
record types are required in the normalized form in order to reconstruct the
same information. From the first two record types shown above we learn that
Jones represents Ford and that Ford makes trucks. But we can't determine whether Jones sells
Ford trucks until we look at the third record type to determine whether Jones
sells trucks at all.
The following example illustrates
a case in which the rule about agents, companies, and products is satisfied,
and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will
imply something untrue.
AGENT
|
COMPANY
|
PRODUCT
|
Smith
|
Ford
|
Car
|
Smith
|
Ford
|
Truck
|
Smith
|
GM
|
Car
|
Smith
|
GM
|
Truck
|
Jones
|
Ford
|
Car
|
Jones
|
Ford
|
Truck
|
Brown
|
Ford
|
Car
|
Brown
|
GM
|
Car
|
Brown
|
Toyota
|
Car
|
Brown
|
Toyota
|
Bus
|
AGENT
|
COMPANY
|
COMPANY
|
PRODUCT
|
AGENT
|
PRODUCT
|
||
Smith
|
Ford
|
Ford
|
Car
|
Smith
|
Car
|
||
Smith
|
GM
|
Ford
|
Truck
|
Smith
|
Truck
|
||
Jones
|
Ford
|
GM
|
Car
|
Jones
|
Car
|
||
Brown
|
Ford
|
GM
|
Truck
|
Jones
|
Truck
|
||
Brown
|
GM
|
Toyota
|
Car
|
Brown
|
Car
|
||
Brown
|
Toyota
|
Toyota
|
Bus
|
Brown
|
Bus
|
Observe that:
· Jones
sells cars and GM makes cars, but Jones does not represent GM.
· Brown
represents Ford and Ford makes trucks, but Brown does not sell trucks.
· Brown
represents Ford and Brown sells buses, but Ford does not make buses.
Fourth and fifth normal forms
both deal with combinations of multivalued facts. One difference is that the facts dealt with
under fifth normal form are not independent, in the sense discussed earlier.
Another difference is that, although fourth normal form can deal with more than
two multivalued facts, it only recognizes them in pairwise groups. We can best explain this in terms of the
normalization process implied by fourth normal form. If a record violates fourth normal form, the
associated normalization process decomposes it into two records, each
containing fewer fields than the original record. Any of these violating fourth
normal form is again decomposed into two records, and so on until the resulting
records are all in fourth normal form.
At each stage, the set of records after decomposition contains exactly
the same information as the set of records before decomposition.
In the present example, no
pairwise decomposition is possible. There is no combination of two smaller
records which contains the same total information as the original record. All three of the smaller records are
needed. Hence an information-preserving
pairwise decomposition is not possible, and the original record is not in
violation of fourth normal form. Fifth
normal form is needed in order to deal with the redundancies in this case.
UNAVOIDABLE REDUNDANCIES
Normalization certainly doesn't
remove all redundancies. Certain redundancies seem to be unavoidable,
particularly when several multivalued facts are dependent rather than
independent. In the example, it seems
unavoidable that we record the fact that "Smith can type" several
times. Also, when the rule about agents,
companies, and products is not in effect, it seems unavoidable that we record
the fact that "Smith sells cars" several times.
INTER-RECORD REDUNDANCY
The normal forms discussed here
deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the
"ultimate" normal form with respect to such redundancies.
Other redundancies can occur
across multiple record types. For the
example concerning employees, departments, and locations, the following records
are in third normal form in spite of the obvious redundancy:
EMPLOYEE
|
DEPARTMENT
|
DEPARTMENT
|
LOCATION
|
|
EMPLOYEE
|
LOCATION
|
In fact, two copies of the same
record type would constitute the ultimate in this kind of undetected
redundancy.
Inter-record redundancy has been
recognized for some time [1], and has recently been addressed in terms of
normal forms and normalization [8].
DE-NORMALIZATION
When
de-normalizing, the size of row has to be taken into account. If the size of the row is very large, very
few rows fit into a single block on disk.
Thus retrieval will be slow as large number of blocks are to be
retrieved to get the required row.
Breaking the Rules:
When to Denormalize, Sometimes
it's necessary to break the rules of normalization and create a database that
is deliberately less normal than it otherwise could be. You'll usually do this
for performance reasons or because the users of the database demand it. While
this won't get you any points with database design purists, ultimately you have
to deliver a solution that satisfies your users. If you do break the rules, however, and
decide to denormalize you database, it's important that you follow these
guidelines:
Break the rules deliberately;
have a good reason for denormalizing.
Be fully aware of the tradeoffs
this decision entails.
Thoroughly document this decision.
Create the necessary application
adjustments to avoid anomalies.
This last point is worth
elaborating on. In most cases, when you
denormalize, you will be required to create additional application code to
avoid insertion, update, and deletion anomalies that a more normalized design
would avoid. For example, if you decide
to store a calculation in a table, you'll need to create extra event procedure
code and attach it to the appropriate event properties of forms that are used
to update the data on which the calculation is based.
If you're considering
denormalizing for performance reasons, don't always assume that the
denormalized approach is the best. Instead, first fully normalize the database
(to Third Normal Form or higher) and then denormalize only if it becomes
necessary for reasons of performance.
If you're considering
denormalizing because your users think they need it, investigate why. Often they will be concerned about
simplifying data entry, which you can usually accomplish by basing forms on
queries while keeping your base tables fully normalized.
REFERENCE
The first three rules of
normalization were outlined by Dr. E.F. Codd in his 1972 paper, "Further
Normalization of the Data Base Relational Model". Other rules have since
been theorized by later Set Theory and Relational Algebra mathematicians.