Sybase® SQL Server Performance and Tuning Guide


Chapter 2

Database Design and Denormalizing for Performance

How Design Is Related to Performance

Performance and tuning are built on top of good database design. They aren't panaceas. If you start with a bad database design, the information in the other chapters of this book may help you speed up your queries a little, but good overall performance starts with good design.

This chapter doesn't attempt to discuss all of the material presented in database design courses. It cannot teach you nearly as much as the many excellent books available on relational database design. This chapter presents some of the major design concepts and a few additional tips to help you move from a logical database design to a physical design on SQL Server.

Database Design

Database design is the process of moving from real-world business models and requirements to a database model that meets these requirements. For relational databases such as SQL Server, the standard design creates tables in Third Normal Form.

When you translate an Entity-Relationship model, in Third Normal Form (3NF), to a relational model:

Physical Database Design for SQL Server

Based on access requirements and constraints, implement your physical database design as follows:

Normalization

When a table is normalized, the non-key columns depend on the key, the whole key, and nothing but the key.

From a relational model point of view, it is standard to have tables that are in Third Normal Form. Normalized physical design provides the greatest ease of maintenance, and databases in this form are clearly understood by teams of developers.

However, a fully normalized design may not always yield the best performance. It is recommended that you design for Third Normal Form, and then, as performance issues arise, denormalize to solve them.

Levels of Normalization

Each level of normalization relies on the previous level, as shown in Figure 2-2. For example, to conform to 2NF, entities must be in 1NF.

Figure 2-2: Levels of normalization
[IMAGE]

When determining if a database is in a normal form, start with the assumption that the relation (or table) is not normalized. Then apply the rigor of each normal form level to it.

Benefits of Normalization

Normalization produces smaller tables with smaller rows:

The benefits of normalization include:

While fully normalized databases require more joins, joins are generally very fast if indexes are available on the join columns. SQL Server is optimized to keep higher levels of the index in cache, so each join performs only one or two physical I/Os for each matching row. The cost of finding rows already in the data cache is extremely low.

First Normal Form

The rules for First Normal Form are:

The table in Figure 2-3 violates first normal form, since the dept_no column contains a repeating group:

Figure 2-3: A table that violates first normal form
[IMAGE]

Normalization creates two tables and moves dept_no to the second table:

Figure 2-4: Correcting first normal form violations by creating two tables
[IMAGE]

Second Normal Form

For a table to be in Second Normal Form, every non-key field must depend on the entire primary key, not on part of a composite primary key. If a database has only single-field primary keys, it is automatically in Second Normal Form.

In the table in Figure 2-5, the primary key is a composite key on emp_num and dept_no. But the value of dept_name depends only on dept_no, not on the entire primary key.

Figure 2-5: A table that violates second normal form
[IMAGE]

To normalize this table, move dept_name to a second table as shown in Figure 2-6.

Figure 2-6: Correcting second normal form violations by creating two tables
[IMAGE]

Third Normal Form

For a table to be in Third Normal Form, a non-key field cannot depend on another non-key field. The table in Figure 2-7 violates Third Normal Form because the mgr_lname field depends on the mgr_emp_num field, which is not a key field.

Figure 2-7: A table that violates Third Normal Form
[IMAGE]

The solution is to split the Dept table into two tables, as shown in Figure 2-8. In this case, the Employees table, shown in Figure 2-4 already stores this information, so removing the mgr_lname field from Dept brings the table into Third Normal Form.

Figure 2-8: Correcting Third Normal Form violations by creating two tables
[IMAGE]

Denormalizing for Performance

Once you have created your database in normalized form, you can perform benchmarks and decide to back away from normalization to improve performance for specific queries or applications.

The process of denormalizing:

Good reasons for denormalizing are:

Risks of Denormalization

Denormalization should be based on thorough knowledge of the application, and it should be performed only if performance issues indicate that it is needed. For example, the ytd_sales column in the titles table of the pubs2 database is a denormalized column that is maintained by a trigger on the salesdetail table. The same values can be obtained using this query:

select title_id, sum(qty)
    from salesdetail
    group by title_id

To obtain the summary values and the document title requires a join with the titles table:

select title, sum(qty)
    from titles t, salesdetail sd
    where t.title_id = sd.title_id
    group by title

It makes sense to denormalize this table if the query is run frequently. But there is a price to pay: you must create an insert/update/delete trigger on the salesdetail table to maintain the aggregate values in the titles table. Executing the trigger and performing the changes to titles adds processing cost to each data modification of the qty column value.

This situation is a good example of the tension between decision support applications, which frequently need summaries of large amounts of data, and transaction processing applications, which perform discrete data modifications. Denormalization usually favors one form of processing at a cost to others.

Figure 2-9: Balancing denormalization issues
[IMAGE]

Whatever form of denormalization you choose, it has the potential for data integrity problems that must be carefully documented and addressed in application design.

Disadvantages of Denormalization

Denormalization has these disadvantages:

Performance Advantages of Denormalization

Denormalization can improve performance by:

Denormalization Input

When deciding whether to denormalize, you need to analyze the data access requirements of the applications in your environment, and their actual performance characteristics. Often, good indexing and other solutions solve many performance problems.

Some of the issues to examine when considering denormalization include:

Denormalization Techniques

The most prevalent denormalization techniques are:

In addition, you can duplicate or split tables to improve performance. While these are not denormalization techniques, they achieve the same purposes and require the same safeguards.

Adding Redundant Columns

You can add redundant columns to eliminate frequent joins. For example, if frequent joins are performed on the titleauthor and authors tables in order to retrieve the author's last name, you can add the au_lname column to titleauthor.

Figure 2-10: Denormalizing by adding redundant columns
[IMAGE]

Adding redundant columns eliminates joins for many queries. The problems with this solution are that it:

Adding Derived Columns

Adding derived columns can help eliminate joins and reduce the time needed to produce aggregate values. The total_sales column in the titles table of the pubs2 database provides one example of a derived column used to reduce aggregate value processing time.

The example in Figure 2-11 shows both benefits. Frequent joins are needed between the titleauthor and titles tables to provide the total advance for a particular book title.

Figure 2-11: Denormalizing by adding derived columns
[IMAGE]

You can create and maintain a derived data column in the titles table, eliminating both the join and the aggregate at run time. This increases storage needs, and requires maintenance of the derived column whenever changes are made to the titles table.

Collapsing Tables

If most users need to see the full set of joined data from two tables, collapsing the two tables into one can improve performance by eliminating the join.

For example, users frequently need to see the author name, author ID, and the blurbs copy data at the same time. The solution is to collapse the two tables into one. The data from the two tables must be in a one-to-one relationship to collapse tables.

Figure 2-12: Denormalizing by collapsing tables
[IMAGE]

Collapsing the tables eliminates the join, but loses the conceptual separation of the data. If some users still need access to just the pairs of data from the two tables, this access can be restored by queries that select only the needed columns or by using views.

Duplicating Tables

If a group of users regularly needs only a subset of data, you can duplicate the critical table subset for that group.

Figure 2-13: Denormalizing by duplicating tables
[IMAGE]

The kind of split shown in Figure 2-13 minimizes contention, but requires that you manage redundancy and possible latency.

Splitting Tables

Sometimes, splitting normalized tables can improve performance. You can split tables in two ways:

Splitting tables--either horizontally or vertically--adds complexity to your applications. There usually needs to be a very good performance reason.

Horizontal Splitting

Use horizontal splitting in the following circumstances:

Generally, horizontal splitting adds a high degree of complication to applications. It usually requires different table names in queries, depending on values in the tables. This complexity alone usually far outweighs the advantages of table splitting in most database applications. As long as the index keys are short, and the indexes are used for queries on the table (rather than table scans being used), doubling or tripling the number of rows in the table may increase the number of disk reads required for a query by only one index level.

Figure 2-15 shows how the authors table might be split to separate active and inactive authors:

Figure 2-15: Horizontal partitioning of active and inactive data
[IMAGE]

Vertical Splitting

Use vertical splitting in the following circumstances:

Vertical table splitting makes even more sense when both of the above conditions are true. When a table contains very long columns that are not accessed frequently, placing them in a separate table can greatly speed the retrieval of the more frequently used columns. With shorter rows, more data rows fit on a data page, so fewer pages can be accessed for many queries.

Figure 2-16 shows how the authors table can be partitioned.

Figure 2-16: Vertically partitioning a table
[IMAGE]

Managing Denormalized Data

Whatever denormalization techniques you use, you need to develop management techniques to ensure data integrity. Choices include:

From an integrity point of view, triggers provide the best solution, although they can be costly in terms of performance.

Using Triggers to Manage Denormalized Data

In Figure 2-17, the sum_adv column in the titles table stores denormalized data. A trigger updates the sum_adv column whenever the advance column in titleauthor changes.

Figure 2-17: Using triggers to maintain normalized data
[IMAGE]

Using Application Logic to Manage Denormalized Data

If your application has to ensure data integrity, it will have to ensure that the inserts, deletes, or updates to both tables occur in a single transaction.

Figure 2-18: Maintaining denormalized data via application logic
[IMAGE]

If you use application logic, be very sure that the data integrity requirements are well documented and well known to all application developers and to those who must maintain applications.

Note: Using application logic to manage denormalized data is risky. The same logic must be used and maintained in all applications that modify the data.

Batch Reconciliation

If 100-percent consistency is not required at all times, you can run a batch job or stored procedure during off hours to reconcile duplicate or derived data.

You can run short, frequent batches or longer, less frequent batches.

Figure 2-19: Using batch reconciliation to maintain data
[IMAGE]