File Reorganization
by Alex Kodat
Background
Model 204 is a high performance database package with an integrated
teleprocessing monitor and programming language (User Language). Because of
these latter features it is easy to forget that the primary function of
Model 204 is to provide your end users with rapid access to the data stored
in your database files. While Model 204 is relatively forgiving of a bad
database file design, it is the rare Model 204 user that cannot reduce
Model 204 resource consumption and/or improve end user response time
with some changes to database file design.
Often, the only way to achieve the changes required to improve Model 204
performance is to reorganize your Model 204 database files. This generally
requires that you unload the contents of your database files to intermediate
sequential datasets and then reload the sequential datasets back into the
same or new database files. Most Model 204 customers must weigh the
possible benefits of a file reorganization against the cost in both machine
resources and programmer time.
Fast Reorg
has dramatically changed this equation.
By reducing resource consumption of a file reorganization by 50 to 90
percent and by greatly simplifying the progamming required to perform a file
reorganization, users who could not justify performing
reorgs before should consider them now, and users who are performing reorgs
now can perform them more frequently and/or at lower cost.
Fast Reorg is a product developed by
Sirius Software that is intended to simplify and speed
up the Model 204 database file reorganization process. It is made up
of two component products - Fast Unload and
Fast Reload that simplify and speed up the unload
and reload parts of the reorg process.
This document is intended to help you determine whether you
need to reconsider or revise your current reorg strategies given the
added speed and ease of use of the Fast Reorg package.
Reasons for Performing Reorganizations
Model 204 database files can be broadly classified into three types
based on their usage patterns.
- Read only files
- These are files that do not change over time and are
used for retrieval purposes only. These types of files can often benefit from
a reorganization but do not require regular reorganization.
- Add only files
- These are files that have data added to them but do not have data deleted.
These files often need reorganization especially if they are sort or hash key
files.
- General update files
- These are files that have data added and deleted. These are the most
complex files to deal with and almost always need regular reorganization to
maintain good performance.
There are several reasons why one might wish to reorganize Model 204
database files :
- To change the basic file organization (entry order, sort key, hash key,
unordered, reuse record number). You might do this as speculative thing
since it is often difficult to predict the effect of changing the basic
file organization. One of the great strengths of Fast Reorg
is that it greatly simplifies changing the file organization. Many shops
could benefit greatly by changing file organization but do not do so
because the level of effort required to effect the change is too high.
- To increase table B density. In read only or add only files, BRECPPG
(table B records per page) or BRESERVE (table B reserved space)
might be set to a lower value than necessary. In general update files,
a random pattern of additions and deletions will often result in relatively
empty pages interspersed with relatively full pages. In addition, users
generally set BRECPPG lower or BRESERVE higher than necessary to avoid
extension records in general update files.
All these factors can result in wasted table B
space that results in buffer pool and disk space utilizations that are
unnecessarily high.
- To eliminate extension records. This is a bit of an issue in
add only files and a very important one for general update files.
A random pattern of additions and deletions will often result in extension
records while at the same time many table B pages are left almost empty.
To avoid this, users often set BRECPPG much lower or BRESERVE much higher
than they need, resulting in a lot of wasted space in table B. By doing
frequent reorgs, you can monitor the creation of extension records and set
BRECPPG and BRESERVE as required. The ability to perform a fast reorg
allows you to set these parameters "tighter" than you ordinarily would,
resulting in better online performance because of the higher table B density.
- To ensure extension records are physically close to their master record.
This is an issue in general update files. In certain files, with widely
varying record lengths and/or extremely long records extension records are
unavoidable.
In general update files extension records will often be far away from the
master records, resulting in disk head movement delays and low DASD cache hit
ratios. A reorg will place extension records on pages physically contiguous
with the pages containing the master records.
- To cluster data by the value in certain "key" fields. Often, when we
refer to a record with a particular value for a field, we'll also refer
to other records with that same value. By clustering data with the same values
for a particular field, you can reduce I/O by increasing the likelihood
that all or most records you will be examining for a request will be on
a single or a few table B pages. The way this is ordinarily accomplished
is by unloading data from the database file, sorting it and then reloading
it. The sort step precludes the use of the generic PAI format for the
reorg. Because of this, many users do not cluster their data even though
the benefits might be tremendous. Fast Reorg provides a generic
way of sorting unloaded data allowing you to easily take advantage of this
technique. For example, to unload data sorted by field FOO, you would
simply specify
UAI SORT FOO
in your Fast Unload program.
- To eliminate the need to use "reuse record number" files.
In general, these files are only an issue with general update files.
These type of files
have problems associated with them that make them best avoided if possible.
Among these problems are extra record allocation overhead, more potential
for bugs, more severe complications when doing FIND WITHOUT LOCKS or
FOR RECORD NUMBER statements and difficulties in dealing with invisible keys.
If you have a file that has frequent record additions and deletions, your
choices are reuse record number files or frequent reorgs.
Fast Reorg might reduce the cost of doing frequent reorgs to the
point where you can afford the luxury of avoiding reuse record number files.
- To recover record numbers in non-reuse record number files. In
general update files of this type where entire records are deleted
your record number usage will tend to resemble a swiss cheese after a
while. Outside of the negative effects on table B utilization, this will
also increase table C and table D utilization if the number of wasted
record numbers in the file causes it to require more segments than the
number of "non-deleted" records would justify. For example, if a
non-reuse record number file had 300,000 records added to it and 120,000
records deleted from it over time, it would require index data for 7
segments. By reorganizing this file you could reduce this requirement
to 4 segments. This could reduce the space requirements for table C, the
ordered index, bitmaps and lists in table D and CPU requirements for
performing finds on this file.
- To eliminate spill page usage in hash and sort key files. In add
only and general update hash and sort key files, spill page usage will
naturally go up over time. Reorganizing the file will eliminate the
use of spill pages. Note that in a hash key file you will have to
increase the size of table B to reduce spill page usage.
- To correct the size of table C. This is mainly an issue in add
only and general update files. In both these files, if you have underestimated
your table C requirements, you will start seeing table C retries. If you
have overestimated your table C requirements, your buffer pool could be
unnecessarily filled with table C pages containing only entries for a single
fieldname/value pair. This can result in unnecessary disk I/O and hence
slow response time.
- To produce better locality on table D lists and bitmaps.
This is an issue for add only and general update files that are larger than 1
segment (49,152 records). Any key fieldname/value pair that occurrs
more than once in a segment will have a list or bitmap in table D.
Since bitmaps are placed on the first available page in table D and lists are
added to a current list page that is independent of fieldname/value, the
bitmaps and lists associated with a particular fieldname/value pair will tend
to become scattered throughout table D over time. For example, if a file
grows from 1 to 58 segments
over time, a FIND on a particular fieldname/value pair might require examining
58 different, widely scattered table D pages for bitmaps and lists. This could
produce excessive I/O, high buffer pool requirements and slow response times
for simple queries. Reorganizing the file will result in the list and bitmap
pages for a particular fieldname/value to be close together. In fact, list
data for a single fieldname/value pair will generally be packed on as few
table D pages as possible. Thus, it is possible that a reorg of the file in
this example will reduce the number of table D pages requiring examination
for a simple query from 58 to 1. Even in the case of bitmap pages, you will
reap the benefits of reducing disk head movement and improving cache hit
ratios on cached DASD by
performing a file reorganization. Note that if you reload your data in multiple
passes in a FLOD or FILELOAD, the table D list and bitmap data will be
clustered by pass number and then fieldname/value.
- To reorganize the ordered index. This is an issue in
add only and general update files. As a database file grows, the ordered
index will tend to be scattered throughout table D. By performing a
reorganization, related ordered index pages will tend to be grouped together,
reducing disk head movement
and increasing cache hit ratios on cached DASD. In addition, a reorganization
will produce relatively full leaf and node pages (LRESERVE and NRESERVE
percent full) in the ordered index. Over time, the leaf and node pages will
tend to approach 50 percent
full, increasing disk space and buffer pool requirements.
Note that the Model 204 REORGANIZE command allows you to easily reorganize
your ordered index
on a regular basis without performing a full file reorg. However, this command
will not provide the other benefits of a full file reorg.
- To split a file into groups or join group files into a single file. In a
heavy update
environment, there might be resource locking problems that will be reduced
by splitting a file into multiple group files. This will often be the case in
add only
files where you can limit most update activity to a single file, while
retrieval activity will be spread among several files.
- To experiment with file settings. There are a myriad of Model 204 file
parameters that
will affect performance. Among these are ASIZE, FILEORG, BRECPPG, BRESERVE,
CSIZE, CRETRIES, DRESERVE, IMMED, LRESERVE, NRESERVE and SPLITPCT.
You will undoubtedly find experts who disagree on the appropriate
settings for these parameters in your environment and it is often difficult to
predict the exact results of a change on system performance. Because, of this
and the cost of doing a reorg, many shops use the "if it ain't broke don't fix
it" philosophy, often forgoing tremendous performance improvements for the
safety of the status quo.
Because Fast Reorg reduces the time required to perform a reorg, you can
more easily experiment with parameter settings in the knowledge that you can
quickly back out of a failed experiment.
The above list is not meant to be all inclusive. There are many other
reasons for reorganizing your Model 204 database files.
However, the list does include most of the common reasons.
In summary, the benefits of performing file reorganizations tend to
be reduced disk I/O, reduced disk buffer pool requirements and better
use of DASD cache. These all come as the result of denser packing of
data in the various Model 204 tables and as the result of clustering
of related data. In some cases, these will even produce reduced CPU
utilization. Ultimately, all of these produce better response to
your end users.
The Ordered Index
One of the areas of database design that is a source of mystery to
many Model 204 DBAs is the use of the ordered index. Specifically,
- When should a field be defined as "KEY" and when should it be defined
as "ORDERED".
- How can you tell if your ordered index needs to be reorganized ?
- What are the correct settings of the ordered field parameters ?
The following are the advantages of "KEY" fields over "ORDERED" fields.
- A find on a "KEY" field is somewhat more CPU efficient than an "ORDERED"
field. The CPU savings on a FIND are typically from 5% to 70%. These CPU
savings should be considered in the light that FIND processing is typically
a relatively small percentage of the CPU used by most applications. For
example,
if 20% of the CPU used by an application is for FIND processing (most
applications are lower) and the CPU difference between a "KEY" find and an
"ORDERED" find is 30%, the overall CPU difference between a "KEY" find and
an "ORDERED" find is about 6%.
- A find on on a "KEY" field will generally require only a single index page
to be read. A find on an "ORDERED" field will always require reading a page
at each layer of the ordered index. Thus, if the the ordered index has a
depth of 3, a find on an "ORDERED" field will require 3 logical page reads
to find the key value. Note that if the buffer pool is large enough and the
ordered index activity is heavy,
the higher level node pages will tend to remain in the buffer pool. In this
case, the real cost of using "ORDERED" fields over "KEY" fields is the extra
buffer pool pages required for the higher level ordered index nodes.
- An addition or deletion of a "KEY" field value is somewhat more CPU
efficient than the same processing for an "ORDERED" field. Except in the most
update intensive
environments this difference will be a negligible percentage of overall CPU.
The following are the advantages of "ORDERED" fields over "KEY" fields.
- One doesn't have to worry about the ordered index filling up. As long as
table D usage is monitored reasonably frequently and the number of table D
pages is increased whenever there are insufficient extra pages, the ordered
index will never
fill up. Once the size of table C is set it can not be changed. The only
way to get out of a table C full condition is to do a reorg.
- The ordered index always has relatively good data density. Unless you set
LRESERVE, NRESERVE or SPLITPCT incorrectly, the ordered index will rarely waste
more than an average of 50% of the space on ordered index pages. If table C
is not sized correctly,
on the other hand, it is quite possible to have an most of each table C page
empty. In fact, if table C density gets too high, the table C index starts
performing badly because of rehashes and page retries.
- The data in the ordered index is generally more compact than that in
table C.
Table C uses 7 bytes for a fieldname/value pair in the index and 7 bytes for
each segment that contains that fieldname/value. The ordered index uses
the length of the value string plus 3 bytes for a fieldname/value pair and
2 to 7 bytes (generally
2 to 5) for each segment that contains that fieldname/value. For example, a
fieldname/value pair with a value string of length 14 and 8 segments
containing data might require 7+8*7 or
63 bytes in table C but 14+3+8*4 or 49 bytes in the ordered index.
- Ordered index fields provide efficient FOR EACH VALUE IN ORDER processing.
- The ordered index provides better data locality when there is a pattern
to retrievals.
For example, if a date field (in YY/MM/DD format) is indexed in table C, the
entries for
a given day would probably be on a different page from the previous day or
the next day. In the ordered
index, the entries for consecutive days would almost always be on the same
page. Since date
retrievals tend to be clustered in time, relatively few pages would be
required in the
buffer pool to satisfy most find requests.
- For key fields with relatively few values, all the values might
fit on a single or few pages
of the ordered index while each value might be on a different page
in table C. For example,
if STATE is an "ORDERED" field, all values for STATE will probably
fit on a single table D
page so any find based on STATE would always require the same table D
page in the buffer pool.
On the other hand if STATE is a "KEY" field, each possible value might
be on a different table
C page requiring up to 50 pages in the buffer pool to satisfy all
find requests.
- The ordered index can be easily reorganized with the REORGANIZE command.
- Invisible ordered index data can be preserved by Fast Reorg over
a file reorganization.
- You can do pattern matching with ordered index fields.
- It is easier and quicker to determine ordered index usage than
table C usage.
Comparing the advantages and disadvantages of "ORDERED" and "KEY"
fields it is
difficult to escape the conclusion that it is generally better to
define a field
as "ORDERED" rather than "KEY" unless there are compelling reasons
otherwise. Similarly,
it is generally not a good idea to define a field as both "KEY" and "ORDERED".
The two major cases where one would define a field as "KEY" rather than
"ORDERED" are :
- When CPU resources are so limited that even the slightest savings in CPU
overwhelm any other considerations.
- When a field has a huge number of values and the finds against them have
no pattern so
that one never expects a requested index page to be found in the buffer pool.
In this case, by using a "KEY" field rather than "ORDERED" one would avoid the
extra overhead of reading or buffering higher level node pages and would
simply live with the overhead of doing a physical read for every find. Note
that in the case where this type of field only occurs once per record, it
might be more efficient
to define the file as a hashkey file with the described field as the hashkey.
The major problem with the ordered index that can occur with the ordered index
that can would require a reorganization is low ordered index density. That is,
you might have more ordered index pages than are required by the quantity of
ordered index data. This can result in excess I/O or buffer pool usage for
the ordered index. Fortunately, there are two simple formulas that measure
ordered index density. These formulas are :
OINODEP = 100 * (OINBYTES / OINODES) / PAGESZ
OILEAFP = 100 * (OINBYTES / OILEAVES) / PAGESZ
where OINBYTES, OILEAVES and OINODES are viewable file parameters, PAGESZ is
a viewable system parameter,
OINODEP is the overall ordered index density and OILEAFP is the leaf page
density of the ordered index.
OINODEP and OILEAFP will usually be between 0 and 100 and will generally
be approximately equal. After an ordered index reorganization, OILEAFP
will be approximately 100 - LRESERVE (assuming LRESERVE is the same for all
fields).
Unfortunately, the values of OILEAFP and OINODEP are affected by Model 204
ordered index prefix compression where any common part of all values
on an ordered index page is stored only once rather than for each value.
But Model 204 counts the uncompressed size for OINBYTES.
For example, if all the values on a page start with
'1999062211' Model 204 compresses out the common prefix saving only
one copy of it on the page but counting the full 10 bytes for each entry.
Let's say for the sake of argument that the rest of each of these keys is
an additional 6 bytes plus 1 byte for a length and 3 bytes for a single
record number (let's say each record has a unique time stamp). This would
mean each entry on a page would be credited with 20 bytes for OINBYTES
purposes while only using 10 bytes. In such a situation, a page that's
only 70% full would register as being 140% full by the OILEAFP
measure.
The bottom line is that OINODEP and OILEAFP can exceed 100 and are not
a completely accurate representation of ordered index density in cases
where ordered index prefix compression is being used to great advantage.
In add only files, OILEAFP will gradually decrease after a reorg,
never getting below 50 (unless SPLITPCT is set to something other than 50).
After a while OILEAFP might start increasing again and will fluctuate
between 60 and 85 over time.
In a general update file, OILEAFP will gradually decrease after a reorg
possibly getting to below 50 and perhaps fluctuating over time.
OILEAFP gives one a simple value that indicates
the density of data in the ordered index.
One can increase ordered index density to approximately 100 - LRESERVE
by doing an ordered index reorganization. For example, if OILEAFP is
52.8 in a particular file one can set LRESERVE for all "ORDERED" fields to 90
and do an ordered index reorganization, getting a 70 percent increase in
ordered index density.
What are the correct values for the ordered index field parameters - LRESERVE,
NRESERVE, SPLITPCT and IMMED? Because of the relatively forgiving nature
of the ordered index, the defaults are never going to produce significantly
worse performance than anything else. In a read only file one would get some
benefit to setting NRESERVE and LRESERVE to 0 before doing a reorg. If one is
willing to do frequent reorgs to keep OILEAFP at a high value, one might
benefit
by setting NRESERVE and LRESERVE to a smaller value (such as 5), although the
benefits of this will probably be too small to justify the need for extra
reorgs. Finally, the optimal setting of the IMMED parameter is highly data
dependent and is not
likely to produce significant performance benefits except in a few unusual
applications.
How to Reorganize your Files
The classic options for reorganizing Model 204 database files
are frought with tradeoffs. The two basic ways one can perform
reorgs are :
- By doing a 'PAI' dump of table B and then using a generic FLOD
program to reload the data.
- By doing a formatted unload of your data using PRINT or WRITE IMAGE
statements and then writing special purpose FLOD programs to reload this
formatted data.
The advantages of the PAI format are :
- One can use the same unload and reload programs for all files
as longs as the files are not hash or sort key files.
- There is no danger of losing data with this format (except null data).
The disadvantage of the PAI format are :
- It is slow on both the unload and the reload because of the
requirement in both cases to look up field codes in table A.
- It is impossible to meaningfully sort the intermediate sequential
data set. This precludes the use of this format for reorganizing
hash key files and for clustering data in non-sort key files.
- The intermediate work file tends to be unnecessarily large because
of the space required to hold fieldnames.
- Invisible keys are lost over a reorg.
The advantages of a formatted reorg are :
- It can be very fast if most records have approximately the same
number of occurrences of each field and each occurrence is of approximately
the same length.
- It can be used to reorganize hash or sort key files or to cluster data
because data in a fixed format can be easily sorted by sort packages.
The disadvantages of a formatted reorg are :
- It requires a different unload and load program for each file.
- The unload can lose data if a field has more or longer occurrences
for a field than were anticipated by the person who wrote the unload/reload
code. While there are ways around this, these require expertise in FILELOAD
language that is difficult to find and most shops wouldn't want
to depend on.
- It can be slow if the number of occurrences and lengths of certain
fields are highly variable.
- Invisible keys are lost over a reorg.
As can be clearly seen from the above lists of advantages and
disadvantages, neither format provides a highly satisfactory
mechanism for performing reorgs.
Fast Reorg combines the advantages of the two
techniques and adds a few that neither has. A typical reorganization
using Fast Reorg consists of a Fast Unload
program that contains
OPEN filename
UAI
and a Fast Reload program that contains
FILELOAD -1,-1,0,1000000,10000,10000,64
LAI
END
If you wish to sort the unloaded data for clustering by field FOO
or because field FOO is the sort key simply specify
OPEN filename
UAI SORT FOO
in the unload program. If you are reorganizing a hash key file and
you are changing BSIZE to 20000 in the reorganized file simply specify
OPEN filename
UAI HASH FOO BSIZE 20000
on the unload. If you want invisible ordered index data to be preserved
over a reorg simply specify
OPEN filename
UAI OINDEX
on the unload. The OINDEX parameter can be combined with the HASH and
SORT parameters described above.
There are other options on UAI and LAI that effect performance but
UAI and LAI even without any tuning will generally be considerably faster
than any of the old style reorgs.
To summarize, then, the advantages of the UAI/LAI format are :
- Almost no programming is required.
- It is fast.
- The unloaded data is highly compact.
- The unloaded data can be easily sorted in either hash or sort key
order. Multiple sort keys are also supported.
- There is no danger of losing data, even null values.
- It allows you to preserve invisible ordered index keys over a
reorg.
The disadvantage of UAI/LAI format :
- If the file is extremely heavily indexed, sorting and applying the index
data could still take a long time.
- It is unable to preserve invisible non-ordered index keyed fields over a
reorg.
This latter point should be viewed mainly as a migration issue. If you
redefine your invisible key fields as invisible ordered index fields and
regenerate your invisible keys, you will never again have to worry about
losing them over a reorg.
Conclusion
In order to provide good system response and throughput to your end users,
it is important to develop and maintain a good database file design. Most
Model 204
/sers recognize this and have designs in place and strategies for maintaining
them. Fast Reorg, however, provides such a significant improvement
in the speed and simplicity of reorganizing Model 204 database files that you
should at least reconsider and perhaps rework your database designs and file
maintenance strategies. Fast Reorg could provide that critical
tool that lets you control your Model 204 database files rather than having
them control you.