Key Generation While Using Partition
If the collection is partitioned, stored procedures execute in the transaction scope of a single partition key. Each stored procedure execution must therefore include the partition key value that identifies the scope in which the transaction runs.
Surrogate key generation in the Transformer
Create a unique counter in DataStage
This entry describes various ways of creating a unique counter in DataStage jobs.
A parallel job has a Surrogate Key stage that creates unique IDs. However, it does not support conditional code, and it may be more efficient to add a counter to an existing Transformer than to add a new stage.
In a server job, there is a set of key increment routines installed in the routine SDK samples that offer a more complex counter that remembers values between job executions.
The following section outlines a transformer only technique.
Steps
In a DataStage job the easiest way to create a counter is within the Transformer stage with a Stage Variable:
svMyCounter = svMyCounter + 1
This simple counter adds 1 each time a row is processed.
The counter can be given a seed value by passing a value in as a job parameter and setting the initial value of svMyCounter to that job parameter.
In a parallel job this simple counter will create duplicate values on each node as the transformer is split into parallel instances. It can be turned into a unique counter by using special parallel macros.
- Create a stage variable for the counter, e.g. svCounter.
- At the Stage Properties form set the Initial Value of the Stage Variable to '@PARTITIONNUM - @NUMPARTITIONS + 1'.
- Set the derivation of the stage variable to 'svCounter + @NUMPARTITIONS'. You can embed this in an IF statement if it is a conditional counter.
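Remember that this method produces a gap-free sequence only if your data is evenly balanced, i.e. an equal number of rows goes through each partition. The values stay unique either way, because each partition starts at a different offset and steps by @NUMPARTITIONS. Alternative syntax is: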
@INROWNUM * @NUMPARTITIONS + @PARTITIONNUM
(@NUMPARTITIONS * @INROWNUM) + @PARTITIONNUM - @NUMPARTITIONS + 1
Example (2 partitions):
2*1+0-2+1=1
2*1+1-2+1=2
2*2+0-2+1=3
2*2+1-2+1=4
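The arithmetic above can be checked with a small simulation. This is only a sketch in Python, not DataStage code: partition_counter and closed_form are hypothetical helper names, and the function arguments stand in for the @PARTITIONNUM, @NUMPARTITIONS and @INROWNUM macros.

```python
def partition_counter(partition_num, num_partitions, row_count):
    """Yield the counter values one transformer instance would produce."""
    # Initial value, as set in the Stage Properties form.
    sv_counter = partition_num - num_partitions + 1
    for _ in range(row_count):
        # Derivation, evaluated once per processed row.
        sv_counter = sv_counter + num_partitions
        yield sv_counter


def closed_form(row_num, partition_num, num_partitions):
    """Same value computed from the 1-based row number on that partition."""
    return num_partitions * row_num + partition_num - num_partitions + 1


# Balanced input: two partitions, two rows each.
balanced = {p: list(partition_counter(p, 2, 2)) for p in range(2)}

# Unbalanced input: partition 0 sees three rows, partition 1 sees one row.
skewed = {
    0: list(partition_counter(0, 2, 3)),
    1: list(partition_counter(1, 2, 1)),
}

print(balanced)  # {0: [1, 3], 1: [2, 4]}
print(skewed)    # {0: [1, 3, 5], 1: [2]}
```

With balanced input the two partitions together produce 1 to 4 without gaps. With unbalanced input the values are still distinct, but 4 is never issued.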
Or
Sequence numbers can also be generated in DataStage using the following routines:
- KeyMgtGetNextVal
- KeyMgtGetNextValConn
Or
NextSurrogateKey()
This article assumes the following:
- You have a DB2 DPF environment and are familiar with DB2 DPF concepts.
- You are either designing a new table that will be hash-partitioned, or you have an existing hash-partitioned table that might have a data skew problem.
This article helps you to accomplish the following tasks:
- Choose the right initial partitioning key (PK) prior to defining and populating a table
- Evaluate the quality of the existing PK on a table
- Evaluate the quality of candidate replacement PKs on an existing table
- Change the PK while keeping the table online
This article provides the following type of help:
- Review of concepts and considerations
- Design guidelines
- New routines to estimate data skews for existing and new partitioning keys
Quick review of hash partitioning
In DPF environments, large tables are partitioned across multiple database partitions. There are several ways to partition a table, but the focus of this article is distribution by hash. For other ways to partition the table, refer to the article in the Related topics section.
Distribution by hash is based on partitioning keys. A partitioning key consists of one or more columns defined at table creation. For each newly inserted record, the partitioning key determines on which database partition this record should be stored. The placement is determined by an internal hashing function that takes the values in the column or columns defined as a partitioning key and returns the database partition number. A hashing function is a deterministic function, which means that for the same partitioning key values it always generates the same partitioning placement, assuming that there are no changes to the database partition group definition.
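The placement logic described above can be illustrated with a short Python sketch. The hash used here (CRC32 from the standard zlib module) is only a stand-in chosen for illustration; DB2's internal hashing function is different and is not reproduced here. The function name partition_for is hypothetical.

```python
import zlib


def partition_for(key_values, partitions):
    """Map partitioning key values to one of the given partition numbers."""
    # Stand-in hash: CRC32 of the key's text form. DB2's real function differs.
    encoded = "|".join(str(value) for value in key_values).encode("utf-8")
    return partitions[zlib.crc32(encoded) % len(partitions)]


group = [1, 2, 3, 4]  # database partitions in the partition group

# The same key always lands on the same partition ...
first = partition_for((42,), group)
second = partition_for((42,), group)
print(first == second)  # True

# ... as long as the partition group definition does not change.
placements = {key: partition_for((key,), group) for key in range(1000)}
```

The point of the sketch is determinism: placement depends only on the key values and on the list of partitions, which is why changing the partition group requires redistributing the data.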
The following syntax examples demonstrate the steps necessary to create a hash-partitioned table:
- Create a database partition group that specifies the database partitions that will participate in the partitioning. The following example illustrates how to create a database partition group PDPG on database partitions 1, 2, 3, and 4:
According to IBM Smart Analytics System and IBM InfoSphere™ Balanced Warehouse best practices, hash-partitioned tables should not be created on the coordinator or administration partition (database partition 0). Database partition 0 is typically used for storing small, non-partitioned lookup tables.
- Create the table space in the database partition group. All objects created in this table space will be partitioned across the database partitions specified in the database partition group definition:
- Create the table in the table space. At this point, the definition of the table is tied to the definition of the database partition group. The only way to change this relationship is to drop the table and recreate it in a different table space that is tied to a different database partition group.
In the following example, Table1 is created on database partitions 1, 2, 3, and 4, and is distributed based on a partitioning key on column COL1:
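The three steps can be sketched as a small Python helper that assembles the DDL text. The statements follow DB2 for Linux, UNIX, and Windows syntax as commonly documented, and the CREATE TABLESPACE form assumes a database with automatic storage. The table space name TS1, the column definitions, and the helper name hash_partitioned_ddl are hypothetical; only PDPG, Table1, COL1, and partitions 1 to 4 come from the text.

```python
def hash_partitioned_ddl(group, partitions, tablespace, table, columns, key):
    """Return the three DDL statements for a hash-partitioned table."""
    nums = ", ".join(str(p) for p in partitions)
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
    keys = ", ".join(key)
    return [
        # Step 1: the partition group names the participating partitions.
        f"CREATE DATABASE PARTITION GROUP {group} ON DBPARTITIONNUMS ({nums})",
        # Step 2: the table space is tied to the partition group.
        f"CREATE TABLESPACE {tablespace} IN DATABASE PARTITION GROUP {group}",
        # Step 3: the table is tied to the table space and hashed on the key.
        f"CREATE TABLE {table} ({cols}) IN {tablespace} DISTRIBUTE BY HASH ({keys})",
    ]


statements = hash_partitioned_ddl(
    "PDPG", [1, 2, 3, 4], "TS1", "TABLE1",
    [("COL1", "INTEGER NOT NULL"), ("COL2", "VARCHAR(20)")],  # hypothetical columns
    ["COL1"],
)
for statement in statements:
    print(statement + ";")
```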
Keep in mind that the database partition group definition can change. For example, new database partitions can be added. If this happens, a hash-partitioned table defined prior to the modification will not take advantage of the new partition until the database partition group is redistributed using the REDISTRIBUTE DATABASE PARTITION GROUP command.
Define the partitioning key
The partitioning key is defined using the DISTRIBUTE BY HASH clause in the CREATE TABLE statement. After the partitioning key is defined, it cannot be altered. The only way to change it is to recreate the table.
The following rules and recommendations apply to the partitioning key definition:
- The primary key and any unique index of the table must be a superset of the associated partitioning key. In other words, all columns that are part of the partitioning key must be present in the primary key or unique index definition. The order of the columns does not matter.
- A partitioning key should include one to three columns. Typically, the fewer the columns, the better.
- An integer partitioning key is more efficient than a character key, which is more efficient than a decimal key.
- If no partitioning key is provided explicitly in the CREATE TABLE statement, the following defaults are used:
  - If a primary key is specified in the CREATE TABLE statement, the first column of the primary key is used as the distribution key.
  - If there is no primary key, the first column that is not a long field is used.
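Two of the rules above are easy to express as checks. The following Python sketch is illustrative only: unique_key_is_valid and default_partitioning_key are hypothetical helpers, and whether a column counts as a long field is supplied by the caller rather than derived from DB2 data types.

```python
def unique_key_is_valid(partitioning_key, unique_key):
    """True if every partitioning key column appears in the unique key.

    Column order does not matter, so the comparison is done on sets.
    """
    return set(partitioning_key) <= set(unique_key)


def default_partitioning_key(primary_key, columns):
    """Pick the default key when CREATE TABLE names none.

    columns is a list of (name, is_long_field) pairs in definition order.
    """
    if primary_key:
        return [primary_key[0]]
    for name, is_long_field in columns:
        if not is_long_field:
            return [name]
    return []


print(unique_key_is_valid(["COL1"], ["COL2", "COL1"]))          # True
print(unique_key_is_valid(["COL1", "COL3"], ["COL1", "COL2"]))  # False
print(default_partitioning_key(["ID", "REGION"], []))           # ['ID']
print(default_partitioning_key([], [("NOTES", True), ("ID", False)]))  # ['ID']
```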
Why choosing the right partitioning key is important
Choosing the right partitioning key is critical for two reasons:
- It improves the performance of queries that run against hash-partitioned tables
- It balances the storage requirements for all partitions
Data balancing
Data balancing refers to the relative number of records stored on each individual database partition. Ideally, each database partition in a hash-partitioned table should hold the same number of records. If records are stored unequally across the database partitions, it can result in disproportional storage requirements and performance problems. The performance problems in this scenario result from the fact that the query work is done independently on each database partition, but the results are consolidated by the coordinating agent, which must wait until all database partitions return a result set. In other words, the total performance is tied to the performance of the slowest database partition.
Table data skew refers to a difference between the number of records in a table on particular database partitions and the average number of records across all database partitions for this table. So, for example, if the table data skew on database partition 1 is 60% for a particular table, it means that this database partition contains 60% more rows from this table than the average database partition.
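The definition translates directly into a calculation. The sketch below is a minimal Python illustration with made-up row counts; skew_percentages is a hypothetical helper name, not one of the routines discussed later in this article.

```python
def skew_percentages(rows_per_partition):
    """Return each partition's skew relative to the average row count, in percent."""
    average = sum(rows_per_partition.values()) / len(rows_per_partition)
    return {
        partition: (rows - average) * 100 / average
        for partition, rows in rows_per_partition.items()
    }


# Made-up counts: the average is 100 rows per partition.
counts = {1: 160, 2: 80, 3: 80, 4: 80}
skew = skew_percentages(counts)
print(skew)  # {1: 60.0, 2: -20.0, 3: -20.0, 4: -20.0}
```

Here database partition 1 holds 60% more rows than the average partition, which matches the 60% example in the text.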
From the best practices perspective, the table data skew on every individual database partition should be no more than 10%. To achieve this goal, the partitioning key should be selected on the columns that have high cardinality, or in other words, that contain a large number of distinct values.
If your table statistics are up to date, you can quickly and inexpensively check the cardinality of the columns in your existing table by issuing the following statement:
Listing 1. Checking the cardinality of the columns in an existing table
Collocation
Collocation between two joined tables in a query means that the matching rows of the two tables always reside in the same database partition. If the join is not collocated, the database manager must ship the records from one database partition to another over the network, which results in sub-optimal performance. The following requirements must be met for the database manager to use a collocated join:
- The joined tables must be defined in the same database partition group.
- The partitioning key for each of the joined tables must match. In other words, they must contain the same number and sequence of columns.
- For each column in the partitioning key of the joined tables, an equijoin predicate must exist.
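The three requirements can be read as a checklist. The following Python sketch models only the conditions listed above; TableInfo and join_is_collocated are hypothetical names, the table and column names are invented, and this is not how the DB2 optimizer is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    name: str
    partition_group: str
    partitioning_key: tuple  # column names, in key order


def join_is_collocated(left, right, equijoin_pairs):
    """Check the three listed requirements for a collocated join.

    equijoin_pairs is a set of (left_column, right_column) equality predicates.
    """
    # 1. Both tables must be in the same database partition group.
    if left.partition_group != right.partition_group:
        return False
    # 2. The partitioning keys must have the same number of columns, in sequence.
    if len(left.partitioning_key) != len(right.partitioning_key):
        return False
    # 3. Every pair of corresponding key columns needs an equijoin predicate.
    key_pairs = zip(left.partitioning_key, right.partitioning_key)
    return all(pair in equijoin_pairs for pair in key_pairs)


orders = TableInfo("ORDERS", "PDPG", ("CUST_ID",))
customers = TableInfo("CUSTOMERS", "PDPG", ("ID",))

print(join_is_collocated(orders, customers, {("CUST_ID", "ID")}))    # True
print(join_is_collocated(orders, customers, {("REGION", "REGION")}))  # False
```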
If you choose a partitioning key based on your query workload, the partitioning key should typically consist of either a joined column or a set of columns that is frequently used in many queries.
Although collocated tables typically achieve the strongest performance, it is not possible in practice to collocate all tables. In addition, it is not a good idea to select partitioning keys based on a handful of SQL statements. In decision-support environments, queries can often be unpredictable. In this kind of environment, you should examine your data model to determine the best choice for partitioning keys. The data model and the business relationship between tables can provide a more stable way of selecting a partitioning key than specific SQL statements.
When choosing partitioning keys, draw a data model that shows the relationships among the tables that are in your database. Identify frequent joins and high-use tables. Based on your data model, you should select partitioning keys that favor frequent joins and that are based on a primary key. Ideally, you should collocate frequently joined tables. Another strategy to improve the collocation of the join is to replicate smaller dimensional tables on each database partition.
Collocation compared to data balancing
In some cases, you might find that guidelines for choosing the proper partitioning key based on collocation and data balancing contradict one another. In such cases, it is recommended that you choose a partitioning key based on even data balancing.
Validate the partitioning keys on existing tables
If you want to validate how good your partitioning keys are, you should check to see if queries in your workload are collocated and if the data is balanced properly. It is also possible that over time, as your data changes, old partitioning keys become less optimal than they were previously. You can check the collocation in the query joins by looking at the access plan generated by DB2 Explain. If the query is not collocated, you typically will see the TQUEUE (table queue) operator feeding the join, as shown in Figure 1:
Figure 1. Explain graph that includes a TQUEUE operator
To check if the data in the table is balanced properly across the database partitions, you can run a simple count on your table grouped by the database partition ID with the help of the DBPARTITIONNUM function.
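A query of that shape might look like the one assembled below. This is a sketch that only builds the SQL text in Python; it does not connect to a database. rows_per_partition_query is a hypothetical helper, and TPCD.PART with column COL1 is borrowed from Example 4 purely as a placeholder. DBPARTITIONNUM takes any column of the table as its argument.

```python
def rows_per_partition_query(schema, table, column):
    """Build a row count per database partition for one table."""
    return (
        f"SELECT DBPARTITIONNUM({column}) AS PARTITION_NUM, COUNT(*) AS ROW_COUNT "
        f"FROM {schema}.{table} "
        f"GROUP BY DBPARTITIONNUM({column}) "
        f"ORDER BY PARTITION_NUM"
    )


query = rows_per_partition_query("TPCD", "PART", "COL1")
print(query)
```

Comparing the resulting counts against their average gives the per-partition skew.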
You can also use the custom ESTIMATE_EXISTING_DATA_SKEW stored procedure (available in the Download section), which provides more user-friendly output, including a list of database partitions, the skew percentage as compared to the average, and more. This routine can be run on a sample of the original data for faster performance. (See the Appendix for a full routine description.)
If you are planning to run this routine in a production environment, consider running it during a maintenance window or when the system is under a light load. You may also want to try it on one of the smaller tables with the sample value of 1% to get an estimate of how long it takes to return results. The total execution time is included at the bottom of the report.
Example 1
This example tests the data skew in a scenario in which the partitioning key was changed to S_NATIONKEY. This example uses only 25% of the data in the sampling. As you can see from the output, the data has some extensive skewing, and data volumes in some database partitions are 60% skewed.
Listing 2. Measuring the existing data skew for a single table
Example 2
This example demonstrates the usage of the wildcard character in the ESTIMATE_EXISTING_DATA_SKEW routine. Listing 3 shows a report for the existing data skew on all tables that have schema TPCD and a table name that starts with 'PART'. Since the tables are relatively large, the sample is built on 1% of the data to reduce the performance cost.
Listing 3. Measuring the existing data skew for multiple tables
Evaluate the quality of candidate replacement PKs on the existing table
If you decide to change an existing partitioning key, it is important to determine if the new partitioning key that you are considering will result in good query collocation and evenly distributed data.
To check for query collocation, it is recommended that you collect the queries that characterize your workload, place them in a file, and then run a db2advis report to get recommendations on the new partitioning keys:
You can also run a report based on the recently executed queries that still reside in the package cache using the following form of the db2advis utility:
Listing 4 provides an example db2advis output:
Listing 4. db2advis output
To check if the data would be properly balanced using the new partitioning key, you can use the ESTIMATE_NEW_DATA_SKEW routine that is also provided in the Download section. This routine creates a copy of your existing table with the new partitioning key and loads it partially or fully with the data from the original table. It then runs the same report as for the existing data skew estimation and, at the end, drops the copy table. Note that the table space containing the original table must be able to hold a minimum of 1% of the data from the original table, since the copy is created in the same table space.
Example 3
This example tests the data skew in a scenario in which the partitioning key was changed from S_NATIONKEY to S_ID. This example uses 100% of the data in the sampling. As this example demonstrates, the new partitioning key causes minimal data skew and is a much better choice than the original S_NATIONKEY key from Example 1.
Listing 5. Estimating the data skew for a new partitioning key
Change the PK while keeping the table online
A new routine in DB2 9.7 named ADMIN_MOVE_TABLE allows you to automatically change the partitioning key of a table while keeping the table fully accessible for reads and writes. In addition to the partitioning key change, this procedure can move the table to a different table space, change column definitions, and more.
Example 4
This example changes the partitioning key of the TPCD.PART table from COL1 to (COL2, COL3). It uses the LOAD option to improve the performance of the ADMIN_MOVE_TABLE routine.
Listing 6. Changing partitioning keys
While the ADMIN_MOVE_TABLE procedure is running, the TPCD.PART table is fully accessible and the change to the partitioning key is transparent to the end users.
Conclusion
Choosing appropriate partitioning keys is essential for optimizing database performance in a partitioned environment based on DB2 software. This article provided guidance and tooling for choosing the best partitioning keys based on your needs.
This article described:
- The concepts related to partitioning keys, and the rules and recommendations for creating partitioning keys
- Routines that can help you estimate the data skew for new and existing partitioning keys
- How to change partitioning keys while keeping the table fully accessible
Appendix: Routine reference documentation
Prerequisites
Both the ESTIMATE_EXISTING_DATA_SKEW and the ESTIMATE_NEW_DATA_SKEW procedures are supported in DB2 9.7 or later. The ADMIN_MOVE_TABLE routine that is used for the actual movement of the table ships with the core DB2 product in version 9.7 and later. For the ESTIMATE_NEW_DATA_SKEW routine, there must be enough free space in the table space that contains the original table to store the sampling data.
Deployment instructions
- Download and save the estimate_data_skew.sql file found in the Download section.
- Connect to the database from the command line and deploy the routines using the following command:
ESTIMATE_NEW_DATA_SKEW procedure
The ESTIMATE_NEW_DATA_SKEW routine estimates the data skew of individual database partitions on an existing table with a new partitioning key. To improve the performance and lower the storage requirements of this routine, the estimation can be based on a subset of the data using extremely fast sampling on the page level.
Syntax
Procedure parameters
csv_format (optional)
- This optional input parameter selects the format in which the data is returned. A value of 'Y' requests CSV format, with the data returned under the headings SCHEMA, TABLE, PARTITION, SAMPLE%, TABLEROWCOUNT, PARTAVG, PARTROWCOUNT and SKEW. The parameter defaults to 'N' for regular format.
in_tabschema
- This input parameter specifies the name of the schema that contains the table to be estimated for data skews. This parameter is case-sensitive and has a data type of VARCHAR(128). This parameter does not support wildcards.
in_tabname
- This input parameter specifies the name of the table to be estimated for data skews. This parameter is not case-sensitive and has a data type of VARCHAR(128). This parameter does not support wildcards.
new_partitioning_keys
- This input parameter specifies the new partitioning keys to be used in the estimation of data skews.
sampling_percentage
- This input parameter specifies the percentage of data to be used in the data skew estimation. Valid values are 1 to 100, where 100 means that the stored procedure will use all records in the table for the estimation. The purpose of this parameter is to improve performance and minimize the space usage when estimating data skew with new partitioning keys. If performance and disk space are not an issue, specify 100 for this value.
ESTIMATE_EXISTING_DATA_SKEW procedure
The ESTIMATE_EXISTING_DATA_SKEW stored procedure estimates the data skew of individual database partitions in one or more tables based on the existing partitioning keys. To improve the performance of this procedure, the estimation can be based on a subset of the data using extremely fast sampling on the page level.
Syntax
Procedure parameters
csv_format (optional)
- This optional input parameter selects the format in which the data is returned. A value of 'Y' requests CSV format, with the data returned under the headings SCHEMA, TABLE, PARTITION, SAMPLE%, TABLEROWCOUNT, PARTAVG, PARTROWCOUNT and SKEW. The parameter defaults to 'N' for regular format.
in_tabschema
- This input parameter specifies the name of the schema that contains the table to be estimated for data skews. This parameter is case-sensitive and has a data type of VARCHAR(128). This parameter supports % as a wildcard. If the NULL value is specified, a report will be run for all schemas defined in the database.
in_tabname
- This input parameter specifies the name of the table to be estimated for data skews. This parameter is not case-sensitive and has a data type of VARCHAR(128). This parameter supports % as a wildcard.
sampling_percentage
- This input parameter specifies the percentage of data to be used in the data skew estimation. Valid values are 1 to 100, where 100 means that the stored procedure will use all records in the table for the estimation.
Downloadable resources
- Sample SQL script for this article (estimate_data_skew.zip 4KB)
Related topics
- 'DB2 partitioning features' (developerWorks, August 2006): Get an introduction to some of the DB2 LUW table design features, such as table partitioning, multidimensional clustering (MDC), database partitioned tables, and materialized query tables (MQT).
- DB2 for Linux, UNIX, and Windows area on developerWorks: Get the resources you need to advance your DB2 skills.
- DB2 Information Center: Connect to the DB2 Information Center to get more information about partitioned database environments.
- 'ADMIN_MOVE_TABLE procedure - Move an online table'
- 'db2advis - DB2 design advisor command'
- DB2 9.7 for Linux, UNIX, and Windows: Download a free trial version of DB2 9.7 for Linux, UNIX, and Windows.
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.