Teradata Data Distribution
There are three keys to how Teradata spreads the
data among the AMPs. They are Primary Index, Primary Index, and Primary Index
The Primary Index
Teradata
takes each table and spreads the table rows across the AMPs. When the table
needs to be read, each AMP has to read only their portion of the table. If the
AMPs start reading at the same time and there are an equal amount of rows on
each AMP, then parallel processing works brilliantly. Alone an AMP can do so
little, but the AMPs working together can accomplish the incredible. This
brilliant feat begins with the Primary Index.
Each
table in Teradata is required to have a Primary Index. The biggest key to a
great Teradata Database Design begins with choosing the correct Primary Index.
The Primary Index will determine on which AMP a row will reside. Because this
concept is extremely important, let me state again that the Primary Index is
the only thing that will determine on which AMP a row will reside.
Many
people new to Teradata assume that the most important concept concerning the
Primary Index is data distribution. INCORRECT! The Primary Index does determine
data distribution, but even more importantly the Primary Index provides the
fastest physical path to retrieving data. The Primary Index also plays an
incredibly important role in how joins are performed. Remember these three
important concepts of the Primary Index and you are well on your way to a great
Physical Database Design.
The
Primary Index plays 3 roles:
· Data Distribution
· Fastest Way to Retrieve Data
· Incredibly important for Joins
The Two Types of Primary Indexes
Every
table must have at least one column as the Primary Index. The Primary Index is
defined when the table is created. There are only two types of Primary Indexes,
which are a Unique Primary Index (UPI) or a Non-Unique Primary Index (NUPI).
Unique Primary Index (UPI)
A
Unique Primary Index means that the values for the selected column must be
unique. If you try and insert a row with a Primary Index value that is already in
the table, the row will be rejected. A Unique Primary Index will always spread
the table rows evenly amongst the AMPs. Please don’t assume this is always the
best thing to do. Below is a table that has a Unique Primary Index. We have
selected EMP to be our Primary Index. Because we have designated EMP to be a
Unique Primary Index then there can be no duplicate employee numbers in the
table.
Employee Table
EMP DEPT LNAME FNAME SAL
UPI
1 100 JOHN CHRIS 10000
2 200 JOHN JACK 13000
3 300 LICE ANN 12000
4 300 BROWN MARY 30000
A
Unique Primary Index (UPI) will always spread the rows of the table evenly
amongst the AMPs.
Don’t
assume this is always the best choice!
Non-Unique Primary Index
A
Non-Unique Primary Index (NUPI) means that the values for the selected column
can be non-unique. You can have billions of the same values in the Primary
Index. A Non-Unique Primary Index will almost never spread the table rows
evenly. Please don’t assume this is always a bad thing. Below is a table that
has a Non-Unique Primary Index. We have selected LNAME to be our Primary Index.
Because we have designated LNAME to be a Non-Unique Primary Index we are
anticipating that there will be individuals in the table with the same last
name.
Employee Table
EMP DEPT LNAME FNAME SAL
NUPI
1 100 JOHN CHRIS 10000
2 200 JOHN JACK 13000
3 300 LICE ANN 12000
4 300 BROWN MARY 30000
A
Non-Unique Primary Index (UPI) will almost NEVER spread the rows of the table
evenly amongst the AMPs.
Don’t
assume this is a bad thing!
How Teradata Turns the Primary Index Value into the Row Hash
The
Primary Index is the only thing that determines where a row will reside. It is
important that you understand this process. Here are the fundamentals in the
simplest form. When a new row arrives into Teradata then:
Teradata
examines the Primary Index value for the row.
Teradata
takes that Primary Index value and runs it through a Hashing Formula.
The
output of the Hashing Algorithm (a.k.a., Formula) is a 32-bit Row Hash.
The
32-bit Row Hash will perform two functions:
1.
The 32-bit Row Hash will point to a certain spot on the Hash Map, which will
indicate which AMP will hold the row.
2.
The 32-bit Row Hash will always remain with the Row as part of a Row Identifier
(Row ID).
Hashing
is a mathematical process where an Index (UPI, NUPI) is converted into a
32-bit row hash value. The key to this hashing algorithm is the Primary
Index. When this value is determined, the output of this 32-bit value is
called the Row Hash.
A
new row is going to be inserted into Teradata. The Primary Index is the column
called EMP. The value in EMP for this row is 99. Teradata runs the value of 99
through the Hash Formula and the output is a 32-bit Row Hash. In his example
our 32-bit Row Hash output: 00001111000011110000111100001111.
Every
time employee 99 is run through the hash formula it returns the same Row Hash.
That Row Hash will point to the same Hash Bucket every time. That is how
Teradata knows which AMP will hold row 99. It does the math and it always gets
what it always got!
The Row is Delivered to the Proper AMP
Now
that we know that Employee 99 is to be delivered to AMP 4(For ex), Teradata packs up
the row, places the Row Hash on the front of the row, and delivers it to AMP 4.
REVIEW:
· A row is to be inserted into a
Teradata table
· The Primary Index Value for the Row
is put into the Hash Algorithm
· The output is a 32-bit Row Hash
· The Row Hash points to a bucket in
the Hash Map
· The bucket points to a specific AMP
· The row along with the Row Hash are
delivered to that AMP
The AMP will add a Uniqueness Value
When
the AMP receives a row it will place the row into the proper table, and the AMP
checks if it has any other rows in the table with the same row hash. If this is
the first row with this particular row hash the AMP will assign a 32-bit
uniqueness value of 1. If this is the second row hash with that particular row
hash, the AMP will assign a uniqueness value of 2. The 32-bit row hash and the
32-bit uniqueness value make up the 64-bit Row ID. The Row ID is how tables are
sorted on an AMP.
How Teradata Retrieves Rows
In
the example below a user runs a query looking for information on Employee 99.
The PE sees that the Primary Index Value EMP is used in the SQL WHERE clause.
Because this is a Primary Index access operation, the PE knows this is a one
AMP operation. The PE hashes 99 and the Row Hash is
00001111000011110000111100001111. This points to a bucket in the Hash Map that
represents AMP 4. AMP 4 is sent a message to get the Row Hash:
00001111000011110000111100001111 and make sure it’s EMP 99.
very good explanation
ReplyDeleteExcellent !!
ReplyDeletegreat !!
ReplyDeleteperfect
ReplyDeleteAwesome !!
ReplyDeleteFabulous
ReplyDelete