Search This Blog

Wednesday, May 15, 2013

Teradata Joining Strategies

Teradata Join Strategies are utilized by the optimizer to choose the least cost plan and better performance. The strategy will be chosen based on the available information to the Optimizer like Table size, PI information, Stats Information. 

Teradata join strategies are following:
·       Merge (Exclusion)
·       Nested
·       Row Hash
·       Product (including Cartesian Product joins)

Merge Join Strategies


There are four different merge join strategies based on the redistribution. Below details have all these information.

Merge Join Strategy 1

The 1st merge join will utilize the Primary Index on both tables in the join Condition.
The key here is that both of the Primary Index columns of each table are used in the WHERE or ON clause in the join type.



The inner join above focuses on returning all rows when there is a match between the two tables. The ON clause is extremely important because this join establishes the join (equality) condition. Think of this as if you were repairing a deck and you required a certain kind of wood in order to make sure you had a match. You would go down to the local home improvement store to locate the right pieces of wood to repair the deck. A match would definitely not be cedar if your deck required pressured pine.

To investigate further, each matching row is joined where Emp = Emp which is stated from the ON Clause in the JOIN. If we analyze further, EMP is the Primary Index for both tables. This first merge join type is extremely efficient because both columns in the ON clause are the Primary Indexes of their respective tables. When this occurs, NO data has to be moved into spool and the joins can be performed in what is called AMP LOCAL. Teradata can perform this join with rapid speed. Remember that the less data that has to be moved to complete a join the better the performance will be achieved. The end result of this join is that the rows from each table are physically joined together to form one row.

Merge Join Strategy 2

The previous join strategy is the best join scenario because it focuses on utilizing the Primary Index Column in both tables for the equality (ON) condition. The next best scenario is when a join is performed on a Primary Index column of one table to a non-primary indexed column of another table.

In this example, two tables are being joined based on the DEPT column. In the department table, the Primary Index column is DEPT. As we know, this is a good match based on the equality (ON) condition. However, the employee table has EMP as the Primary Index column. Which table do you think will move based on this join?
Regardless of the equality condition, the primary objective is to bring the rows together from each table on the same AMPs. There are several options that the Teradata Optimizer could choose in order to complete this task. The first option is to duplicate the smaller table on all AMPs. The second option could be to leave the department table that has an equality condition match on the Primary Index Column stationary on the AMP. The next step would be to move the rows from the Employee table into spool. This would be accomplished by hashing (locating) the columns in the employee table, and then moving these rows into spool to the appropriate AMPs where the department table rows reside. What has occurred here is that the stationary table is already sorted by the hash code and the spool table will be re-sorted by hash code in spool to the appropriate matching AMPs.

Clearly the second option is an excellent choice for the Teradata Optimizer because it reduces the amount of resources necessary to complete the join and improve performance.

Merge Join Strategy 3

The third scenario is where neither table is being joined on the Primary Index of either table. In this case Teradata will redistribute both tables into spool and sort them by hash code. When we want to redistribute and sort by hash code we merely hash the non-primary index columns and move them to the AMPs spool where they are sorted by hash. Once this is accomplished, then the appropriate rows are together in spool on all the AMPs. It is now join time!

In the previous example, the columns in the join equality are MgrEmp = MgrNo. The Primary Index of the department table is DEPT and the Primary Index for the manager table is LOC. In this case, both columns being utilized in this join equality are not part of the Primary Index columns. So what strategy will Teradata take to resolve this join operation?
Basically rows from both tables will need to be rehashed and redistributed into SPOOL.
The reason is because neither columns selected in the ON Clause are the Primary Index of the respective tables. Therefore, both tables are redistributed based on the ON clause columns.

The next step in this process is to redistributed the rows and locate them to the matching AMPs. When this is completed, the rows from both tables will be located in two different spools. Lastly, the rows in each spool will be joined together to bring back the matching rows. This type of join strategy is extremely inefficient. It consumes a ton of resources and time to manage and assemble this type of join.

Merge Join Strategy 4

The fourth merge join strategy is called the big table - small table join. If one of the tables being joined is small, then Teradata may choose a strategy that will duplicate the smaller table across all the AMPs. The key about this strategy is that regardless if the table is part of the Primary Index Column or not Teradata could still choose to duplicate the table across all the AMPs.


In this inner join above, the two tables involved in the join are the Employee table and the Department table. The DEPT column is the join equality that is making the match between the two tables. The DEPT column is the Primary Index Column in the Department table. The Employee table has the EMP column as the Primary Index. The final analysis of this join is that the Department table is small and makes a good candidate for this type of join strategy.
In order to join these two tables together, the first step is to get the rows together on the same AMP. In this case, since the Department table is small, Teradata will choose to duplicate the entire Department table on each AMP into spool. Once this is completed, then the next step is for the AMPs to join the base Employee rows with the Department rows.
`
Instead of redistributing the larger Employee table, which is not part of the Primary Index Column in the equality (ON) condition, Teradata will choose a more efficient strategy. This strategy would be to duplicate the smaller table across all the AMPs (Big Table -Small Table Join). This merge join strategy will consume minimal resources, and allow for Teradata to excel.

Nested Join

A nested join strategy is probably the most precise join available. This join is designed to utilize a unique index type (Either Unique Primary Index or Unique Secondary Index) from one of the tables in the join statement in order to retrieves a single row. It then matches that row to one or more rows on the other table being used in the join.

From the example above, the nested join has the join equality (ON) condition based on the DEPT column. The dept column is the Primary Index Column on the department table. In addition, the dept column is the Secondary Index Column in the employee table. Based on this information above, which rows will move?
Keep in mind that the nested join prides itself on being able to move a single row into spool and then matching that row with another table that contains several matches. How is this done? Analysis of this join statement indicates a new clause has been added to this join statement. This is known as the WHERE option. When utilized, the WHERE option allows for a single row to be retrieved from a table. In addition, a nested join will always use a unique index to isolate that single record and then join that record to another table. The other table may use an index or it may not. However, the best practice is to always use columns that have indexes when doing joins. Teradata has superior knowledge on indexed columns and can utilize this information to choose an aggressive strategy to complete a join. Utilization of indexes in join statements will improve performance and utilize less resources as the below diagram illustrates.


Since there is only one row in the department table that has a match for department =10, which is based on the AND option in the join statement, the Teradata Optimizer will choose a path to move the department table columns into spool and duplicate them across all the AMP’s.
Once this is completed, then the matches will proceed with that single record (10 and SALES) to the second table, which did not move from the base AMP. Nested Joins are great in an OLTP Environment because of the usage of both Unique and Non-Unique Indexes. In addition, a nested join can reduce the resources necessary to complete the join. Finally, nested joins, which are similar to all the join strategies discussed, must have an equality condition in the ON Clause (d.dept = e.dept).

Hash Join

The Hash Join is part of the Merge Join Family. Remember that the key to a Merge Join is based on an equality condition such as E.Dept = D.Dept in the ON clause of the join statement. A Hash Join can only take place if one or both of the tables on each AMP can fit completely inside the AMP’s memory.
Hash Join Strategy

In this example, the Hash Join has a join equality (ON) condition based on the EMP and MGREMP Columns. The key point here is the columns do not necessarily have to the same name when doing a join operation. The columns names can be different but the row information has to be similar in order for the match to work. Both the EMP and MGREMP columns have the same type of information so therefore a join based on these column names will be successful. In addition, EMP column is the Primary Index column on the employee table. However, the MGREMP column is not an index column in the department table. Based on this information above, which rows will move?
Remember that the key to determining a Hash Join is if the SMALLER TABLE can be held completely in each AMP’s MEMORY.


The Hash Join process is where the smaller table is sorted by row hash and duplicated on every AMP. The key here is that the smaller table is required to be held completely in each AMP’s memory. Teradata will use the join column of the larger table in order to search for a match. The row hash join is extremely efficient because it eliminates the sorting, redistribution, and or copying of the larger table into spool. In addition, the rows that are duplicated into the AMP’s memory yield increased performance because the rows never go into spool. Rows that go into spool always have to involve disk activity. AMP memory does not involve disk interaction, which automatically increases performance. Hash Joins and Nested Joins are both Great in an OLTP Environment for these reasons.

Exclusion Join

All of the joins that we have reviewed up to this point were based on finding matching rows based on a join equality condition. The returned rows from these types of joins compared rows from both tables in a join and then returned rows that matched. In addition, these joins were inclusive. When working with exclusion joins the thought process has to be reversed. Exclusion Joins have one primary function. They exclude rows during a join. The best example here is when I was out with my best friend and his two sons (Blake and Zach). They were telling a story that sounded believable about their dog Freddie who allegedly chased a neighbor’s cat down the street. When they completed the story, I asked them if this was true…they said in unison “NOT”! Well the NOT statement works exactly the same in Teradata. When you put a NOT in front of a statement it will give you the opposite answer.

SELECT EMP, DEPT, NAME 
FROM 
EMPLOYEETABLE
WHERE DEPT=10 and 
EMP NOT IN (SELECT MGREMP from DEPTTABLE WHERE MGREMP IS NOT NULL )

As you can see in the above example and as has been discussed above this type of join utilizes the NOT IN statement. Exclusion joins are used for finding rows that don’t have a matching row in the other table. Queries with the NOT IN operator are the types of queries that always result in exclusion joins. In this case, this query will find all the employees who belong to department 10 who are NOT managers.
These joins will always involve a Full Table Scan because Teradata will need to compare every record to eliminate rows that will need to be excluded. With this being stated, this type of join can be resource intensive if the two tables in this comparison are large.
In addition, the biggest problem with the Exclusion Joins is when the NOT IN statement is used. The reason for this is that NULLs are considered unknowns so the data returned in the answer will be NULLs. There are two ways to correct this:
·       Define NOT IN columns as NOT NULL on the CREATE TABLE.
·       Add the “ AND WHERE Column IS NOT NULL” to the end of the JOIN as seen in the above example.

Product Joins

What Makes Product Joins Different

Product Joins compare every row of one table to every row of another table. They are called product joins because they are a product of the number of rows in table one multiplied by the number of rows in table two. For example, if one table had five rows and the other table had five rows then the Product Join would compare 5 x 5 or 25 rows with a potential of 25 rows coming back.

SELECT E.EMP,D.DEPT 
FROM EMPLOYEETABLE E,DEPTTABLE D
WHERE
EMP LIKE '_b%'
About 99% of the time, product joins are major mistakes. The recommendation is to avoid these types of queries whenever possible. The reason is because all rows in both tables will be compared. Remember, Teradata tables have the potential to contain millions of rows. If a user accidentally writes a product join against two tables that have 1 million rows each. The result set would return One Trillion Rows (1000000000000)! Needless to say this is a mistake you want to make. So how do you avoid writing a product join?
To avoid a product join, check your syntax to ensure that the join is based on an EQUALITY condition. In the join syntax example above, the equality statement reads “WHERE EMP Like ‘_b%’”. Because this clause is not based on a common domain condition between the two tables (i.e., e.dept = d.dept), the result is a product join. Another cause of a product join is when aliases are not used after being established. Finally check your join syntax to ensure the WHERE clause is not missing.

Cartesian Product Join

Just as discussed with Product Joins above, a Cartesian Product Join is usually something you want to avoid. If we decided to run this query in Michigan this would be call the big “Mistake on the Lake”. A Cartesian Product Join will join every row in one table to every row in another table. The only thing that decides the number of rows will be the total number of rows from both tables. If one table had 5 rows and another had 10 rows then you will always get 50 rows returned. Imagine this situation. if we have a table with 10 million rows and another with 25 million rows and a Cartesian product join is written (by accident) against these two tables. What will be the result? Well, based on the example above, you will get back about 25 Trillion Rows (250000000000000)! This is definitely NOT the correct answer this user would want. This is why spool space limitations are utilized.


SELECT E.EMP,D.DEPT 
FROM EMPLOYEETABLE E,DEPTTABLE D;
About 99% of the time, a Cartesian Product Join is a major problem. The recommendation is avoid these types of queries whenever possible. The reason is because all rows in both tables will be joined. So how do you avoid writing a Cartesian Product Join?
To avoid a Cartesian Product Join, check your syntax to ensure that the join is based on an EQUALITY condition. In the join syntax example above, the WHERE clause is missing. Because this clause is missing, a common domain condition between the two tables (i.e., e.dept = d.dept) does not exist, the result is a product join. Another cause of a product join is when aliases are not used after being established.

14 comments:

  1. This is simply awesome post... Thanks for sharing ....

    ReplyDelete
  2. One of the best post i have came across..

    ReplyDelete
  3. Fantastic explanation..

    ReplyDelete
  4. awesome , Now I can relate things with big teradata books

    ReplyDelete
  5. Excellent. Thanks mate.

    ReplyDelete
  6. Thanks for the details explanation in simple words.

    Thanks a lot

    ReplyDelete
  7. Thanks.This is really excellent. I want to know the scenario when product join is userful

    ReplyDelete
  8. This comment has been removed by the author.

    ReplyDelete
  9. How the Hash Join is different from Merge Join strategy 4? As per your explanation, they both looks same.

    ReplyDelete
  10. Thank you for sharing valuable information.This article is very useful for me valuable info about
    Teradata Online Training.keep updating.........

    ReplyDelete