
Wednesday, May 15, 2013

Teradata Utilities



This article describes the various Teradata utilities and their usage.

BTEQ:

BTEQ is a batch-mode utility for submitting SQL requests to the Teradata database. It runs on every supported platform, from laptop to mainframe, and can both export data to and import data from the client system.

Instead of checking error code or error level after each statement, use the following setting at the beginning of the script. This sets the error severity level at which BTEQ terminates to 1.
.MAXERROR 1

Thus, any error condition which results in an error level of 1 or greater will cause the program to terminate, returning the error code that caused the MAXERROR condition to trigger. (MAXERROR is NOT the maximum number of errors allowed).

Under certain conditions, it may be desirable to have certain error conditions not trigger the session to exit. Such an example is dropping a table or view, but not wanting to abort the session if that table/view does not exist. This can be accomplished by having the DROP statement as the first statement and .MAXERROR 1 as the second statement.

If you need to have the DROP statement after some other SQL statement (i.e. it cannot be the first statement), the method to accomplish this is to adjust the error level associated with the error code:
.SET ERRORLEVEL 3807 SEVERITY 0
Then to reset the error level to be checked again you do the following:
.SET ERRORLEVEL 3807 SEVERITY 8
(8 is the default error level of the 3807 error, table/view does not exist).
You can find ERRORLEVEL for specific ERRORCODE in the BTEQ manual.
If you are going to use this functionality, it is very important to reset the ERRORLEVEL back to its default before processing any statement where you want that error condition to abort the script.
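A minimal sketch combining these settings (the logon string and object names are illustrative):

.LOGON tdpid/username,password
.SET ERRORLEVEL 3807 SEVERITY 0
DROP TABLE mydb.stage_table;
.SET ERRORLEVEL 3807 SEVERITY 8
.MAXERROR 1
CREATE TABLE mydb.stage_table
( emp_id INTEGER,
  emp_name VARCHAR(30))
PRIMARY INDEX (emp_id);
.LOGOFF
.QUIT

The DROP no longer aborts the script if the table is missing, while any later error (including a failed CREATE) terminates BTEQ with a non-zero return code.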
If you need to implement simple loops in a BTEQ script, you can use the repeat facility: placing the command = n on a line after a SQL request resubmits that request n more times (n is how many times you want to repeat), and .REPEAT n before a request has a similar effect. For conditional logic, the ACTIVITYCOUNT status variable can be tested with .IF ... THEN to skip or run later statements.
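For example, a minimal sketch (table names are illustrative) that executes the same INSERT three times in total, once as submitted and twice via the repeat:

INSERT INTO mydb.run_log
SELECT CURRENT_TIMESTAMP(0), COUNT(*) FROM mydb.stage_table;
= 2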

BTEQ Export

All BTEQ export processes should use the CLOSE option of the .EXPORT command and .SET RETRY OFF to ensure that the process aborts immediately upon a DBMS restart. Otherwise, the export will reconnect its sessions when Teradata becomes available again, retransmitting rows that were already sent.
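A minimal sketch of such an export, following these recommendations (the logon string, file and table names are illustrative):

.LOGON tdpid/username,password
.SET RETRY OFF
.EXPORT DATA FILE = employee.dat, CLOSE
SELECT emp_id, emp_name FROM mydb.employee;
.EXPORT RESET
.LOGOFF
.QUIT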

BTEQ Import

All BTEQ import processes populating empty tables should be preceded by a DELETE of the target table, for restartability. Import will not automatically reconnect sessions after a Teradata restart; the job must be manually restarted.
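A minimal restartable import sketch under the same conventions (names are illustrative; the USING clause must match the record layout of the input file):

.LOGON tdpid/username,password
DELETE FROM mydb.employee;
.IMPORT DATA FILE = employee.dat
.QUIET ON
.REPEAT *
USING (emp_id INTEGER, emp_name VARCHAR(30))
INSERT INTO mydb.employee (emp_id, emp_name)
VALUES (:emp_id, :emp_name);
.LOGOFF
.QUIT

.REPEAT * keeps applying the INSERT until the import file is exhausted.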


FASTLOAD:


Purpose: Loads large amounts of data from an external file into an empty table at high speed.

Restrictions:
Only one empty table can be loaded per FastLoad job.
Tables defined with referential integrity, secondary indexes, join indexes, hash indexes or triggers cannot be loaded with FastLoad.
Duplicate rows cannot be loaded into a multiset table with FastLoad.

FastLoad has two phases, Data Acquisition and Application, and requires a separate error table for each phase. Use the BEGIN LOADING statement to specify (and create) the two error tables for the operation. The error tables can be in the same database as the data table or in a different database.

BEGIN LOADING [dbname.]table_name
      ERRORFILES [dbname.]Err_Table1, [dbname.]Err_Table2
      [CHECKPOINT integer]
      [INDICATORS];
.
.
.
.
END LOADING;
LOGOFF;

We must remove the error tables before we re-run the same load job or it will terminate in an error condition.

The CHECKPOINT option defines points in a job where FastLoad pauses to record that Teradata has processed a specified number of input records. When we use checkpoints, we do not have to re-run the entire FastLoad job if it stops before completion; FastLoad uses the checkpoint information in the restart log table to determine the restart location. Checkpoints slow FastLoad processing, so set the CHECKPOINT large enough that checkpoints are taken every 10 to 15 minutes.

An ERRLIMIT count should be specified on all FastLoad utility control cards to pause the load when an unusually high number of constraint violations are directed to error table 1. For large tables (such as fact tables), this number should be approximately 1000. The ERRLIMIT for smaller tables, such as dimension tables, would be table specific.

END LOADING indicates that all data rows have been transmitted and the second phase can begin. Its omission means the load is incomplete and will be restarted later; this leaves the table being loaded in a "FastLoad Paused" state, and the table cannot be accessed via SQL. If END LOADING was omitted by mistake, submit BEGIN LOADING and END LOADING statements; this restarts phase 2 only.

We can specify the maximum and minimum number of sessions to be used by the FastLoad script. Specify the maximum value equal to the number of AMPs; if we specify a value larger than the number of available AMPs, FastLoad limits the sessions to one per working AMP. The minimum is optional and defaults to 1.
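Putting the pieces together, a minimal FastLoad sketch following the guidelines above (the logon string, file, database and column names are illustrative):

SESSIONS 8;
ERRLIMIT 1000;
LOGON tdpid/username,password;
SET RECORD VARTEXT "|";
DEFINE emp_id   (VARCHAR(10)),
       emp_name (VARCHAR(30))
FILE = employee.dat;
BEGIN LOADING mydb.employee
      ERRORFILES mydb.emp_err1, mydb.emp_err2
      CHECKPOINT 1000000;
INSERT INTO mydb.employee
VALUES (:emp_id, :emp_name);
END LOADING;
LOGOFF;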

FAST EXPORT:

Purpose: Exports large volumes of formatted data from Teradata to a host file or user-written application.

Restrictions (the following are not supported in the SELECT):
Equality conditions on a Primary Index or USI (single-row access).
The WITH option to generate total or subtotal response rows.


The FastExport job may consist of multiple SELECT statements, which will be executed sequentially by FastExport.
In the .EXPORT statement, we need to specify the definition of the output data file. The data file formats supported are:
TEXT – fixed-width format
VARTEXT – character-delimited format
BINARY – binary format, a 2-byte length indicator followed by n bytes of data

The data can be exported in sorted order, but note that requesting sorted data adds additional work (overhead and time) for Teradata.

We can specify the maximum and minimum number of sessions to be used by the FastExport script. Specify the maximum value equal to the number of AMPs; if we specify a value larger than the number of available AMPs, FastExport limits the sessions to one per working AMP. The minimum is optional and defaults to 1.
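A minimal FastExport sketch (the logtable, file and table names are illustrative; the CASTs give a clean fixed-width TEXT record):

.LOGTABLE mydb.fexp_log;
.LOGON tdpid/username,password;
.BEGIN EXPORT SESSIONS 8;
.EXPORT OUTFILE employee.dat MODE RECORD FORMAT TEXT;
SELECT CAST(emp_id AS CHAR(10)),
       CAST(emp_name AS CHAR(30))
FROM mydb.employee;
.END EXPORT;
.LOGOFF;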


MULTI LOAD:


Purpose: Used for loading, updating or deleting data to and from populated tables, typically with batch inputs from a host file.

Restrictions:
Cannot process tables defined with USIs, Referential Integrity, Join Indexes, Hash Indexes, or Triggers.
No data retrieval capability.
Import tasks require use of Primary Index.


MultiLoad supports up to five target tables per script. Tables may contain pre-existing data, and MultiLoad can perform INSERTs, UPDATEs, DELETEs and UPSERTs.

In the BEGIN MLOAD statement, define one work table for each target table. If work tables are not defined, they are created with the table name prefixed by 'WT_'. Make sure that the database where the work tables are defined has enough space; MultiLoad loads the data into the work table before applying it to the target table's data blocks.

Two error tables need to be defined for each target table. If they are not defined in the BEGIN MLOAD, they default to an 'ET_' and 'UV_' prefix plus the table name. 'ET_tablename' is the acquisition-phase error table and 'UV_tablename' is the application-phase error table.

BEGIN MLOAD
    TABLES tname1, tname2,…
    WORKTABLES wt_table1, wt_table2,…
    ERRORTABLES et_table1 uv_table1, et_table2 uv_table2, …
    [CHECKPOINT integer]
    [INDICATORS];
.
.
.
.
END MLOAD;

Use the RELEASE MLOAD tablename; statement if an MLOAD does not complete successfully and there is no desire to restart it. This returns the target table to general availability. Error tables should be dropped manually, as RELEASE MLOAD will not drop them automatically.

In the .IMPORT statement, we need to specify the definition of the input data file. The data file formats supported are:
TEXT – fixed-width format
VARTEXT – character-delimited format

We can specify the maximum and minimum number of sessions to be used by the MLOAD script. Specify the maximum value equal to the number of AMPs; if we specify a value larger than the number of available AMPs, MLOAD limits the sessions to one per working AMP. The minimum is optional and defaults to 1.
MultiLoad jobs may or may not benefit from dropping and recreating indexes. This depends on the size of the table, the number of indexes, and the size of the file being multiloaded. Tests will need to be conducted to determine the method that performs best. If drop/create index is not done, these members should exist, but contain comments only.

An ERRLIMIT count should be specified on all MultiLoad utility control cards to pause the load when an unusually high number of constraint violations are directed to the 'ET' error table. For large tables (such as fact tables), this number should be approximately 1000. The ERRLIMIT for smaller tables, such as dimension tables, would be table specific.

Multiload Tips and Techniques

MultiLoad is a Teradata load utility that is used mostly to append rows to a table. It is the fastest way to do this for populated tables, but is substantially slower than FastLoad. Although runtime is affected by the number of duplicate primary index values, MultiLoad runs approximately 2-3 times slower than FastLoad. A sample runtime on a 2-node, 12-AMP NCR box: 1,000,000 300-byte rows in 1 hour (unique primary index).
MultiLoad usually runs faster when secondary indexes are dropped prior to the load and recreated afterwards. MultiLoad does not support unique secondary indexes; they must be dropped prior to a MultiLoad.
MultiLoad restart options depend on whether the job abends in the acquisition or the application phase. The syslist will indicate which phase it is in, but if you are unsure, run the RELEASE MLOAD command; if the job is in the application phase, this command will not be accepted. Please see the MultiLoad restart procedures for more details on restart options.
MultiLoad can also be used to perform a large number of deletes in an efficient manner, using the MultiLoad DELETE task.
Any operation performed using MLOAD cannot be rolled back.
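Putting the pieces together, a minimal MultiLoad import-task sketch (the logtable, layout, file and table names are illustrative):

.LOGTABLE mydb.ml_log;
.LOGON tdpid/username,password;
.BEGIN MLOAD TABLES mydb.employee
    WORKTABLES mydb.wt_employee
    ERRORTABLES mydb.et_employee mydb.uv_employee
    ERRLIMIT 1000;
.LAYOUT emp_layout;
.FIELD emp_id   * VARCHAR(10);
.FIELD emp_name * VARCHAR(30);
.DML LABEL ins_emp;
INSERT INTO mydb.employee (emp_id, emp_name)
VALUES (:emp_id, :emp_name);
.IMPORT INFILE employee.dat
    FORMAT VARTEXT '|'
    LAYOUT emp_layout
    APPLY ins_emp;
.END MLOAD;
.LOGOFF;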




Tuesday, May 14, 2013

Teradata Interview Questions




Hi Everyone, 

This post aims to collect the important and frequently repeated questions that are asked in interviews for Teradata positions. Instead of just giving answers, I have explained most of the concepts in detail for better understanding, so that you can still answer when a question is a variation of one below. Please post a comment if you need more details on any concept, or an answer to any other Teradata question. I will keep updating this post and will add new sections if you are interested. ALL THE VERY BEST and Happy Reading.

TERADATA INTERVIEW QUESTIONS AND CONCEPTS


What is Teradata & how does it differ from other conventional Systems?

Teradata is a relational database that can store billions of rows and petabytes (1 petabyte = 1000 terabytes) of data. The architecture of the system makes this possible and provides the flexibility to access and process the data faster.
It differs from conventional database systems in everything from its architecture to its processing speed. Easy scalability and fault tolerance keep demand for the system high.

What are the main components of Teradata System?

Teradata has three main components which do wonders for the world of data management and storage:

1. PE (Parsing Engine): Acts as a gatekeeper to the Teradata system. It manages all sessions, checks the SQL statements for errors, manages the access rights of the user, defines the least expensive plan for the query to execute, and sends the request to the AMPs via the BYNET.
2. Message Passing Layer (BYNET): Carries messages between the AMPs and PEs, provides point-to-point and broadcast communications, merges answer sets back to the PE, and makes Teradata parallelism possible.
3. AMP (Access Module Processor): The AMP is the heart of Teradata and does most of the work for data storage and retrieval. It takes care of finding the rows requested, lock management of tables and rows, sorting rows, aggregating columns, join processing, etc.

What are Primary Index and Primary Key in Teradata?

Unlike other database systems, Teradata distributes data based on the PI (Primary Index). The PI is defined at the time of table creation; if it is not mentioned explicitly, the database automatically takes the first column as the PI.
Since the data distribution is based on the PI, it is wise to choose a PI that evenly distributes the data among the AMPs.
For example, if Table A has two columns like below and we have 5 AMPs in the System.
ID            Gender
1              Male
2              Male
3              Male
4              Male
5              Female

If we choose ID as the PI, since the values are distinct, all 5 rows are distributed evenly across all 5 AMPs. But if GENDER is chosen as the PI, we have only 2 distinct values, and the data will be stored on only 2 AMPs, leaving the other 3 AMPs empty and idle.
Note: Rows with the same PI value are stored on the same AMP.
The Primary Key is a concept that uniquely identifies a particular row of a table.

What are the types of PI (Primary Index) in Teradata?

There are two types of Primary Index: Unique Primary Index (UPI) and Non-Unique Primary Index (NUPI). By default a NUPI is created when the table is created; the UNIQUE keyword has to be given explicitly to create a UPI.

A UPI can sometimes slow performance, because the uniqueness of the column value has to be checked for each and every row, which is additional overhead for the system, but the distribution of data will be even.
Care should be taken when choosing a NUPI so that the distribution of data is almost even. The UPI/NUPI decision should be taken based on the data and its usage.
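For example, hypothetical DDL for the two kinds of index (the database, table and column names are illustrative):

CREATE TABLE mydb.employee
( emp_id INTEGER,
  gender CHAR(1))
UNIQUE PRIMARY INDEX (emp_id);    -- UPI: even distribution, uniqueness enforced

CREATE TABLE mydb.employee_skewed
( emp_id INTEGER,
  gender CHAR(1))
PRIMARY INDEX (gender);           -- NUPI: only 2 distinct values, badly skewed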

How to Choose Primary Index(PI) in Teradata?

Choosing a Primary Index is based on Data Distribution and Join frequency of the Column. If a Column is used for joining most of the tables then it is wise to choose the column as PI candidate.
For example, We have an Employee table with EMPID and DEPTID and this table needs to be joined to the Department Table based on DEPTID.

It is not a wise decision to choose DEPTID as the PI of the employee table. The reason is that the employee table will have thousands of employees, whereas the number of departments in a company will be fewer than 100. Choosing EMPID therefore gives better performance in terms of distribution.

How the data is distributed among AMPs based on PI in Teradata?

•       Assume a row is to be inserted into a Teradata table
•       The Primary Index Value for the Row is put into the Hash Algorithm
•       The output is a 32-bit Row Hash
•       The Row Hash points to a bucket in the Hash Map; the first 16 bits of the Row Hash are used to locate that bucket
•       The bucket points to a specific AMP
•       The row, along with its Row Hash, is delivered to that AMP

When the AMP receives a row, it places the row into the proper table and checks whether it has any other rows in that table with the same row hash. If this is the first row with this particular row hash, the AMP assigns a 32-bit uniqueness value of 1; if it is the second row with that row hash, the AMP assigns a uniqueness value of 2. The 32-bit row hash and the 32-bit uniqueness value make up the 64-bit Row ID, which is how tables are sorted on an AMP.

This uniqueness value is useful in the case of NUPIs to distinguish rows that share the same NUPI value.
Access by either a UPI or a NUPI is always a one-AMP operation, as rows with the same value are stored on the same AMP.


How Teradata retrieves a row?

For example, a user runs a query looking for information on Employee ID 100. The PE sees that the Primary Index value (the Employee ID) is used in the SQL WHERE clause.
Because this is a Primary Index access operation, the PE knows this is a one-AMP operation. The PE hashes 100, and the Row Hash points to a bucket in the Hash Map that represents AMP X. AMP X is sent a message to retrieve the row with that Row Hash and verify that it is Employee ID 100.

What are Secondary Indexes (SI) , types of SI and disadvantages of Secondary Indexes in Teradata?

Secondary Indexes provide another path to access data. Teradata allows up to 32 secondary indexes per table. Keep in mind: rows are not redistributed when secondary indexes are defined. The value of secondary indexes is that they reside in a subtable and are stored on all AMPs, which is very different from how the primary index (part of the base table) is stored. Keep in mind that Secondary Indexes (when defined) do take up additional space.

Secondary Indexes are frequently used in a WHERE clause. The Secondary Index can be changed or dropped at any time. However, because of the overhead for index maintenance, it is recommended that index values should not be frequently changed.

There are two different types of Secondary Indexes, Unique Secondary Index (USI), and Non-Unique Secondary Index (NUSI). Unique Secondary Indexes are extremely efficient. A USI is considered a two-AMP operation. One AMP is utilized to access the USI subtable row (in the Secondary Index subtable) that references the actual data row, which resides on the second AMP.

A Non-Unique Secondary Index is an All-AMP operation and will usually require a spool file. Although a NUSI is an All-AMP operation, it is faster than a full table scan.

Secondary indexes can be useful for:
•       Satisfying complex conditions
•       Processing aggregates
•       Value comparisons
•       Matching character combinations
•       Joining tables

How are the data distributed in Secondary Index Subtables in Teradata?

When a user creates a Secondary Index, Teradata automatically creates a Secondary Index Subtable. The subtable will contain the:
•       Secondary Index Value
•       Secondary Index Row ID
•       Primary Index Row ID

When a user writes an SQL query that has an SI in the WHERE clause, the Parsing Engine will Hash the Secondary Index Value. The output is the Row Hash, which points to a bucket in the Hash Map.
That bucket contains an AMP number and the Parsing Engine then knows which AMP contains the Secondary Index Subtable pertaining to the requested USI information. 

The PE will direct the chosen AMP to look-up the Row Hash in the Subtable. The AMP will check to see if the Row Hash exists in the Subtable and double check the subtable row with the actual secondary index value. Then, the AMP will pass the Primary Index Row ID back up the BYNET network. This request is directed to the AMP with the base table row, which is then easily retrieved.

What are the types of JOINs available in Teradata?

Types of JOINs are  :  Inner Join, Outer Join (Left, Right, Full), Self Join, Cross Join and Cartesian Joins.

The key things to know about Teradata and Joins
•       Each AMP holds a portion of a table.
•       Teradata uses the Primary Index to distribute the rows among the AMPs.
•       Each AMP keeps its tables separated from other tables, like someone might keep clothes in separate dresser drawers.
•       Each AMP sorts its tables by Row ID.
•       For a JOIN to take place the two rows being joined must find a way to get to the same AMP.
•       If the rows to be joined are not on the same AMP, Teradata will either redistribute the data or duplicate the data in spool to make that happen.

What are the types of Join Strategies available in Teradata?

Join Strategies are used by the optimizer to choose the best plan to join tables based on the given join condition.
•       Merge (Exclusion)
•       Nested
•       Row Hash
•       Product (including Cartesian Product joins)

There are different types of merge join strategies available. But in general, while joining two tables, the data will be redistributed or duplicated across the AMPs to make sure the joining rows are on the same AMP.

If the two tables are joined on their PIs, no redistribution or duplication happens, as the rows are already on the same AMP, and performance is better. If one table's PI is used and the other table's PI is not, redistribution or duplication of one table happens, chosen based on table size. In these cases Secondary Indexes can be helpful.

Explain types of re-distribution of data happening for joining of columns from two tables in Teradata?

Case 1 - P.I = P.I joins
Case 2 - P.I = non Index joins
Case 3 - non Index = non Index joins

Case 1 - There is no redistribution of data over the AMPs. AMP-local joins happen, as the data is already present on the same AMP and need not be redistributed. These joins on a unique primary index are very fast.
Case 2 - Data from the second table will be redistributed on all AMPs, since the join is on a PI vs. a non-indexed column. The ideal scenario is when the small table is redistributed to be joined with the large table's records on the same AMP.
Case 3 - Data from both tables is redistributed on all AMPs. This is one of the longest-processing kinds of query; care should be taken that stats are collected on these columns.

What is Partitioned Primary Index (PPI) in Teradata?

A partitioned primary index physically splits the table into a series of partitions, one for every partitioning value. When a single row is accessed, Teradata looks first at the partitioning value to determine the partition, then uses the primary index row hash to locate the row(s) within it.

For example, if we have a PPI on a MONTH column, the rows for a particular month are all kept within the same partition, and whenever data is accessed for a particular month, only that partition is read, so the data is retrieved faster.
It helps to avoid full table scans.
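A hypothetical example of such a table (the names and the date range are illustrative):

CREATE TABLE mydb.sales
( store_id  INTEGER,
  sale_date DATE,
  amount    DECIMAL(10,2))
PRIMARY INDEX (store_id)
PARTITION BY RANGE_N (sale_date BETWEEN DATE '2013-01-01'
                      AND DATE '2013-12-31'
                      EACH INTERVAL '1' MONTH);

A query such as SELECT SUM(amount) FROM mydb.sales WHERE sale_date BETWEEN DATE '2013-05-01' AND DATE '2013-05-31'; touches only the May partition instead of scanning the whole table.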

What are the advantages and disadvantages of PPI in Teradata?

Advantages:
·      Range queries don’t have to utilize a Full Table Scan.
·      Deletions of entire partitions are lightning fast.
·      PPI provides an excellent solution instead of using Secondary Indexes
·      Tables that hold yearly information don’t have to be split into 12 smaller tables to avoid Full Table Scans (FTS). This can make modeling and querying easier.
·      Fastload and Multiload work with PPI tables, but not with all Secondary Indexes.

Disadvantages:
·      A two-byte Partition number is added to the ROW-ID and it is now called a ROW KEY. The two-bytes per row will add more Perm Space to a table.
·      Joins to Non-Partitioned Tables can take longer and become more complicated for Teradata to perform.
·      Basic select queries utilizing the Primary Index can take longer if the Partition number is not also mentioned in the WHERE clause of the query.
·      You can’t have a Unique Primary Index (UPI) if the Partition Number is not at least part of the Primary Index. You must therefore create a Unique Secondary Index to maintain uniqueness.

Volatile and Global Temporary Tables in Teradata?

Volatile tables are temporary tables that are materialized in spool and are unknown to the Data Dictionary.
A volatile table may be utilized multiple times and in more than one SQL statement throughout the life of a session. This feature allows for additional queries to utilize the same rows in the temporary table without requiring the rows to be rebuilt.

Volatile tables are local to the session, and they are dropped once the session is disconnected.
The ON COMMIT PRESERVE ROWS option should be specified at the time of table creation. It means that at the end of a transaction the rows in the volatile table will not be deleted; the information in the table remains for the entire session. Users can query the volatile table until they log off, and then the table and data go away.

Global Temporary Tables are similar to volatile tables in that their contents are local to a user's session. However, when the table is created, the definition is stored in the Data Dictionary. In addition, these tables are materialized in a permanent area known as Temporary Space. For these reasons, global temporary tables can survive a system restart, and the table definition is not discarded at the end of the session. However, when the system restarts, the rows inside a Global Temporary Table are removed. Lastly, global tables require no spool space; they use Temp space.

Statistics can be collected on both types of tables as of the TD13 version; previously, collecting stats on volatile tables was not allowed.
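Hypothetical DDL for both kinds of table (the names are illustrative):

CREATE VOLATILE TABLE vt_sales AS
( SELECT store_id, amount FROM mydb.sales )
WITH DATA
ON COMMIT PRESERVE ROWS;      -- lives in spool, gone at logoff

CREATE GLOBAL TEMPORARY TABLE gt_sales
( store_id INTEGER,
  amount   DECIMAL(10,2))
ON COMMIT PRESERVE ROWS;      -- definition in the Data Dictionary, rows in Temp space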

SubQuery and Correlated Subquery in teradata?

Subqueries and correlated subqueries are two important concepts in Teradata and are used very often.
The basic concept behind a subquery is that it retrieves a list of values that are used for comparison against one or more columns in the main query. Here the subquery is executed first and based on the result set, the main query will be executed.

For example,
SELECT empname, deptname FROM employee WHERE empid IN (SELECT empid FROM salarytable WHERE salary > 10000);
In the above query, the empids are chosen first, based on salary, in the subquery, and the main query is executed against that result set.

A Correlated Subquery is an excellent technique to use when there is a need to determine which rows to SELECT based on one or more values from another table. It combines subquery processing and join processing into a single request.

Teradata first reads a row from the main query and then goes into the subquery to find the rows that match the specified column value; then it moves on to the next row of the main query. This process continues until all the qualifying rows of the main query are processed.

For example,
SELECT empname, deptno, salary
FROM employeetable AS emp
WHERE salary = (SELECT MAX(salary) FROM employeetable AS emt
                WHERE emt.deptno = emp.deptno);

The above query returns the highest-paid employee in each department. This is also one of the scenario-based questions in Teradata interviews.

How to calculate the tablesize , database size  and free space left in a database in teradata?

DBC.TABLESIZE and DBC.DISKSPACE are the system views used to find the space occupied.

The query below gives the size of a table; remove the TABLENAME condition to list every table in the database, which is useful for finding big tables in case of any space recovery.

SELECT  DATABASENAME,
        TABLENAME,
        SUM(CURRENTPERM)/(1024*1024*1024) AS "TABLE SIZE"
FROM    DBC.TABLESIZE
WHERE   DATABASENAME = <'DATABASE_NAME'> AND TABLENAME = <'TABLE_NAME'>
GROUP BY 1,2;

Below query gives the total space and free space available in a database.

SELECT
        DATABASENAME DATABASE_NAME,
        SUM(MAXPERM)/(1024*1024*1024) TOTAL_PERM_SPACE,
        SUM(CURRENTPERM)/(1024*1024*1024) CURRENT_PERM_SPACE,
        TOTAL_PERM_SPACE-CURRENT_PERM_SPACE as FREE_SPACE
 FROM
        DBC.DISKSPACE
 WHERE
        DATABASENAME = <'DATABASE_NAME'>
group by 1;

What are the Performance improvement techniques available in Teradata?

First of all, use the EXPLAIN plan to see how the query is performing. Keywords like "product join" and "low confidence" are indicators of poor performance.

Make sure STATS are collected on the columns used in the WHERE clause and on the join columns. If stats are collected, the explain plan will show HIGH CONFIDENCE. Statistics tell the optimizer the number of rows in the table, which helps it choose between redistribution and duplication of smaller tables.

Check the joining columns & WHERE Clause whether PI, SI or PPI are used.

Check whether proper alias names are used in the joining conditions.

Split the queries into smaller subsets in case of poor performance.

What does a "Pseudo Table" lock mean in an EXPLAIN plan in Teradata?

It is a false lock applied on a table to prevent two users from getting conflicting locks with all-AMP requests.
The PE determines a particular AMP to manage all-AMP lock requests for a given table and puts the pseudo lock on the table.

Can you compress a column which is already present in a table using ALTER in Teradata?

No, we cannot use the ALTER command to add compression to columns already in a table.
A new table structure has to be created that includes the compression values, and the data should be inserted into the compressed table.
Please note: ALTER can be used only to add new columns with compression values to a table.
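A hypothetical illustration (the names and compressed values are illustrative):

CREATE TABLE mydb.customer
( cust_id INTEGER,
  state   CHAR(2) COMPRESS ('CA','NY','TX'));   -- compression defined at creation

ALTER TABLE mydb.customer
  ADD phone CHAR(12) COMPRESS;                  -- allowed: adding a NEW compressed column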

How to create a table with an existing structure of another table with or without data and also with stats defined in Teradata?

CREATE TABLE new_TABLE AS old_TABLE WITH DATA
CREATE TABLE new_TABLE AS old_TABLE WITH NO DATA
CREATE TABLE new_TABLE AS old_TABLE WITH DATA AND STATS

How to find the duplicate rows in the table in Teradata?

Group by those fields and add a count greater than 1 condition for those columns.
For example:
SELECT name, COUNT(*) FROM employee GROUP BY name HAVING COUNT(*) > 1;
DISTINCT is also useful: if SELECT COUNT(DISTINCT name) and COUNT(*) return the same number, there are no duplicates.

Which is more efficient GROUP BY or DISTINCT to find duplicates in Teradata?

With more duplicates, GROUP BY is more efficient; with fewer duplicates, DISTINCT is more efficient.

What is the difference between TIMESTAMP (0) and TIMESTAMP (6) in teradata?

Both have date and time values. The major difference is that TIMESTAMP(6) also carries fractional seconds, to microsecond precision, while TIMESTAMP(0) does not.

What is spool space and when running a job if it reached the maximum spool space how you solve the problem in Teradata?

Spool space is the space required by a query for processing or to hold the rows of the answer set. Spool space reaches its maximum when the query is not properly optimized. Use appropriate conditions in the WHERE clause and JOIN on the correct columns to optimize the query. Also make sure unnecessary volatile tables are dropped, as they occupy spool space.
Why does varchar occupy 2 extra bytes?

The two bytes hold the binary length of the field; they record the exact number of bytes of data actually stored in the VARCHAR column.

What is the difference between User and database in Teradata?

- A user is a database with a password; a database cannot have a password
- Both can contain tables, views and macros
- Both users and databases may or may not hold privileges
- Only users can log in, establish a session with the Teradata database, and submit requests

What are the types of HASH functions used in teradata?

The hash functions are HASHROW, HASHBUCKET, HASHAMP and HASHBAKAMP. Their SQL forms are:
HASHROW (column(s))
HASHBUCKET (hashrow)
HASHAMP (hashbucket)
HASHBAKAMP (hashbucket)

To find the data distribution of a table based on its PI, the query below is helpful. It gives the number of records on each AMP for that particular table.

SELECT HASHAMP(HASHBUCKET(HASHROW(PI_COLUMN))), COUNT(*) FROM TABLENAME GROUP BY 1;

TERADATA UTILITIES:

Explain Fast Load in Teradata?

Loads large amounts of data from an external file into an empty table at high speed. Only one empty table can be loaded per FastLoad job. Tables defined with referential integrity, secondary indexes, join indexes, hash indexes or triggers cannot be loaded with FastLoad. Duplicate rows cannot be loaded into a multiset table with FastLoad.

FastLoad has two phases, Data Acquisition and Application, and requires a separate error table for each phase. Use the BEGIN LOADING statement to specify (and create) the two error tables for the operation.

We can specify the error table in the same database as the data table or different database.
We must remove the error tables before we re-run the same load job or it will terminate in an error condition.

Explain Multi Load in Teradata?

Used for loading, updating or deleting data to and from populated tables, typically with batch inputs from a host file. It cannot process tables defined with USIs, referential integrity, join indexes, hash indexes, or triggers. It has no data retrieval capability, and import tasks require use of the Primary Index.

MultiLoad supports up to five target tables per script. Tables may contain pre-existing data, and MultiLoad can perform INSERTs, UPDATEs, DELETEs and UPSERTs.

Use the RELEASE MLOAD tablename; statement if an MLOAD does not complete successfully and there is no desire to restart it. This returns the target table to general availability. Error tables should be dropped manually, as RELEASE MLOAD will not drop them automatically.

Explain TPUMP (Teradata Parallel Data Pump) Utility in Teradata?

• TPUMP allows near real-time updates from transactional systems into the data warehouse. It can perform INSERT, UPDATE and DELETE operations, or a combination, from the same source.
• It can be used as an alternative to MLOAD for low-volume batch maintenance of large databases.
• TPUMP allows target tables to have secondary indexes, join indexes, hash indexes, referential integrity, or triggers defined, and the tables may be populated or empty, multiset or set.
• TPUMP can have many sessions, as it doesn't have a session limit.
• TPUMP uses row-hash locks, thus allowing concurrent updates on the same table.
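A minimal TPump sketch, assuming an insert-only feed (the logtable, file and table names are illustrative, and the PACK and RATE values are arbitrary examples):

.LOGTABLE mydb.tp_log;
.LOGON tdpid/username,password;
.BEGIN LOAD SESSIONS 4
    ERRORTABLE mydb.et_tpump
    PACK 20
    RATE 1000;
.LAYOUT emp_layout;
.FIELD emp_id   * VARCHAR(10);
.FIELD emp_name * VARCHAR(30);
.DML LABEL ins_emp;
INSERT INTO mydb.employee (emp_id, emp_name)
VALUES (:emp_id, :emp_name);
.IMPORT INFILE employee.dat
    FORMAT VARTEXT '|'
    LAYOUT emp_layout
    APPLY ins_emp;
.END LOAD;
.LOGOFF;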

How do you transfer large amount of data in Teradata?

Transferring of large amount of data can be done by using the various Teradata Utilities i.e. BTEQ, FASTLOAD, MULTILOAD, TPUMP and FASTEXPORT.
• BTEQ (Basic Teradata Query) supports all 4 DMLs: SELECT, INSERT, UPDATE and DELETE. BTEQ also supports IMPORT/EXPORT protocols.
• FastLoad, MultiLoad and TPump transfer data from the host to Teradata.
• FastExport is used to export data from Teradata to the host.

How to make sure the BTEQ utility does not error out while dropping a table when the table doesn't exist in Teradata?

Setting the severity of error 3807 (object does not exist) to 0 will achieve this:
.SET ERRORLEVEL 3807 SEVERITY 0
DROP TABLE TABLENAME;
.SET ERRORLEVEL 3807 SEVERITY 8

Difference between MultiLoad and TPump in Teradata?

TPump provides an alternative to MultiLoad for low-volume batch maintenance of large databases under the control of a Teradata system.
TPump updates information in near real time, acquiring data from the client system with low processor utilization.
It does this through a continuous feed of data into the data warehouse, rather than traditional batch updates.
Continuous updates result in more accurate, timely data. TPump uses row-hash locks rather than table-level locks, which allows you to run queries while TPump is running.

Different phases of MultiLoad in Teradata?

• Preliminary phase
• DML phase
• Acquisition phase
• Application phase
• End phase

Which is faster – MultiLoad delete or Delete command in Teradata?

MultiLoad delete is faster than the normal DELETE command, since the deletion happens in data blocks of 64 Kbytes, whereas the DELETE command deletes data row by row. The transient journal maintains entries only for the DELETE command, since the Teradata load utilities do not use the transient journal.

For smaller table deletes, a simple DELETE command is enough. A MultiLoad delete is useful when the delete has to be performed on a large table in Teradata.

How to Skip or Get first and Last Record from Flat File through MultiLoad in Teradata?

The .IMPORT command in MLOAD has an option to give the record number from which processing should begin, i.e. FROM m, where m is the logical record number, as an integer, of the record in the identified data source where processing is to begin. You can specify m as 2 and processing will start from the second record. THRU k and FOR n are options of the same .IMPORT command that work similarly to control where processing ends; an example follows below.

If FROM (the start record) and THRU/FOR (the stop record) are not specified, MLOAD processes records from the start to the end of the file.
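For example, a hypothetical .IMPORT that skips a header record and then processes the next 100 records (the file, layout and DML label names are illustrative):

.IMPORT INFILE employee.dat
    FROM 2 FOR 100
    LAYOUT emp_layout
    APPLY ins_emp;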

Why Fload doesn’t support multiset table in Teradata?

Fload does not support Multiset table because of restart capability.

Say a FastLoad job fails. Up to the point of failure, some number of rows has already been sent to the AMPs.
When you restart FLOAD, it starts loading from the last checkpoint, so some of the consecutive rows are sent a second time. These rows are caught as duplicates after the data is sorted, and they are discarded.

This restart logic is the reason FastLoad will not load duplicate rows into a MULTISET table: it must assume such rows are re-sent duplicates. So FastLoad can load data into a multiset table, but it will not load the duplicate rows.

How to skip the header row in a FastLoad script?

RECORD 2; /* this skips first record in the source file */
DEFINE ...