Teradata Data Placement
Because
Teradata was built for large data warehouses, its architects knew that data
placement and management of tables could be a full-time job. That is why they
designed Teradata to automatically manage the data. Nobody had ever attempted
this incredible feat. The Teradata designers dreamed of things that never were
and made them so.
Managing
table space, disks, and other system administration functions in a data
warehouse is a nightmare. Teradata has made the DBA’s role a dream because
Teradata lets the system handle the difficult functions. Teradata not only
spreads the data evenly, but it can retrieve it quickly because it knows which
AMP holds a particular row.
Teradata
always attempts to spread data evenly so each AMP will manage approximately the
same amount of data. As a result, the rows of every table are distributed
across all of the AMPs. In other words, every AMP stores a portion of every
table in the database on its virtual disk (VDISK). If a data warehouse has 200
tables, then each AMP will hold a portion of each of the 200 tables. This method
of data distribution is a hallmark of Teradata’s design.
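To make the idea concrete, here is a minimal Python sketch of even distribution. It assumes rows are placed by hashing a key value and mapping the hash to an AMP. The MD5 hash, the modulo mapping, the amp_for and distribute names, the integer keys, and the table sizes are all illustrative assumptions, not Teradata’s actual algorithm or API.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # hypothetical system size


def amp_for(key, num_amps=NUM_AMPS):
    """Toy placement rule: hash the key, then map the hash to an AMP number.

    This is a stand-in for illustration only, not Teradata's real hashing.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def distribute(table_sizes, num_amps=NUM_AMPS):
    """Return, for each table, how many of its rows land on each AMP."""
    placement = {}
    for table, row_count in table_sizes.items():
        # Keys are simply 0..row_count-1 in this toy example.
        placement[table] = Counter(
            amp_for(key, num_amps) for key in range(row_count)
        )
    return placement


# Made-up table sizes for the sketch.
table_sizes = {"Employee": 4_000, "WebLog": 40_000, "Order": 8_000}
placement = distribute(table_sizes)

for table, counts in placement.items():
    # One count per AMP; each AMP should hold roughly a quarter of the rows.
    print(table, [counts[amp] for amp in range(NUM_AMPS)])

# The same rule that placed a row also tells the system where to find it.
print("Row with key 1234 lives on AMP", amp_for(1234))
```

Because placing a row and finding it use the same rule, a lookup by key only has to ask one AMP, while a full-table read involves every AMP equally.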
There
are some significant benefits to handling data this way: First, the biggest
bottleneck in many systems is the disk. Because each AMP has its own virtual
disk and each table is spread among the AMPs, no single disk becomes a bottleneck.
Second,
when each AMP has nearly the same quantity of table rows, then no one AMP
becomes a data bottleneck. AMPs can retrieve all or a portion of the data
in parallel so you do not have AMPs sitting idle while others are chugging
away. Baseball legend Casey Stengel once said, “It’s easy to get good
players. Getting ’em to play together, that’s the hard part.” AMPs love to work
together in parallel.
Third,
each AMP is unaware of any data except its own portion. Each AMP can read or
write ONLY the rows that it actually owns. This makes retrieving data from a
particular row very efficient, as each AMP focuses on its own work.
Fourth,
each AMP automatically groups all of its rows by the tables from which they
come. Have you ever been to a large aquarium and seen one of the displays that
look like a very tall, clear cylinder? As you walk around the glass, you notice
that the fish tend to swim in schools. Teradata does something similar with the
rows on the AMPs to boost performance. When you ask for data from any given
table, an AMP will go immediately to that particular group of rows and then
select what you need. It doesn’t need to look through the rows of many tables
before it finds what you need. Here is how parallel processing works: the AMPs
retrieve data in parallel and then pass it over the BYNET to the PE. The PE
ensures the data is delivered to the user. Keep in mind that the BYNET is an
internal Teradata network, across which the PEs and the AMPs communicate.
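The flow just described can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the Amp class, the select_all function standing in for the PE, the thread pool standing in for the BYNET, and the round-robin placement of rows are all hypothetical and are not Teradata internals.

```python
from concurrent.futures import ThreadPoolExecutor


class Amp:
    """Toy AMP: owns some rows and keeps them grouped by table."""

    def __init__(self, amp_id):
        self.amp_id = amp_id
        self.rows_by_table = {}  # table name -> rows this AMP owns

    def insert(self, table, row):
        self.rows_by_table.setdefault(table, []).append(row)

    def scan(self, table):
        # Go straight to this table's group of rows; rows that belong
        # to other tables are never examined.
        return list(self.rows_by_table.get(table, []))


def select_all(amps, table):
    """Toy PE: ask every AMP for its share in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(amps)) as pool:
        partial_results = list(pool.map(lambda amp: amp.scan(table), amps))
    return [row for partial in partial_results for row in partial]


amps = [Amp(amp_id) for amp_id in range(4)]

# Load 12 rows into each of two tables. Round-robin placement is used
# only to keep the sketch short.
for key in range(12):
    owner = amps[key % len(amps)]
    owner.insert("Employee", ("Employee", key))
    owner.insert("WebLog", ("WebLog", key))

employees = select_all(amps, "Employee")
print(len(employees), "Employee rows returned")
```

Each toy AMP touches only its own rows, and only the group for the requested table, which mirrors the third and fourth benefits above.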
The
example below shows the information we have just discussed. Notice that the
system has four AMPs and three tables: “Employee,” “WebLog,” and “Order.” Each
AMP holds a portion of the rows for every table. AMP1, for example, holds
one quarter of the Employee table rows, one quarter of the WebLog
table rows, and one quarter of the Order table rows.
Plus,
the data is spread evenly across the AMPs for all tables. If a query asks for all
rows in the Employee table, then each AMP will retrieve its Employee table rows in
parallel. Each AMP will then pass its data to the PE via the BYNET. Because the
data in the Employee table is spread evenly among all AMPs, each should finish
reading at about the same time.
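A little arithmetic shows why even distribution matters for a full-table read. In the rough Python sketch below, the query is treated as finished only when the slowest AMP finishes, so elapsed time depends on the AMP holding the most rows. The scan_time function, the scan rate of one million rows per second, and the row counts are made-up assumptions for illustration, not measured Teradata figures.

```python
def scan_time(rows_per_amp, rows_per_second=1_000_000):
    """Estimated seconds for a parallel full-table scan.

    All AMPs read at once, so the busiest AMP sets the pace.
    The scan rate is a hypothetical figure.
    """
    return max(rows_per_amp) / rows_per_second


# One million Employee rows on four AMPs, spread evenly...
even = [250_000, 250_000, 250_000, 250_000]
# ...versus the same million rows piled mostly onto one AMP.
skewed = [700_000, 100_000, 100_000, 100_000]

print("even spread:  ", scan_time(even), "seconds")
print("skewed spread:", scan_time(skewed), "seconds")
```

Under these assumptions the skewed layout takes nearly three times as long, because three AMPs sit idle while one is still reading.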
Also,
notice how each AMP separates each table. Just like schools of fish, the rows
of the Employee table are grouped together. Likewise, the rows of the WebLog
table and the rows of the Order table are each grouped together. This is key in
a data warehouse environment because many queries read millions of rows.
Performance is enhanced when table rows are grouped together and Teradata can
bring whole blocks of rows into memory.