Search This Blog

Monday, May 20, 2013

Teradata OLAP Functions


                                              OLAP Functions

 
The OLAP functions are built into the Teradata database to provide data mining capabilities and trend analysis. These functions provide processing not available using the standard aggregation. As mentioned earlier, an aggregate eliminates the detail data from a row. These OLAP functions provide the result of their operation and display the detail data values used in the function.

This technique produces output that is somewhat like the Teradata extensions of WITH and WITH BY. The similarity is in the fact that the detail row data is also displayed as part of the answer set. However, since the output is in row format and not in a report format like WITH, the OLAP functions may be performed on all tables or views and may  be used to populate tables in conjunction with INSERT/SELECT. The biggest difference is that these can be used in Queryman, unlike WITH.

Below list has some of the important OLAP functions used.

Cumulative Sum Using the CSUM Function

The process of creating a cumulative sum means that data values in sequential rows are added together. The same addition was seen earlier in this book when using the SUM aggregate, with some major differences that are contrasted here.

The Cumulative Sum (CSUM) function provides a running or cumulative total for a column’s numeric value. This allows users to see what is happening with column totals over an ongoing progression. The results will be sorted in ascending or descending order and the sort list can consist of a single or multiple columns, listed as sort keys.

The syntax for CSUM is:
SELECT CSUM( <column-name>, <sort-key> [ASC | DESC]
                             [, <sort-key> [ASC | DESC] … )
FROM <table-name>
[GROUP BY <column-name> [,<column-number> … ] ]
;

The CSUM command uses the first parameter as the column containing a numeric value to sum. This value will be added to the previous data values and provide a running or cumulative answer.

The second parameter is entered as a sort key that sequences the output rows. This column determines the major sort sequence of the detail data from the row along with the CSUM. By default, the sort sequence is ascending (ASC). The DESC can be specified to request a descending (highest to lowest) sequence. Optionally, additional sort keys can be entered to specify one or more minor sort sequences.

To understand the use of CSUM, we will use a table that stores sales data for all products for each store on a daily basis. The CSUM function can be used to show the daily sales data for any or all of the products and accumulate the sales data for both running and final totals.

The column specified in this CSUM should contain the sales dollar figure for each day and the sort key as the date. Then, the WHERE clause provides a beginning and ending date for the query.

Cumulative Sum with Reset Capabilities

The CSUM has the ability to reset values throughout the accumulation process to obtain a subtotal. It uses the GROUP BY designation to specify a data column that, when the value changes, causes the accumulation value to be reset back to zero.

Using CSUM and GROUP BY

A cumulative sum may be reset to zero at a specified breakpoint by merely adding a GROUP BY clause to the query when using the CSUM function. The GROUP BY provides for subtotals within the output.


Cumulative Sum with Reset Capabilities

As mentioned with the CSUM, this version of the cumulative sum can be reset a to provide the equivalent of a subtotal. This method uses the ANSI Standard SUM with OVER and PARTITION BY designators to specify a data value that, when it changes,
causes the accumulation value to be reset back to zero.

SUM Using SUM and OVER / PARTITION BY

Normally aggregate functions and OLAP functions are incompatible. This is because
aggregates provide only the final single row answer and eliminate row detail data.
Conversely, the OLAP functions provide the row detail data and the answer. Using this
ANSI syntax, the SUM aggregate can be made to act more as an OLAP function to
provide both the answer and the row detail.

As seen above, CSUM is a Teradata Extension. It may use the GROUP BY designation
to reset the accumulation process for the equivalent of a subtotal. The ANSI method
does not use GROUP BY. Instead, it uses the OVER to design that a partition or group
can be established using the PARTITION BY designator.

One of the major advantages to the PARTITION is that each column that is a SUM can
be based on a different value. Whereas, there can only be a single GROUP BY in a
SELECT.

The following ANSI syntax used with SUM to provide OLAP functionality:
SELECT
            SUM(<column-name>) OVER ( PARTITION BY <column-name>
                  ORDER BY <column-name> [ASC | DESC]
                           [,<column-name> [ASC | DESC] ] )
FROM <table-name>
;

Below, the previously used SELECT is again modified with SUM and OVER /
PARTITION to show the equivalent cumulative sales with a reset on a change in the date
as seen with the CSUM:
SELECT
        Product_ID
        ,Sale_Date
        ,Daily_Sales
        ,SUM(Daily_Sales) OVER ( PARTITION BY Sale_Date
                              ORDER BY Sale_Date
                              ROWS UNBOUNDED PRECEDING)
                                     (format ‘$$$$,$$$.99’) AS Like_CSum
FROM Sales_table
WHERE Sale_Date between 1001001 and 1001003 ;

The PARTITION has the same effect here as the GROUP BY does in the proprietary
Teradata extension OLAP functions.
There are two advantages to using this syntax. First, it is the ANSI standard. Second,
and the biggest advantage, is that it is compatible with other OLAP functions because the  detail data is retained as well as the use of aggregate functions within the formulas for derived data.


Partitioning Data Using the QUANTILE Function


A Quantile is used to divide rows into a number of partitions of roughly the same number
of rows in each partition. The percentile is the QUANTILE most commonly used in
business. This means that the request is based on a value of 100 for the number of
partitions. It is also possible to have quartiles (based on 4), tertiles (based on 3) and
deciles (based on 10).

By default, both the QUANTILE column and the QUANTILE value itself will be output
in ascending sequence. As in some cases, the ORDER BY clause may be used to reorder the output for display. Here the order of the output does not change the meaning of the output, unlike a summation where the values are being added together and all need to appear in the proper sequence.

The syntax of the QUANTILE function is:

SELECT QUANTILE (<partitions>, <column-name> ,<sort-key> [DESC | ASC])
FROM <table-name>
[QUALIFY QUANTILE (<column-name>) {< | > | = | <= | >=} <number-of-rows>]
;

The next SELECT determines the percentile for every row in the Sales table based on the daily sales amount and sorts it into sequence by the value being partitioned, in this case the daily sales amount:

SELECT Product_ID
       ,Sale_Date
       ,Daily_Sales
       ,QUANTILE (100, Daily_Sales )
FROM Sales_table
WHERE Product_ID < 3000 and Sale_Date > 1000930 ;

QUALIFY to Find Products in the top Partitions

Like the aggregate functions, OLAP functions must read all required rows before
performing their operation. Therefore, the WHERE clause cannot be used. Where the
aggregates use HAVING, the OLAP functions uses QUALIFY. The QUALIFY
evaluates the result to determine which ones to return.

The following SELECT uses a QUALIFY to show only the products that sell in the top
60 Percentile:

SELECT Product_ID
        ,Sale_Date ,Daily_Sales
        ,QUANTILE (100, Daily_Sales, Sale_Date ) as “Percentile”
FROM Sales_table
QUALIFY “Percentile” >= 60 ;

Ranking Data using RANK

The Ranking function (RANK) permits a column to be evaluated and compared, either
based on high or low order, against all other rows to create the output set. The order will
be sorted by default in descending sequence of the ranking column, which correlates to
descending rank.

This style of selecting the best and the worst has been available using SQL in the past.
However, it was very involved and required extensive coding and logic in order to make
it work. The new RANK function replaces all the elaborate SQL previously required to
create this type of output.

The output of the RANK function is the highest or the lowest data values in the column,
depending on the sort requested. A query can return a specified number of the “best”
rows (highest values) or the “worst” rows (lowest values). These capabilities and output
options will be demonstrated below.

Here is the syntax for RANK:

SELECT RANK( <column-name> [DESC | ASC] )
FROM <table-name>
[GROUP BY <column-name> [,<column-number> ] ] [ { ASC | DESC } ]
[QUALIFY RANK(<column-name>) {< | <=} <number-of-rows> ]
;


QUALIFY to Find Top Best or Bottom Worse

The above report could have been created without the columns in the RANK function
and RANK value. It is a list in descending sequence by the sales amount. With a small
number of rows, the best and the worst is readily available. However, when there are
hundreds or millions of rows, returning all the rows takes far too much time.
Instead, it is preferable to only return the rows desired. For instance, the best 20 or the
worst 20 might be needed. Like the QUANTILE function, the RANK function uses a
QUALIFY clause to allow for control of how many rows to output in the final result.

The following SELECT is the same as the above, but uses the QUALIFY to limit the
output to the best 3 (highest values) rows:

SELECT
           Product_ID
           ,Sale_Date
           ,Daily_Sales
           ,RANK(Daily_Sales)
FROM Sales_table
QUALIFY RANK(Daily_Sales) < 4 ;


Using RANK with GROUP BY

As previously seen, the RANK function permits a column to be ranked, either based on
high or low order, against other rows. The GROUP BY can be used in conjunction with
a RANK function to change the ranking function’s scope. This allows a check on which
items were in the top sales bracket for each store.

The following SELECT ranks the daily sales for each product using the GROUP BY and
creates an alias for the RANK column to use in the QUALIFY to find the best 2 days:

SELECT
           Product_ID
           ,Sale_Date
           ,Daily_Sales
           ,RANK(Daily_Sales) AS Ranked
FROM Sales_table
GROUP BY Product_ID
QUALIFY Ranked <= 2
;




4 comments: