The
column specified in this CSUM should contain the sales dollar figure for each
day and
the sort key as the date. Then, the WHERE clause provides a beginning and ending date
for the query.
Cumulative Sum with Reset Capabilities
|
The
CSUM has the ability to reset values throughout the accumulation process to obtain
a subtotal.
It uses the GROUP BY designation to specify a data column that, when the value
changes, causes the accumulation value to be reset back to zero.
|
A cumulative
sum may be reset to zero at a specified breakpoint by merely adding a GROUP
BY clause to the query when using the CSUM function. The GROUP BY provides
for subtotals within the output.
Cumulative Sum with Reset Capabilities
|
As
mentioned with the CSUM, this version of the cumulative sum can be reset a to provide
the equivalent of a subtotal. This method uses the ANSI Standard SUM with OVER
and PARTITION BY designators to specify a data value that, when it changes,
causes
the accumulation value to be reset back to zero.
|
SUM Using SUM and OVER / PARTITION BY
|
Normally
aggregate functions and OLAP functions are incompatible. This is because
aggregates
provide only the final single row answer and eliminate row detail data.
Conversely,
the OLAP functions provide the row detail data and the answer. Using this
ANSI
syntax, the SUM aggregate can be made to act more as an OLAP function to
provide
both the answer and the row detail.
|
As
seen above, CSUM is a Teradata Extension. It may use the GROUP BY designation
to
reset the accumulation process for the equivalent of a subtotal. The ANSI method
does
not use GROUP BY. Instead, it uses the OVER to design that a partition or group
can
be established using the PARTITION BY designator.
|
One
of the major advantages to the PARTITION is that each column that is a SUM can
be
based on a different value. Whereas, there can only be a single GROUP BY in a
SELECT.
|
The
following ANSI syntax used with SUM to provide OLAP functionality:
|
SUM(<column-name>) OVER ( PARTITION BY <column-name>
ORDER BY <column-name>
[ASC | DESC]
[,<column-name>
[ASC | DESC] ] )
FROM <table-name>
;
|
Below,
the previously used SELECT is again modified with SUM and OVER /
PARTITION
to show the equivalent cumulative sales with a reset on a change in the date
as
seen with the CSUM:
|
Product_ID
,Sale_Date
,Daily_Sales
,SUM(Daily_Sales) OVER ( PARTITION BY Sale_Date
ORDER BY Sale_Date
ROWS UNBOUNDED PRECEDING)
(format ‘$$$$,$$$.99’)
AS Like_CSum
FROM Sales_table
WHERE Sale_Date between 1001001 and 1001003 ;
The
PARTITION has the same effect here as the GROUP BY does in the proprietary
Teradata
extension OLAP functions.
|
There
are two advantages to using this syntax. First, it is the ANSI standard. Second,
and
the biggest advantage, is that it is compatible with other OLAP functions because
the detail data is retained as well as the use of aggregate functions within the formulas for derived data.
|
Partitioning Data Using the QUANTILE Function
|
A Quantile
is used to divide rows into a number of partitions of roughly the same number
of
rows in each partition. The percentile is the QUANTILE most commonly used in
business.
This means that the request is based on a value of 100 for the number of
partitions.
It is also possible to have quartiles (based on 4), tertiles (based on 3) and
deciles
(based on 10).
|
By
default, both the QUANTILE column and the QUANTILE value itself will be output
in
ascending sequence. As in some cases, the ORDER BY clause may be used to reorder the
output for display. Here the order of the output does not change the meaning of
the output,
unlike a summation where the values are being added together and all need to appear
in the proper sequence.
|
The
syntax of the QUANTILE function is:
|
SELECT QUANTILE (<partitions>, <column-name>
,<sort-key> [DESC | ASC])
FROM <table-name>
[QUALIFY QUANTILE (<column-name>) {<
| > | = | <= | >=} <number-of-rows>]
;
|
The
next SELECT determines the percentile for every row in the Sales table based on
the daily
sales amount and sorts it into sequence by the value being partitioned, in this
case the
daily sales amount:
|
SELECT Product_ID
,Sale_Date
,Daily_Sales
,QUANTILE
(100, Daily_Sales )
FROM Sales_table
|
WHERE Product_ID < 3000 and
Sale_Date > 1000930 ;
QUALIFY to Find Products in the top Partitions
|
Like
the aggregate functions, OLAP functions must read all required rows before
performing
their operation. Therefore, the WHERE clause cannot be used. Where the
aggregates
use HAVING, the OLAP functions uses QUALIFY. The QUALIFY
evaluates
the result to determine which ones to return.
|
The
following SELECT uses a QUALIFY to show only the products that sell in the top
60
Percentile:
|
SELECT Product_ID
,Sale_Date ,Daily_Sales
,QUANTILE (100, Daily_Sales, Sale_Date ) as “Percentile”
FROM Sales_table
QUALIFY “Percentile” >= 60 ;
The
Ranking function (RANK) permits a column to be evaluated and compared, either
based
on high or low order, against all other rows to create the output set. The
order will
be
sorted by default in descending sequence of the ranking column, which
correlates to
descending
rank.
|
This
style of selecting the best and the worst has been available using SQL in the
past.
However,
it was very involved and required extensive coding and logic in order to make
it
work. The new RANK function replaces all the elaborate SQL previously
required to
create
this type of output.
|
The
output of the RANK function is the highest or the lowest data values in the
column,
depending
on the sort requested. A query can return a specified number of the “best”
rows
(highest values) or the “worst” rows (lowest values). These capabilities and
output
options
will be demonstrated below.
|
Here
is the syntax for RANK:
|
SELECT RANK( <column-name> [DESC | ASC]
)
FROM <table-name>
[GROUP BY <column-name>
[,<column-number> ] ] [ { ASC | DESC } ]
[QUALIFY RANK(<column-name>) {< |
<=} <number-of-rows> ]
;
|
QUALIFY to Find Top Best or Bottom Worse
|
The
above report could have been created without the columns in the RANK function
and
RANK value. It is a list in descending sequence by the sales amount. With a small
number
of rows, the best and the worst is readily available. However, when there are
hundreds
or millions of rows, returning all the rows takes far too much time.
|
Instead,
it is preferable to only return the rows desired. For instance, the best 20
or the
worst
20 might be needed. Like the QUANTILE function, the RANK function uses a
QUALIFY
clause to allow for control of how many rows to output in the final result.
|
The
following SELECT is the same as the above, but uses the QUALIFY to limit the
output
to the best 3 (highest values) rows:
|
Product_ID
,Sale_Date
,Daily_Sales
,RANK(Daily_Sales)
FROM Sales_table
QUALIFY RANK(Daily_Sales) < 4 ;
|
|
As
previously seen, the RANK function permits a column to be ranked, either
based on
high
or low order, against other rows. The GROUP BY can be used in conjunction
with
a
RANK function to change the ranking function’s scope. This allows a check on
which
items
were in the top sales bracket for each store.
|
The
following SELECT ranks the daily sales for each product using the GROUP BY
and
creates
an alias for the RANK column to use in the QUALIFY to find the best 2 days:
SELECT
|
Product_ID
,Sale_Date
,Daily_Sales
,RANK(Daily_Sales) AS Ranked
FROM Sales_table
GROUP BY Product_ID
QUALIFY Ranked <= 2
;
|
|
|
|
Simple yet 'the the best' post.
ReplyDeletevaluable post for Teradata folks
ReplyDelete:)
ReplyDelete:)
ReplyDelete