Transformation Functions: Doing Something With the Data

These functions can be imported from the ppc_robot_lib.reporting.transformations module.

Computations

compute(table, columns)[source]

Computes one or more columns and add them to the table.

Columns are given as a dictionary, where key is a name of a new column and value is an expression. Expression can be either callable, or string with an expression for pandas.eval(). Syntax is briefly described at Expression evaluation via eval(). Please note that calllable is strongly preferred.

The callable receives a pandas.DataFrame as the first positional argument. If the callable also contains the task_ctx argument, it will receive an ppc_robot_lib.tasks.task_context.TaskContextInterface as a keyword argument.

The callable should return a value that is directly assignable to the column: pandas.Series or a single scalar value.

Example:

>>> from ppc_robot_lib.reporting.transformation import compute
>>> compute(data_frame, {
...     'Ctr1': lambda df: df['Clicks'] / df['Impressions'],
...     'Ctr2': 'Clicks / Impressions',
... })

Parameters:

table (DataFrame) – Table to use.
columns (dict[str, str | Callable[[DataFrame], Series | str | int | float | bool]]) – Dictionary, where key is a name of a new column and value is an expression.

Filtering and Splitting

filter_table(table, filter_cb)[source]

Filters rows in the given table.

The filter can be either a callable, or list of callables. The callable receives a pandas.DataFrame as a single argument and must return a boolean vector that is used for selection.

It you use the list of callables, their result vectors are “ANDed” together using the & operator. This means that the resulting rows must match all the conditions.

Example:

>>> from ppc_robot_lib.reporting.transformation import filter_table
>>> filter_table(urls_df, [
...     lambda df: df['http_status_code'] >= 200,
...     lambda df: df['http_status_code'] <= 299,
... ])

Parameters:

table (DataFrame) – Name of the input table.
filter_cb (list[Callable[[DataFrame], Series | bool]] | Callable[[DataFrame], Series | bool]) – Filter callback. Can be a lambda function or list of lambda functions. The row must match all conditions.

Returns:

Filtered DataFrame.

filter_column_is_in(input_table, input_column, ref_values, negative_match=False)[source]

Filters rows in the table, selects only rows with values in the given column that matches values in another column from the reference table. Imagine it like filtering using INNER JOIN without joining.

You can also pass a pandas.Series or any iterable as ref_table. In this case, you have to

Example:

Let’s assume we have the two tables, in_table and out_table:

in_table
Campaign	Tag
Campaign A	product
Campaign B	remarketing
Campaign C	dsa
Campaign D	product
Campaign E	banner

ref_table
Id	Ref Tag
1	product
2	dsa
3	responsive

By executing the code:

>>> from ppc_robot_lib.reporting.transformation import filter_column_is_in
>>> filter_column_is_in(in_table, "Tag", ref_values=ref_table["Ref Tag"])

We would get the following output:

out_table
Campaign	Tag
Campaign A	product
Campaign C	dsa
Campaign D	product

If we set negative_match=True:

>>> from ppc_robot_lib.reporting.transformation import filter_column_is_in
>>> filter_column_is_in(in_table, "Tag", ref_values=ref_table["Ref Tag"], negative_match=True)

The resulting table would look like this:

out_table
Campaign	Tag
Campaign B	remarketing
Campaign E	banner

Parameters:

input_table (DataFrame) – Table that should be filtered.
input_column (str) – Column in the input table which should be used for lookups in the reference table.
ref_values (Series | Iterable[Any]) – Reference values: can be an column from another table or any iterable.
negative_match (bool) – Set to True if you need rows that DO NOT match.

Return type:

DataFrame

split_by_column_value(table, column, output_tables, default_table=None)[source]

Enables you to split one table to multiple tables according to value of a specified column.

output_tables is a dictionary, key is the name of resulting table and value can be either:

Value of the column to match.
Callable recieving a single value and returing bool. If the callable returns True, the row will be inserted into the given table.
List of values or callables with the same semantics as in the previous step. The row will be inserted if at least one of the values or callables match.

Please note that row order might not be preserved, as this operation internally uses group by and then operates on disjunct segments of table.

Resulting tables are returned in a dictionary.

Example:

Let’s assume we have the following table:

in_table
A	X
A1	1
A3	3
A2	2
A4	4
A7	7
A6	6
A5	5
A9	9
A0	10
A8	8
AA	11

By executing the function:

>>> from ppc_robot_lib.reporting.transformation import split_by_column_value
>>> split_by_column_value(in_table, 'X', {
...     'o1': [1, 2, 3, lambda x: 8 <= x <= 10],
...     'o2': lambda x: 4 <= x <= 7
... }, default_table='default')
{
    'o1': pandas.DataFrame(...)
    'o2': pandas.DataFrame(...)
    'default': pandas.DataFrame(...)
}

We would get the following tables:

o1
A	X
A1	1
A3	3
A2	2
A9	9
A0	10
A8	8

o2
A	X
A4	4
A7	7
A6	6
A5	5

default
A	X
AA	11

Parameters:

table (DataFrame) – Input table.
column (str) – Column to use for split.
output_tables (dict[str, Any | list[Any] | Callable[[Any], bool]]) – Dictionary of output tables. Key is name of the result table, value is either value to match, callable or list of values/callables to filter by.
default_table (str) – Table to be used when the value does not match any of the tables specified in the output_tables argument. If none is given, these rows are dropped.

Return type:

dict[str, DataFrame]

Returns:

Dictionary of the resulting tables.

Joining DataFrames

join(left_table, right_table, on=None, left_on=None, right_on=None, join_type=JoinType.INNER, do_sort=True, ambiguous_column_suffixes=('', '_right'), left_index=False, right_index=False)[source]

Performs an SQL-like join on two tables.

This step supports 4 types of joins:

You can join either by columns with same name from both tables, or you can provide list of columns in both left and right table. The lists has to be of equal length.

When performing the join, values of columns on corresponding indexes must be equal in order to match the rows.

For details on joins in Pandas, see merge().

Example:

>>> from ppc_robot_lib.steps.transformations import JoinOnColumnStep, JoinType
>>> JoinOnColumnStep("keywords", "adgroups",
...                  left_on=["CampaignName", "AdGroupName"],
...                  right_on=['CampaignName', "Name"],
...                  join_type=JoinType.INNER,
...                  output_table="keywords_with_adgroup_data")

Parameters:

left_table (DataFrame) – Left table.
right_table (DataFrame) – Right table.
on (str | list[str]) – Column names that must match in both tables. Exclusive with left_on and right_on.
left_on (str | list[str]) – List of columns to match in the left table. Must be the same length with right_on. Exclusive with on.
right_on (str | list[str]) – List of columns to match in the left table. Must be the same length with right_on. Exclusive with on.
join_type (JoinType) – Type of the join. See merge() for explanation of individual types.
do_sort (bool) – Set to True if you would like to sort the tables by join columns.
ambiguous_column_suffixes – Pair of column suffixes that is used when tables with conflicting column names are used. The first element is used for the left table, the second one for the right table.
left_index – Use index from the left table when matching the values.
right_index – Use index from the right table when matching the values.

Return type:

DataFrame

Returns:

Resulting table.

concat(*tables)[source]

Concatenates rows of multiple tables into one. Internally uses pandas.concat() over the rows axes, this means that the same semantics applies. All indexes are ignored.

Example:

>>> from ppc_robot_lib.reporting.transformation import concat
>>> concat(table_a, table_b, table_c)

Parameters:: tables (list[DataFrame]) – Tables to concat.
Return type:: DataFrame
Returns:: Concatenated DataFrames.

Sorting

sort_values(table, order, inplace=True)[source]

Sorts the table by one or more columns.

Example:

>>> from ppc_robot_lib.steps.transformations import SortStep
>>> from ppc_robot_lib.utils.types import Order, OrderBy
>>> SortStep("in_table", order=[
...     OrderBy("Impressions", Order.DESC),
...     OrderBy("CampaignName", Order.ASC),
... ])

Parameters:

table (DataFrame) – Input table.
order (list[OrderBy] | tuple[OrderBy, ...]) – List of ppc_robot_lib.utils.types.OrderBy objects that specify the sorting order.
inplace (bool) – Perform inplace sorting?

Return type:

DataFrame

Returns:

Sorted table. If inplace=True, the same instance is returned.

Renaming Columns

suffix_columns(table, suffix, except_cols=None, inplace=True)[source]

Suffixes column names with the given suffix. If a column name is listed in except_cols, its name is kept as-is.

If the inplace parameter is set to True (the default), it is

Example:

>>> from ppc_robot_lib.reporting.transformation import suffix_columns
>>> suffix_columns(performance, '_yday', except_cols=['Id', 'CampaignName'])

Parameters:

table (DataFrame) – Input table.
suffix (str) – Suffix to add to column names.
except_cols (list[str]) – List of collumn names that will not be suffixed.

Return type:

DataFrame

Returns:

Updated table. If inplace=True, the same instance is returned.

Grouping and Aggregations

Grouping and aggregations are very specific operations, so there are no universal steps for these operations. Use grouping operations from Pandas: Group By: split-apply-combine

Utilities

add_client_id(table, column)[source]

Add a new column to the table that contains with Client ID from task context.

Example: :rtype: None

>>> from ppc_robot_lib.steps.transformations import AddClientIdStep
>>> AddClientIdStep("table", "client_id")