Transformations

All steps can be imported directly from the ppc_robot_lib.steps.transformations module.

Filtering, splitting and merging

ConcatStep

class ConcatStep(input_tables, output_table)[source]

Concatenates rows of multiple tables into one. Internally uses pandas.concat() over the rows axes, this means that the same semantics applies.

Example:

>>> from ppc_robot_lib.steps.transformations import ConcatStep
>>> ConcatStep(["urls_ads", "urls_keywords", "urls_sitelinks"], output_table="urls")

Parameters:

input_tables (list[str]) – List of input tables.
output_table (str) – Output table. The output table must not already exist.

FilterStep

class FilterStep(input_table, filter_cb, output_table=None)[source]

Filters rows in the given table.

The filter can be either a callable, or list of callables. The callable receives a pandas.DataFrame as a single argument and must return a boolean vector that is used for selection.

It you use the list of callables, their result vectors are “ANDed” together using the & operator. This means that the resulting rows must match all the conditions.

Example:

>>> from ppc_robot_lib.steps.transformations import FilterStep
>>> FilterStep("urls", [
...     lambda df: df['http_status_code'] >= 200,
...     lambda df: df['http_status_code'] <= 299,
... ])

Parameters:

input_table (str) – Name of the input table.
filter_cb (list[Callable[[DataFrame], Series | bool]] | Callable[[DataFrame], Series | bool]) – Filter callback. Can be a lambda function or list of lambda functions. The row must match all conditions.
output_table (str) – Name of the output table. If not given, the result is written to the input table.

FilterColumnMatchStep

class FilterColumnMatchStep(input_table, input_column, ref_table, ref_column, output_table=None, negative_match=False)[source]

Filters rows in the table, selects only rows with values in the given column that matches values in another column from the reference table. Imagine it like filtering using INNER JOIN without joining.

Example:

Let’s assume we have the two tables, in_table and out_table:

in_table
Campaign	Tag
Campaign A	product
Campaign B	remarketing
Campaign C	dsa
Campaign D	product
Campaign E	banner

ref_table
Id	Ref Tag
1	product
2	dsa
3	responsive

By executing the code:

>>> from ppc_robot_lib.steps.transformations import FilterColumnMatchStep
>>> FilterColumnMatchStep("in_table", "Tag", ref_table="ref_table", ref_column="Ref Tag",
...                       output_table="out_table")

We would get the following output:

out_table
Campaign	Tag
Campaign A	product
Campaign C	dsa
Campaign D	product

If we set negative_match=True:

>>> from ppc_robot_lib.steps.transformations import FilterColumnMatchStep
>>> FilterColumnMatchStep("in_table", "Tag", ref_table="ref_table", ref_column="Ref Tag",
...                       output_table="out_table", negative_match=True)

The resulting table would look like this:

out_table
Campaign	Tag
Campaign B	remarketing
Campaign E	banner

Parameters:

input_table (str) – Table that should be filtered.
input_column (str) – Column in the input table which should be used for lookups in the reference table.
ref_table (str) – Reference table.
ref_column (str) – Column in the reference table that is used to build the lookup set.
output_table (str) – Output table. If not given, input table is used.
negative_match (bool) – Set to True if you need rows that DO NOT match.

JoinOnColumnStep

class JoinOnColumnStep(left_table, right_table, on=None, left_on=None, right_on=None, join_type=JoinType.INNER, output_table=None, do_sort=True, ambiguous_column_suffixes=('_left', '_right'), left_index=False, right_index=False)[source]

Performs an SQL-like join on two tables.

This step supports 4 types of joins:

ppc_robot_lib.utils.types.JoinType.INNER
ppc_robot_lib.utils.types.JoinType.OUTER
ppc_robot_lib.utils.types.JoinType.LEFT
ppc_robot_lib.utils.types.JoinType.RIGHT

You can join either by columns with same name from both tables, or you can provide list of columns in both left and right table. The lists has to be of equal length.

When performing the join, values of columns on corresponding indexes must be equal in order to match the rows.

For details on joins in Pandas, see merge().

Example:

>>> from ppc_robot_lib.steps.transformations import JoinOnColumnStep, JoinType
>>> JoinOnColumnStep("keywords", "adgroups",
...                  left_on=["CampaignName", "AdGroupName"],
...                  right_on=['CampaignName', "Name"],
...                  join_type=JoinType.INNER,
...                  output_table="keywords_with_adgroup_data")

Parameters:

left_table (str) – Left table.
right_table (str) – Right table.
on (str | list[str]) – Column names that must match in both tables. Exclusive with left_on and right_on.
left_on (str | list[str]) – List of columns to match in the left table. Must be the same length with right_on. Exclusive with on.
right_on (str | list[str]) – List of columns to match in the left table. Must be the same length with right_on. Exclusive with on.
join_type (JoinType) – Type of the join. See merge() for explanation of individual types.
output_table (str) – Output table. If none is given, left table is used.
do_sort (bool) – Set to True if you would like to sort the tables by join columns.
ambiguous_column_suffixes – Pair of column suffixes that is used when tables with conflicting column names are used. The first element is used for the left table, the second one for the right table.

class JoinType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

INNER = 'inner': Inner join.

LEFT = 'left': Left join.

OUTER = 'outer': Outer join.

RIGHT = 'right': Right join.

SplitByColumnStep

class SplitByColumnStep(table, column, output_tables, default_table=None)[source]

Enables you to split one table to multiple tables according to value of a specified column.

output_tables is a dictionary, key is the name of resulting table and value can be either:

Value of the column to match.
Callable recieving a single value and returing bool. If the callable returns True, the row will be inserted into the given table.
List of values or callables with the same semantics as in the previous step. The row iwill be inserted if at least one of the values or callables match.

Please note that row order might not be preserved, as this operation internally uses group by and then operates on disjunct segments of table.

Example:

Let’s assume we have the following table:

in_table
A	X
A1	1
A3	3
A2	2
A4	4
A7	7
A6	6
A5	5
A9	9
A0	10
A8	8
AA	11

By executing step:

>>> from ppc_robot_lib.steps.transformations import SplitByColumnStep
>>> SplitByColumnStep('in_table', 'X', {
...     'o1': [1, 2, 3, lambda x: 8 <= x <= 10], 'o2': lambda x: 4 <= x <= 7
... }, default_table='default')

We would get the following tables:

o1
A	X
A1	1
A3	3
A2	2
A9	9
A0	10
A8	8

o2
A	X
A4	4
A7	7
A6	6
A5	5

default
A	X
AA	11

Parameters:

table (str) – Input table.
column (str) – Column to use for split.
output_tables (dict[str, Any | list[Any] | Callable[[Any], bool]]) – Dictionary of output tables. Key is name of the result table, value is either value to match, callable or list of values/callables to filter by.
default_table (str) – Table to be used when the value does not match any of the tables specified in the output_tables argument. If none is given, these rows are dropped.

Grouping and Aggregations

AggregateByColumnStep

class AggregateByColumnStep(input_table, group_by, aggregation, output_table, column=None, sort=False)[source]

Group by table by specific columns and perform an aggregation on the specified column. The aggregation argument must be a name of the aggregation – `sum, count, avg etc. See Pandas GroupBy Documentation for a complete list.

Unlike ppc_robot_lib.steps.transformations.group_by_and_aggregate.GroupByAndAggregateStep, this step returns a pandas.core.groupby.DataFrameGroupBy, not a pandas.DataFrame. Be careful when using the result.

See Splitting an object into groups for more information about working with pandas.core.groupby.DataFrameGroupBy objects.

Example:

Let’s assume the following table named input:

Campaign	Ad Group	Impressions
Camp 1	A	10
Camp 1	B	30
Camp 2	A	15
Camp 2	A	50
Camp 2	A	0

By executing:

>>> from ppc_robot_lib.steps.transformations import AggregateByColumnStep
>>> AggregateByColumnStep("input", group_by="Campaign", aggregation="sum", column="Impressions",
...                       output_table="output")

We would get the following pandas.core.groupby.DataFrameGroupBy object:

Campaign	Impressions
Camp 1	40
Camp 2	65

If you omit the column, the aggregation will be applied on all columns that are not present in the group_by argument.

Parameters:

input_table (str) – Table to used for grouping and aggregation.
group_by (str | list[str] | tuple[str, ...]) – Column name or list of columns for aggregation.
aggregation (str) –
Aggregation to apply, must be a name of a method from pandas.core.groupby.DataFrameGroupBy, e.g. sum, count or avg. Complete list can be found in the Pandas GroupBy Documentation.
output_table (str) – Outout table.
column (str) – Column to use for aggregation.
sort – Set to True if you would like to sort the result by the columns used for grouping.

GroupByAndAggregateStep

class GroupByAndAggregateStep(input_table, group_by, aggregations, output_table=None, sort=False)[source]

Groups the given table by one or more columns and performs aggregations on the columns.

The columns argument is a name of the column or a list of columns that should be used for the group-by operation.

The aggregations is a dictionary, key is a name of the column, value is either a name of the aggregation (method of pandas.core.groupby.DataFrameGroupBy, see Pandas GroupBy Documentation for list), or a callable.

If you use callable, the callable must return another callable. The outer callable will receive a pandas.DataFrame, the inner will receive only a column slice to aggregate. You can call methods like sum, avg on the slice, or use it in functions provided by numpy.

This is useful when you need, for instance, a weighted average.

Please note that columns that are not present in neither columns or aggregations will be not present in the output table.

Unlike ppc_robot_lib.steps.transformations.aggregate_by_column.AggregateByColumnStep and ppc_robot_lib.steps.transformations.group_by_column.GroupByColumnStep, this step returns a pandas.DataFrame, not a pandas.core.groupby.DataFrameGroupBy.

Example:

Assume we have the following table:

in_table
Campaign	AdGroup	Tag	Impressions	Quality
Campaign A	Group 1	product	10	1
Campaign A	Group 2	product	20	10
Campaign B	Group 1	remarketing	14	7
Campaign B	Group 2	remarketing	36	5
Campaign B	Group 3	remarketing	52	10
Campaign C	Group 1	dsa	14	5
Campaign D	Group 1	product	0	10
Campaign D	Group 2	product	10	3
Campaign E	Group 1	banner	35	8

We can use the following code to get weighted quality score:

>>> import numpy
>>> from ppc_robot_lib.steps.transformations import GroupByAndAggregateStep
>>> def agg_weighted_avg(df: pandas.DataFrame):
...     return lambda quality: numpy.average(quality, weights=df.loc[quality.index, 'Impressions'])
>>> GroupByAndAggregateStep("in_table", "Campaign", {
...     "Tag": "first",
...     "Impressions": "sum",
...     "Quality": agg_weighted_avg,
... })

Note

The aggregation function receives only a slice of the Quality column in the argument quality. However, we can use quality.index to access the corresponding columns in dataframe df. Expression

df.loc[quality.index, 'Impressions']

will get only the related values in the Impressions columns from the original, non-grouped dataframe.

The resulting table:

in_table
Campaign	Tag	Impressions	Quality
Campaign A	product	30	7
Campaign B	remarketing	102	7.8235…
Campaign C	dsa	14	5
Campaign D	product	10	3
Campaign E	banner	35	8

Parameters:

input_table (str) – Input table.
group_by (str | list[str] | tuple[str, ...]) – Column name or list of columns to group by.
aggregations (dict[str, str | Callable[[DataFrame], Callable[[Series], Any]]]) – Dictionary of aggregations. Key is name of the column and value is either name of the aggregation, or a callable. See documentation above for explanation.
output_table (str) – Name of the output table.
sort – Set to True if you would like to sort the result by the columns used for grouping.

GroupByColumnStep

class GroupByColumnStep(input_table, columns, output_table, sort=False)[source]

Groups a table by specified column or columns.

Unlike ppc_robot_lib.steps.transformations.group_by_and_aggregate.GroupByAndAggregateStep, this step returns a pandas.core.groupby.DataFrameGroupBy, not a pandas.DataFrame. Be careful when using the result.

See Splitting an object into groups for more information about working with pandas.core.groupby.DataFrameGroupBy objects.

Example:

>>> from ppc_robot_lib.steps.transformations import GroupByColumnStep
>>> GroupByColumnStep("in_table", ["Campaign", "AdGroup"], output_table="out_table")

Parameters:

input_table (str) – Input table.
columns (str | list[str] | tuple[str, ...]) – Columns to group by.
output_table (str) – Output table.
sort – Set to True if you would like to sort the result by the columns used for grouping.

Computation

ComputeStep

class ComputeStep(table, columns)[source]

Computes one or more columns and add them to the table.

Columns are given as a dictionary, where key is a name of a new column and value is an expression. Expression can be either callable, or string with an expression for pandas.eval(). Syntax is briefly described at Expression evaluation via eval(). Please note that calllable is strongly preferred.

The callable receives a pandas.DataFrame as the first positional argument. If the callable also contains the task_ctx argument, it will receive an ppc_robot_lib.tasks.task_context.TaskContextInterface as a keyword argument.

The callable should return a value that is directly assignable to the column: pandas.Series or a single scalar value.

Example:

>>> from ppc_robot_lib.steps.transformations import ComputeStep
>>> ComputeStep("table", {
...     'Ctr1': lambda df: df['Clicks'] / df['Impressions'],
...     'Ctr2': 'Clicks / Impressions',
... })

Parameters:

table (str) – Table to use.
columns (dict[str, Callable[[DataFrame], Series]]) – Dictionary, where key is a name of a new column and value is an expression.

Sorting

SortStep

class SortStep(input_table, order, output_table=None)[source]

Sorts the table by one or more columns.

Example:

>>> from ppc_robot_lib.steps.transformations import SortStep
>>> from ppc_robot_lib.utils.types import Order, OrderBy
>>> SortStep("in_table", order=[
...     OrderBy("Impressions", Order.DESC),
...     OrderBy("CampaignName", Order.ASC),
... ])

Parameters:

input_table (str) – Input table.
order (list[OrderBy] | tuple[OrderBy, ...]) – List of ppc_robot_lib.utils.types.OrderBy objects that specify the sorting order.
output_table (str) – Output table. If none is given, input table is used.

Utility Operations

AddClientIdStep

class AddClientIdStep(table, column)[source]

Add a new column to the table that contains with Client ID from task context.

Example:

>>> from ppc_robot_lib.steps.transformations import AddClientIdStep
>>> AddClientIdStep("table", "client_id")

Parameters:

table (str) – Table to use.
column (str) – Column to add.

ProjectionStep

class ProjectionStep(input_table, columns, output_table=None)[source]

Projection creates a new table containing only the given columns, in the given order.

Example:

Let’s assume we have a table with columns A, B, C, D, in that order. By executing:

>>> from ppc_robot_lib.steps.transformations import ProjectionStep
>>> ProjectionStep("in_table", columns=['C', 'B', 'D'], output_table="out_table")

we would get a table with columns C, B and D. Column A would be dropped.

Parameters:

input_table (str) – Input table.
columns (list[str]) – List of columns to transfer to the new table.
output_table (str) – Output table. If no table is given, input table is used.

RenameStep

class RenameStep(input_table, rename=None, drop=None, output_table=None, drop_not_renamed=False)[source]

Allows you to rename or drop columns from a table. You have to specify rename, drop or both. Optional argument drop_not_renamed allows you to drop columns that were not renamed by this operation.

Example:

>>> from ppc_robot_lib.steps.transformations import RenameStep
>>> RenameStep("in_table", rename={'original': 'new', 'col1': "renamed_col1"},
...            drop=['unnecessary_column', "internal_value_users_shall_not_see"])

Parameters:

input_table (str) – Input table.
rename (dict[str, str]) – Dictionary of columns, key is the original name and value is the new name.
drop (list[str]) – List of column names to drop. At least one of drop and rename must be given.
output_table (str) – Output table. If not given, operations are performed in-place on the input table.
drop_not_renamed (bool) – Drop columns that are not present in the rename argument.

SuffixColumnsStep

class SuffixColumnsStep(input_table, suffix, except_cols=None, output_table=None)[source]

Suffixes column names with the given suffix. If a column name is listed in except_cols, its name is kept as-is.

Example:

>>> from ppc_robot_lib.steps.transformations import SuffixColumnsStep
>>> SuffixColumnsStep('performance_yday', '_yday', except_cols=['Id', 'CampaignName'])

Parameters:

input_table (str) – Input table.
suffix (str) – Suffix to add to column names.
except_cols (list[str]) – List of collumn names that will not be suffixed.
output_table (str) – Output table. If none is given, input table is used.