Transformations
All steps can be imported directly from the ppc_robot_lib.steps.transformations
module.
Filtering, splitting and merging
ConcatStep
- class ConcatStep(input_tables, output_table)[source]
Concatenates rows of multiple tables into one. Internally uses
pandas.concat()
over the rows axes, this means that the same semantics applies.Example:
>>> from ppc_robot_lib.steps.transformations import ConcatStep >>> ConcatStep(["urls_ads", "urls_keywords", "urls_sitelinks"], output_table="urls")
FilterStep
- class FilterStep(input_table, filter_cb, output_table=None)[source]
Filters rows in the given table.
The filter can be either a callable, or list of callables. The callable receives a
pandas.DataFrame
as a single argument and must return a boolean vector that is used for selection.It you use the list of callables, their result vectors are “ANDed” together using the
&
operator. This means that the resulting rows must match all the conditions.Example:
>>> from ppc_robot_lib.steps.transformations import FilterStep >>> FilterStep("urls", [ ... lambda df: df['http_status_code'] >= 200, ... lambda df: df['http_status_code'] <= 299, ... ])
- Parameters:
input_table (
str
) – Name of the input table.filter_cb (
list
[Callable
[[DataFrame
],Series
|bool
]] |Callable
[[DataFrame
],Series
|bool
]) – Filter callback. Can be a lambda function or list of lambda functions. The row must match all conditions.output_table (
str
) – Name of the output table. If not given, the result is written to the input table.
FilterColumnMatchStep
- class FilterColumnMatchStep(input_table, input_column, ref_table, ref_column, output_table=None, negative_match=False)[source]
Filters rows in the table, selects only rows with values in the given column that matches values in another column from the reference table. Imagine it like filtering using INNER JOIN without joining.
Example:
Let’s assume we have the two tables,
in_table
andout_table
:in_table Campaign
Tag
Campaign A
product
Campaign B
remarketing
Campaign C
dsa
Campaign D
product
Campaign E
banner
ref_table Id
Ref Tag
1
product
2
dsa
3
responsive
By executing the code:
>>> from ppc_robot_lib.steps.transformations import FilterColumnMatchStep >>> FilterColumnMatchStep("in_table", "Tag", ref_table="ref_table", ref_column="Ref Tag", ... output_table="out_table")
We would get the following output:
out_table Campaign
Tag
Campaign A
product
Campaign C
dsa
Campaign D
product
If we set
negative_match=True
:>>> from ppc_robot_lib.steps.transformations import FilterColumnMatchStep >>> FilterColumnMatchStep("in_table", "Tag", ref_table="ref_table", ref_column="Ref Tag", ... output_table="out_table", negative_match=True)
The resulting table would look like this:
out_table Campaign
Tag
Campaign B
remarketing
Campaign E
banner
- Parameters:
input_table (
str
) – Table that should be filtered.input_column (
str
) – Column in the input table which should be used for lookups in the reference table.ref_table (
str
) – Reference table.ref_column (
str
) – Column in the reference table that is used to build the lookup set.output_table (
str
) – Output table. If not given, input table is used.negative_match (
bool
) – Set toTrue
if you need rows that DO NOT match.
JoinOnColumnStep
- class JoinOnColumnStep(left_table, right_table, on=None, left_on=None, right_on=None, join_type=JoinType.INNER, output_table=None, do_sort=True, ambiguous_column_suffixes=('_left', '_right'), left_index=False, right_index=False)[source]
Performs an SQL-like join on two tables.
This step supports 4 types of joins:
You can join either by columns with same name from both tables, or you can provide list of columns in both left and right table. The lists has to be of equal length.
When performing the join, values of columns on corresponding indexes must be equal in order to match the rows.
For details on joins in Pandas, see merge().
Example:
>>> from ppc_robot_lib.steps.transformations import JoinOnColumnStep, JoinType >>> JoinOnColumnStep("keywords", "adgroups", ... left_on=["CampaignName", "AdGroupName"], ... right_on=['CampaignName', "Name"], ... join_type=JoinType.INNER, ... output_table="keywords_with_adgroup_data")
- Parameters:
left_table (
str
) – Left table.right_table (
str
) – Right table.on (
str
|list
[str
]) – Column names that must match in both tables. Exclusive withleft_on
andright_on
.left_on (
str
|list
[str
]) – List of columns to match in the left table. Must be the same length withright_on
. Exclusive withon
.right_on (
str
|list
[str
]) – List of columns to match in the left table. Must be the same length withright_on
. Exclusive withon
.join_type (
JoinType
) – Type of the join. See merge() for explanation of individual types.output_table (
str
) – Output table. If none is given, left table is used.do_sort (
bool
) – Set toTrue
if you would like to sort the tables by join columns.ambiguous_column_suffixes – Pair of column suffixes that is used when tables with conflicting column names are used. The first element is used for the left table, the second one for the right table.
SplitByColumnStep
- class SplitByColumnStep(table, column, output_tables, default_table=None)[source]
Enables you to split one table to multiple tables according to value of a specified column.
output_tables
is a dictionary, key is the name of resulting table and value can be either:Value of the column to match.
Callable recieving a single value and returing
bool
. If the callable returnsTrue
, the row will be inserted into the given table.List of values or callables with the same semantics as in the previous step. The row iwill be inserted if at least one of the values or callables match.
Please note that row order might not be preserved, as this operation internally uses group by and then operates on disjunct segments of table.
Example:
Let’s assume we have the following table:
in_table A
X
A1
1
A3
3
A2
2
A4
4
A7
7
A6
6
A5
5
A9
9
A0
10
A8
8
AA
11
By executing step:
>>> from ppc_robot_lib.steps.transformations import SplitByColumnStep >>> SplitByColumnStep('in_table', 'X', { ... 'o1': [1, 2, 3, lambda x: 8 <= x <= 10], 'o2': lambda x: 4 <= x <= 7 ... }, default_table='default')
We would get the following tables:
o1 A
X
A1
1
A3
3
A2
2
A9
9
A0
10
A8
8
o2 A
X
A4
4
A7
7
A6
6
A5
5
default A
X
AA
11
- Parameters:
table (
str
) – Input table.column (
str
) – Column to use for split.output_tables (
dict
[str
,Any
|list
[Any
] |Callable
[[Any
],bool
]]) – Dictionary of output tables. Key is name of the result table, value is either value to match, callable or list of values/callables to filter by.default_table (
str
) – Table to be used when the value does not match any of the tables specified in theoutput_tables
argument. If none is given, these rows are dropped.
Grouping and Aggregations
AggregateByColumnStep
- class AggregateByColumnStep(input_table, group_by, aggregation, output_table, column=None, sort=False)[source]
Group by table by specific columns and perform an aggregation on the specified column. The
aggregation
argument must be a name of the aggregation –`sum
,count
,avg
etc. See Pandas GroupBy Documentation for a complete list.Unlike
ppc_robot_lib.steps.transformations.group_by_and_aggregate.GroupByAndAggregateStep
, this step returns apandas.core.groupby.DataFrameGroupBy
, not apandas.DataFrame
. Be careful when using the result.See Splitting an object into groups for more information about working with
pandas.core.groupby.DataFrameGroupBy
objects.Example:
Let’s assume the following table named
input
:Campaign
Ad Group
Impressions
Camp 1
A
10
Camp 1
B
30
Camp 2
A
15
Camp 2
A
50
Camp 2
A
0
By executing:
>>> from ppc_robot_lib.steps.transformations import AggregateByColumnStep >>> AggregateByColumnStep("input", group_by="Campaign", aggregation="sum", column="Impressions", ... output_table="output")
We would get the following
pandas.core.groupby.DataFrameGroupBy
object:Campaign
Impressions
Camp 1
40
Camp 2
65
If you omit the column, the aggregation will be applied on all columns that are not present in the
group_by
argument.- Parameters:
input_table (
str
) – Table to used for grouping and aggregation.group_by (
str
|list
[str
] |tuple
[str
,...
]) – Column name or list of columns for aggregation.aggregation (
str
) –Aggregation to apply, must be a name of a method from
pandas.core.groupby.DataFrameGroupBy
, e.g.sum
,count
oravg
. Complete list can be found in the Pandas GroupBy Documentation.output_table (
str
) – Outout table.column (
str
) – Column to use for aggregation.sort – Set to
True
if you would like to sort the result by the columns used for grouping.
GroupByAndAggregateStep
- class GroupByAndAggregateStep(input_table, group_by, aggregations, output_table=None, sort=False)[source]
Groups the given table by one or more columns and performs aggregations on the columns.
The
columns
argument is a name of the column or a list of columns that should be used for the group-by operation.The
aggregations
is a dictionary, key is a name of the column, value is either a name of the aggregation (method ofpandas.core.groupby.DataFrameGroupBy
, see Pandas GroupBy Documentation for list), or a callable.If you use callable, the callable must return another callable. The outer callable will receive a
pandas.DataFrame
, the inner will receive only a column slice to aggregate. You can call methods likesum
,avg
on the slice, or use it in functions provided by numpy.This is useful when you need, for instance, a weighted average.
Please note that columns that are not present in neither
columns
oraggregations
will be not present in the output table.Unlike
ppc_robot_lib.steps.transformations.aggregate_by_column.AggregateByColumnStep
andppc_robot_lib.steps.transformations.group_by_column.GroupByColumnStep
, this step returns apandas.DataFrame
, not apandas.core.groupby.DataFrameGroupBy
.Example:
Assume we have the following table:
in_table Campaign
AdGroup
Tag
Impressions
Quality
Campaign A
Group 1
product
10
1
Campaign A
Group 2
product
20
10
Campaign B
Group 1
remarketing
14
7
Campaign B
Group 2
remarketing
36
5
Campaign B
Group 3
remarketing
52
10
Campaign C
Group 1
dsa
14
5
Campaign D
Group 1
product
0
10
Campaign D
Group 2
product
10
3
Campaign E
Group 1
banner
35
8
We can use the following code to get weighted quality score:
>>> import numpy >>> from ppc_robot_lib.steps.transformations import GroupByAndAggregateStep >>> def agg_weighted_avg(df: pandas.DataFrame): ... return lambda quality: numpy.average(quality, weights=df.loc[quality.index, 'Impressions']) >>> GroupByAndAggregateStep("in_table", "Campaign", { ... "Tag": "first", ... "Impressions": "sum", ... "Quality": agg_weighted_avg, ... })
Note
The aggregation function receives only a slice of the
Quality
column in the argumentquality
. However, we can usequality.index
to access the corresponding columns in dataframedf
. Expressiondf.loc[quality.index, 'Impressions']
will get only the related values in the
Impressions
columns from the original, non-grouped dataframe.The resulting table:
in_table Campaign
Tag
Impressions
Quality
Campaign A
product
30
7
Campaign B
remarketing
102
7.8235…
Campaign C
dsa
14
5
Campaign D
product
10
3
Campaign E
banner
35
8
- Parameters:
input_table (
str
) – Input table.group_by (
str
|list
[str
] |tuple
[str
,...
]) – Column name or list of columns to group by.aggregations (
dict
[str
,str
|Callable
[[DataFrame
],Callable
[[Series
],Any
]]]) – Dictionary of aggregations. Key is name of the column and value is either name of the aggregation, or a callable. See documentation above for explanation.output_table (
str
) – Name of the output table.sort – Set to
True
if you would like to sort the result by the columns used for grouping.
GroupByColumnStep
- class GroupByColumnStep(input_table, columns, output_table, sort=False)[source]
Groups a table by specified column or columns.
Unlike
ppc_robot_lib.steps.transformations.group_by_and_aggregate.GroupByAndAggregateStep
, this step returns apandas.core.groupby.DataFrameGroupBy
, not apandas.DataFrame
. Be careful when using the result.See Splitting an object into groups for more information about working with
pandas.core.groupby.DataFrameGroupBy
objects.Example:
>>> from ppc_robot_lib.steps.transformations import GroupByColumnStep >>> GroupByColumnStep("in_table", ["Campaign", "AdGroup"], output_table="out_table")
Computation
ComputeStep
- class ComputeStep(table, columns)[source]
Computes one or more columns and add them to the table.
Columns are given as a dictionary, where key is a name of a new column and value is an expression. Expression can be either callable, or string with an expression for
pandas.eval()
. Syntax is briefly described at Expression evaluation via eval(). Please note that calllable is strongly preferred.The callable receives a
pandas.DataFrame
as the first positional argument. If the callable also contains thetask_ctx
argument, it will receive anppc_robot_lib.tasks.task_context.TaskContextInterface
as a keyword argument.The callable should return a value that is directly assignable to the column:
pandas.Series
or a single scalar value.Example:
>>> from ppc_robot_lib.steps.transformations import ComputeStep >>> ComputeStep("table", { ... 'Ctr1': lambda df: df['Clicks'] / df['Impressions'], ... 'Ctr2': 'Clicks / Impressions', ... })
Sorting
SortStep
- class SortStep(input_table, order, output_table=None)[source]
Sorts the table by one or more columns.
Example:
>>> from ppc_robot_lib.steps.transformations import SortStep >>> from ppc_robot_lib.utils.types import Order, OrderBy >>> SortStep("in_table", order=[ ... OrderBy("Impressions", Order.DESC), ... OrderBy("CampaignName", Order.ASC), ... ])
Utility Operations
AddClientIdStep
ProjectionStep
- class ProjectionStep(input_table, columns, output_table=None)[source]
Projection creates a new table containing only the given columns, in the given order.
Example:
Let’s assume we have a table with columns A, B, C, D, in that order. By executing:
>>> from ppc_robot_lib.steps.transformations import ProjectionStep >>> ProjectionStep("in_table", columns=['C', 'B', 'D'], output_table="out_table")
we would get a table with columns C, B and D. Column A would be dropped.
RenameStep
- class RenameStep(input_table, rename=None, drop=None, output_table=None, drop_not_renamed=False)[source]
Allows you to rename or drop columns from a table. You have to specify
rename
,drop
or both. Optional argumentdrop_not_renamed
allows you to drop columns that were not renamed by this operation.Example:
>>> from ppc_robot_lib.steps.transformations import RenameStep >>> RenameStep("in_table", rename={'original': 'new', 'col1': "renamed_col1"}, ... drop=['unnecessary_column', "internal_value_users_shall_not_see"])
- Parameters:
input_table (
str
) – Input table.rename (
dict
[str
,str
]) – Dictionary of columns, key is the original name and value is the new name.drop (
list
[str
]) – List of column names to drop. At least one ofdrop
andrename
must be given.output_table (
str
) – Output table. If not given, operations are performed in-place on the input table.drop_not_renamed (
bool
) – Drop columns that are not present in therename
argument.
SuffixColumnsStep
- class SuffixColumnsStep(input_table, suffix, except_cols=None, output_table=None)[source]
Suffixes column names with the given
suffix
. If a column name is listed inexcept_cols
, its name is kept as-is.Example:
>>> from ppc_robot_lib.steps.transformations import SuffixColumnsStep >>> SuffixColumnsStep('performance_yday', '_yday', except_cols=['Id', 'CampaignName'])