Most of us working in Python do so because of the powerful set of libraries that have been built for it. One such library I would like to discuss today is the ubiquitous `pandas`.
For the two people who have not heard of `pandas`, it allows you to:
Read tabular data from a multitude of sources (e.g. CSV, Excel, databases);
Clean up that data;
Manipulate it to fit your need (e.g. adding columns defined by formulas, grouping and aggregations);
Display your data as plots; and
Export your data to any format you would like (see the short sketch below).
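To make this concrete, here is a minimal sketch of such a workflow. The file name and column names ("date", "product", "amount") are made up for illustration:
import pandas as pd

# Hypothetical input file and columns, for illustration only
df_summary = (
    pd.read_csv("sales.csv", parse_dates=["date"])       # read tabular data
    .dropna(subset=["amount"])                            # clean it up
    .assign(amount_eur=lambda df: df["amount"] * 0.92)    # add a derived column
    .groupby("product", as_index=False)
    .agg(total=("amount_eur", "sum"))                     # aggregate
)
df_summary.plot(x="product", y="total", kind="bar")       # display as a plot
df_summary.to_excel("sales_summary.xlsx", index=False)    # export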
In this post, we are going to discuss some dos and don'ts when working with `pandas`. If you want to dig deeper into these best practices, I recommend the excellent book Effective Pandas: Patterns for Data Manipulation by Matt Harrison, which was the inspiration for this post.
The concept of "tidy data"
If you have worked with the R language, chances are you are familiar with the concept of tidy data, since it is the backbone of the very popular tidyverse collection of packages.
In a paper published in the Journal of Statistical Software in 2014, Hadley Wickham (creator of the tidyverse) details the main tenets of so-called "tidy data", and the benefits they bring to our data analysis work. The concept of tidy data is language-agnostic, so we can apply it in our work with the `pandas` library.
A table holds values, each of which belongs to both an observation and a variable. In a given dataset, an observation contains all of the measurements made on a given unit (e.g. a financial instrument) across attributes, while a variable contains all of the measurements of the same underlying attribute (e.g. price) across units.
We would consider a dataset "tidy" if the following is true:
Each variable forms a column;
Each observation forms a row; and
Each type of observational unit forms a table.
But why should we bother having tidy data? Let us consider the example of the data retrieved from the Yahoo Finance API using the following code:
import pandas as pd
import yfinance as yf

df_prices = (
    yf
    .Tickers("MSFT AAPL TSLA")
    .download(period="5d")
    .loc[:, "Close"]
)
print(df_prices)
Date | AAPL | MSFT | TSLA |
---|---|---|---|
2025-02-05 | 232.215 | 413.29 | 378.17 |
2025-02-06 | 232.964 | 415.82 | 374.32 |
2025-02-07 | 227.38 | 409.75 | 361.62 |
2025-02-10 | 227.65 | 412.22 | 350.73 |
2025-02-11 | 232.62 | 411.44 | 328.5 |
Now what is wrong with that? Well, first of all, the data is poorly labeled: it is not clear at first glance what kind of values this table contains. To illustrate a second issue with this format, let us consider the following exercise:
We want to compute the value of a portfolio made up of these three stocks, with quantities provided in a separate dataframe `df_quantities`, as shown below.
Date | AAPL | MSFT | TSLA |
---|---|---|---|
2025-02-05 | 1 | 0 | 1 |
2025-02-06 | 1 | 1 | 1 |
2025-02-07 | 2 | 1 | 2 |
2025-02-10 | 1 | 1 | 2 |
2025-02-11 | 0 | 1 | 2 |
One way to do so would be to merge the two dataframes using suffixes...
df_mrg = df_prices.merge(
    df_quantities,
    left_index=True,
    right_index=True,
    suffixes=("_PX", "_QT")
)
print(df_mrg)
Date | AAPL_PX | MSFT_PX | TSLA_PX | AAPL_QT | MSFT_QT | TSLA_QT |
---|---|---|---|---|---|---|
2025-02-05 | 232.215 | 413.29 | 378.17 | 1 | 0 | 1 |
2025-02-06 | 232.964 | 415.82 | 374.32 | 1 | 1 | 1 |
2025-02-07 | 227.38 | 409.75 | 361.62 | 2 | 1 | 2 |
2025-02-10 | 227.65 | 412.22 | 350.73 | 1 | 1 | 2 |
2025-02-11 | 232.62 | 411.44 | 328.5 | 0 | 1 | 2 |
... and combine the columns appropriately
df_ptf = pd.DataFrame(
    index=df_mrg.index,
    data={
        "value": (
            df_mrg["AAPL_PX"] * df_mrg["AAPL_QT"]
            + df_mrg["MSFT_PX"] * df_mrg["MSFT_QT"]
            + df_mrg["TSLA_PX"] * df_mrg["TSLA_QT"]
        )
    }
)
print(df_ptf)
Date | value |
---|---|
2025-02-05 | 610.385 |
2025-02-06 | 1023.1 |
2025-02-07 | 1587.75 |
2025-02-10 | 1341.33 |
2025-02-11 | 1068.44 |
But this wouldn't work very well for portfolios made up of other stocks, or with more than three positions. To generalize it, one must think of the process used to build this new value column:
For each date/stock pair, multiply the **quantity** and the **price** to obtain the position value. Then, for each date, sum these values to obtain the portfolio value.
Clearly, in this context, the variables of our input table are *quantity* and *price*, while each date/stock pair forms an observation. To make the original dataset tidy, we use `pd.melt` (also available as the `.melt` dataframe method):
df_prices_tidy = (
    pd.melt(
        df_prices.reset_index(),
        id_vars=["Date"],
        var_name="stock",
        value_name="price",
    )
    .set_index(["Date", "stock"])
)
print(df_prices_tidy)
Date | stock | price |
---|---|---|
2025-02-05 | AAPL | 232.215 |
2025-02-06 | AAPL | 232.964 |
2025-02-07 | AAPL | 227.38 |
2025-02-10 | AAPL | 227.65 |
2025-02-11 | AAPL | 232.62 |
2025-02-05 | MSFT | 413.29 |
2025-02-06 | MSFT | 415.82 |
2025-02-07 | MSFT | 409.75 |
2025-02-10 | MSFT | 412.22 |
2025-02-11 | MSFT | 411.44 |
2025-02-05 | TSLA | 378.17 |
2025-02-06 | TSLA | 374.32 |
2025-02-07 | TSLA | 361.62 |
2025-02-10 | TSLA | 350.73 |
2025-02-11 | TSLA | 328.5 |
The same can be done with the table of quantities, and merging the two tidy dataframes now yields the table below (a sketch of these steps follows the table):
Date | stock | quantity | price |
---|---|---|---|
2025-02-05 | AAPL | 1 | 232.215 |
2025-02-06 | AAPL | 1 | 232.964 |
2025-02-07 | AAPL | 2 | 227.38 |
2025-02-10 | AAPL | 1 | 227.65 |
2025-02-11 | AAPL | 0 | 232.62 |
2025-02-05 | MSFT | 0 | 413.29 |
2025-02-06 | MSFT | 1 | 415.82 |
2025-02-07 | MSFT | 1 | 409.75 |
2025-02-10 | MSFT | 1 | 412.22 |
2025-02-11 | MSFT | 1 | 411.44 |
2025-02-05 | TSLA | 1 | 378.17 |
2025-02-06 | TSLA | 1 | 374.32 |
2025-02-07 | TSLA | 2 | 361.62 |
2025-02-10 | TSLA | 2 | 350.73 |
2025-02-11 | TSLA | 2 | 328.5 |
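For reference, here is one way to build the merged tidy dataframe `df_mrg_tidy` used below; this is a sketch which assumes `df_quantities` has the same wide layout (dates as the index, one column per ticker) as the original `df_prices`:
df_quantities_tidy = (
    pd.melt(
        df_quantities.reset_index(),
        id_vars=["Date"],
        var_name="stock",
        value_name="quantity",
    )
    .set_index(["Date", "stock"])
)

# Both frames now share the same (Date, stock) index, so the merge is unambiguous
df_mrg_tidy = df_quantities_tidy.merge(
    df_prices_tidy,
    left_index=True,
    right_index=True,
)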
Then, the value can be calculated for each observation, and an aggregation per date can be performed via the `groupby` and `agg` methods.
df_values = (
    df_mrg_tidy
    # We create a new column to store the position values
    .assign(value=lambda df: df["price"] * df["quantity"])
    # For each date, sum the "value" column and store into "ptf_value"
    .groupby("Date")
    .agg(ptf_value=("value", "sum"))
)
Date | ptf_value |
---|---|
2025-02-05 | 610.385 |
2025-02-06 | 1023.1 |
2025-02-07 | 1587.75 |
2025-02-10 | 1341.33 |
2025-02-11 | 1068.44 |
With tidy data, the data manipulation steps make sense from a semantic point of view. To obtain the portfolio value for each date, we first compute the value of each position at each date, using prices and quantities, and then we aggregate these values via a sum for each date in our table.
But how do we transform data to become tidy? The paper covers the techniques in detail, but it mostly boils down to "pivoting" (i.e. moving from a "long" table to a "wide" table) and "melting" (the opposite operation, which we used earlier). In `pandas`, the relevant methods are `.pivot`, `.pivot_table`, and `.melt`.
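As a quick illustration of the opposite direction, we can pivot our tidy price table back to the original wide layout (a sketch, reusing `df_prices_tidy` from above):
df_prices_wide = (
    df_prices_tidy
    .reset_index()
    # One row per Date, one column per stock, filled with prices
    .pivot(index="Date", columns="stock", values="price")
)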
Chaining
If you have worked with `pandas` code before, you may have seen data manipulation procedures that are a tad difficult to read because they are written in a procedural style, assigning values to temporary variables along the way.
Consider our example above. In most code bases, you may see that procedure written as follows:
df_mrg_tidy["value"] = df_mrg_tidy["price"] * df_mrg_tidy["quantity"]
df_grp = df_mrg_tidy.groupby("Date")
df_values = df_grp.agg(ptf_value=("value", "sum"))
Here, we are messing with `df_mrg_tidy` by adding a column that is only needed to compute `df_values`, which is not something we want.
A better approach is to perform our data manipulation in a single step, starting from an input dataset and producing the result we need. This style is close to a main tenet of functional programming: the absence of side effects.
To do so, we use method chaining, which takes advantage of the fact that `pandas` methods usually return a new, modified dataframe, on which further methods can then be called. Recall our previous example, which uses the `assign` method to create a column rather than the `df[col] = ...` pattern used above.
df_values = (
    df_mrg_tidy
    # We create a new column to store the position values
    .assign(value=lambda df: df["price"] * df["quantity"])
    # For each date, sum the "value" column and store into "ptf_value"
    .groupby("Date")
    .agg(ptf_value=("value", "sum"))
)
The flow is now pretty clear: we start from `df_mrg_tidy` to produce `df_values`. The steps to get there are the creation of a new column `value`, and an aggregation of that column via a sum, for each unique value in the `Date` column.
Ceci n'est pas une pipe
Suppose we have access to the following function, which parses the date columns of a dataframe using a given format (I have deliberately not written it as a chain, to avoid confusion).
def parse_date_columns(
    df: pd.DataFrame,
    columns: list[str],
    format: str
) -> pd.DataFrame:
    df_out = df.copy()
    for col in columns:
        df_out.loc[:, col] = pd.to_datetime(df_out[col], format=format)
    return df_out
Applying a function such as this one cannot be done directly when chaining, as... well the following does not quite look like a chain.
df_clean = (
    parse_date_columns(
        (
            df_start
            .rename(columns={"Date": "date", "Values": "value"})
            .dropna()
            .assign(squares=lambda df: df["value"] ** 2)
        ),
        columns=["date"],
        format="%Y-%m-%d",
    )
)
It naturally becomes more convoluted when multiple functions like this one are involved in our data manipulation process.
To integrate these functions into our data processing chains, we pass them as arguments to the ever-so-handy `.pipe` method. The above code is therefore equivalent to:
df_clean = (
    df_start
    .rename(columns={"Date": "date", "Values": "value"})
    .dropna()
    .assign(squares=lambda df: df["value"] ** 2)
    .pipe(parse_date_columns, columns=["date"], format="%Y-%m-%d")
)
Note that, in some situations, we may want to use `.pipe` even when `pandas` methods are available to us. Consider the simple example of filtering rows:
df_filtered = (
    df_start
    .loc[df_start["value"] > 3]
    .dropna()
)
Since we refer to `df_start` inside the `.loc[...]` call, we cannot easily switch the order of operations. Indeed, swapping the `.loc[...]` and `.dropna()` lines will not give the same result if `.dropna()` ends up removing rows, since `df_start` refers to the starting state of our dataframe. I usually like to avoid this situation, to ensure that my pipelines actually do what I expect them to do, but also to make them somewhat reusable.
df_filtered = (
    df_start
    .pipe(lambda df: df.loc[df["value"] > 3])
    .dropna()
)
Now, we can flip the two operations and get the same result, and we could copy the `.pipe(...)` line and paste it into another pipeline and it would work all the same, as long as a `value` column is present in that new dataframe. An alternative would have been the `.query` method, as shown below.
df_filtered = (
    df_start
    .query("value > 3")
    .dropna()
)
Chaining, overall, allows for cleaner data operations that read like a recipe. Indeed, the sequential order makes it much more readable, and ultimately reusable. As shown in one of our previous examples, you can also add comments inside your chain to improve the readability even more.
One last advantage of chaining is that it makes the dreaded `SettingWithCopyWarning` basically disappear, since we are not modifying existing dataframes on every line of code.
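For reference, below is a sketch of the kind of chained-indexing code that tends to trigger the warning (depending on your pandas version), next to its chained equivalent; the dataframe and column names are made up:
import pandas as pd

df = pd.DataFrame({"flag": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})

# Assigning into a filtered slice: it may be a view or a copy,
# which is exactly what SettingWithCopyWarning complains about
df_sub = df[df["flag"] == "a"]
df_sub["double"] = df_sub["value"] * 2

# Chained equivalent: each step returns a new dataframe, no ambiguity
df_sub = (
    df
    .loc[lambda df_: df_["flag"] == "a"]
    .assign(double=lambda df_: df_["value"] * 2)
)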
Take advantage of vectorization
One of the main reasons for the popularity of libraries like `pandas` and `numpy` is the combination of a convenient Python API and a powerful calculation engine. However, there are anti-patterns when working with `pandas`, some of which I have seen time and time again.
First of all, while you can technically loop over `pandas` dataframes via methods such as `.iterrows` and `.itertuples`, doing so is incompatible with chaining, and it is slow and verbose on top of it. I have yet to find a situation where I had to resort to `for` loops over dataframes using these methods, especially given the existence of the `.apply` method.
`.apply` is convenient because it lets you apply Python functions to elements of your dataframe, be it rows or individual values, while retaining the ability to chain. However, it should be a last resort, and specific methods should be preferred when they exist.
Let us see a simple example, where we try to determine the even numbers from random integers stored in a dataframe.
import timeit

setup_str = """
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={
        "value": np.random.randint(
            low=0,
            high=5000,
            size=(100_000,)
        )
    }
)
"""
with_apply = "df['value'].apply(lambda x: x % 2 == 0)"
with_method = "df['value'].mod(2) == 0"

print(timeit.timeit(with_apply, setup=setup_str, number=100))
print(timeit.timeit(with_method, setup=setup_str, number=100))
Approach | Timing (in seconds) |
---|---|
Using `.apply` | 0.9603517500218004 |
Using the `.mod` method | 0.023110125097446144 |
The reason the `.apply` approach is so much slower is simple: the function is called in a Python-level loop, once per value. In `pandas`, looping should be a last resort, and it is generally not required.
Let us consider a slightly less contrived example, where we want to create a column based on some criteria. Here, our alternative is `np.select`, which takes a list of boolean arrays and a list of outputs, and recombines them into a single array.
setup_str = """
import pandas as pd
import numpy as np

def greatest(a, b):
    if a > b:
        return "a"
    elif b > a:
        return "b"
    return None

df = pd.DataFrame(
    data={
        "a": np.random.randint(
            low=0,
            high=5000,
            size=(100_000,)
        ),
        "b": np.random.randint(
            low=0,
            high=5000,
            size=(100_000,)
        )
    }
)
"""
with_apply = """
df.assign(
    greatest=lambda df_: df_.apply(
        lambda row: greatest(**row), axis=1
    )
)
"""
with_method = """
df.assign(
    greatest=lambda df_: np.select(
        condlist=[
            df_['a'] > df_['b'],
            df_['b'] > df_['a'],
        ],
        choicelist=[
            'a',
            'b',
        ],
        default=None,
    )
)
"""
print(timeit.timeit(with_apply, setup=setup_str, number=100))
print(timeit.timeit(with_method, setup=setup_str, number=100))
Approach | Timing (in seconds) |
---|---|
Using `.apply` | 38.27637587499339 |
Using `np.select` | 0.375061666010879 |
A good question to ask yourself when using `.apply` is whether there is another solution built around vectors/arrays rather than individual values, as those solutions will generally be more performant. These solutions can either be dataframe methods or `numpy` functions, like `where` or `select`.
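For instance, a simpler variant of the example above can be written with `np.where`, which picks between two outputs based on a single condition. This is a sketch reusing the `df` with columns `a` and `b` from the setup code; unlike the `np.select` version, ties are labeled "a" here:
import numpy as np

df_labeled = df.assign(
    # True -> "a", False -> "b", evaluated on whole columns at once
    greatest=lambda df_: np.where(df_["a"] >= df_["b"], "a", "b")
)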
Some snippets
Here are a couple of real-world processing-pipeline snippets which you may find useful.
Concatenation of multiple tables stored in .csv files, with some light processing.
from datetime import datetime

df_loaded = (
    pd.concat(
        [
            (
                pd.read_csv(
                    f"./data_{date}.csv",
                    sep=";",
                    decimal=","
                )
                .dropna()
                .assign(file_date=datetime.strptime(date, "%Y%m%d"))
            )
            for date in ["20220930", "20220831", "20220731"]
        ],
        axis=0
    )
)
💡 Loading multiple tables into a single dataframe is faster with a single `pd.concat` call than by appending to a growing dataframe in a loop (all the more so since `DataFrame.append` was removed in pandas 2.0).
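For contrast, the loop-based approach it replaces would look something like this sketch, which re-copies all previously loaded rows at every iteration:
df_loaded = pd.DataFrame()
for date in ["20220930", "20220831", "20220731"]:
    df_new = (
        pd.read_csv(f"./data_{date}.csv", sep=";", decimal=",")
        .dropna()
        .assign(file_date=datetime.strptime(date, "%Y%m%d"))
    )
    # Growing the dataframe one file at a time: simple, but wasteful on copies
    df_loaded = pd.concat([df_loaded, df_new], axis=0)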
Assignment using a dictionary
What if we want to replace `NaN` values in a certain set of columns while chaining?
df_processed = (
    df_start
    .pipe(lambda df: df.assign(
        **{
            col: df[col].fillna(value=0)
            for col in ["price", "quantity", "dividend"]
        }
    ))
)
Here, we unpack a dictionary which holds the columns to be assigned as keys, and the processed pandas series as values. This is necessary because the columns are defined programmatically.
If you are not aware of what dictionary unpacking does, it transforms the dictionary into keyword arguments when the function is executed. For example, `f(**{"a": 2, "b": 3})` becomes `f(a=2, b=3)`.
Apply functions on groups
Previously, I mentioned that the `.apply` method should be avoided. One exception is when it is used right after a `.groupby` operation. Indeed, after grouping, `.apply` lets you apply a function that takes a dataframe as input and returns a dataframe to each group, which is of great use for chaining.
See the example below, where we wish to transform the `value` column into a percentage for each given `date`.
df_processed = (
    df_start
    .groupby("date", group_keys=False)
    .apply(
        lambda df: df.assign(pct=df["value"] / df["value"].sum())
    )
)
Note that this could have been solved differently, using a merge:
df_processed = (
    df_start
    .pipe(
        lambda df: df.merge(
            (
                df
                .groupby("date")
                .agg(total=("value", "sum"))
                .reset_index()
            ),
            on="date"
        )
    )
    .assign(pct=lambda df: df["value"] / df["total"])
    .drop(columns=["total"])
)
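Another option, not used in the snippets above, is `groupby(...).transform`, which broadcasts each group's aggregate back onto the original rows; a minimal sketch:
df_processed = (
    df_start
    .assign(
        # transform("sum") returns a series aligned with df_start's rows
        pct=lambda df: df["value"] / df.groupby("date")["value"].transform("sum")
    )
)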
Debugging chains
One thing you may wonder is how to go about debugging chains. One simple way of doing so is to implement your own diagnostic function, and call it with `.pipe` along your process.
def show(df):
    print(df.head())
    return df
df_debug = (
    df_mrg_tidy
    # We create a new column to store the position values
    .assign(value=lambda df: df["price"] * df["quantity"])
    # Useful for debugging
    .pipe(show)
    # For each date, sum the "value" column and store into "ptf_value"
    .groupby("Date")
    .agg(ptf_value=("value", "sum"))
)
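A variant of this helper that I find handy adds a label, so that several checkpoints in the same chain can be told apart; a sketch:
def show(df, label="", n=5):
    # Print a label and the first n rows, then hand the dataframe back
    # so that the chain can continue unchanged
    print(f"--- {label} ---")
    print(df.head(n))
    return df

df_debug = (
    df_mrg_tidy
    .assign(value=lambda df: df["price"] * df["quantity"])
    .pipe(show, label="after assign")
    .groupby("Date")
    .agg(ptf_value=("value", "sum"))
    .pipe(show, label="after aggregation")
)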
Conclusion
I hope you have found these tips useful, and that you will be able to implement them in your processing pipelines. This is of course only scratching the surface, and I recommend reading Matt Harrison's book for more.
As with everything, practice makes perfect, so I encourage you to build pipelines, and look at them critically in light of the elements mentioned.
Stay tuned for more!