🐼 Some tips when working with pandas


Most of us working in Python do so because of the powerful set of libraries that have been built for it. One such library I would like to discuss today is the ubiquitous pandas.

For the two people who have not heard of pandas, here is what it allows you to do (a short sketch follows the list):

  • Read tabular data from a multitude of sources (e.g. csv, Excel, databases);

  • Clean up that data;

  • Manipulate it to fit your need (e.g. adding columns defined by formulas, grouping and aggregations);

  • Display your data as plots; and

  • Export your data to any format you would like.
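
As a small illustration of that workflow, here is a minimal sketch; the input file trades.csv and its columns are hypothetical:

import pandas as pd

df_trades = (
    pd.read_csv("trades.csv")                                   # read tabular data
    .dropna()                                                   # clean it up
    .assign(notional=lambda df: df["price"] * df["quantity"])   # add a computed column
)

df_trades.groupby("ticker")["notional"].sum().plot(kind="bar")  # display as a plot
df_trades.to_csv("trades_clean.csv", index=False)               # export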

In this post, we are going to discuss some dos and don'ts when working with pandas. If any reader wants to dig deeper into these best practices, I recommend the excellent book Effective Pandas - Patterns for Data Manipulation by Matt Harrison, which was the inspiration for this post.

The concept of "tidy data"

If you have worked with the R language, chances are you are familiar with the concept of tidy data, since it is the backbone of the very popular collection of packages known as the tidyverse.

In a paper published in the Journal of Statistical Software in 2014, Hadley Wickham (creator of the tidyverse) details the main tenets of so-called "tidy data", and the benefits they bring to our data analysis work. The concept of tidy data is language-agnostic, so we can apply it in our work with the pandas library.

A table holds values, each of which belongs to both an observation and a variable. In a given dataset, an observation contains all of the measurements made on a given unit (e.g. a financial instrument) across attributes, while a variable contains all measurements of the same underlying attribute (e.g. price) across units.

We would consider a dataset "tidy" if the following is true:

  • Each variable forms a column;

  • Each observation forms a row; and

  • Each type of observational unit forms a table.

But why should we bother having tidy data? Let us consider the example of the data retrieved from the Yahoo Finance API using the following code:

import pandas as pd
import yfinance as yf

df_prices = (
    yf
    .Tickers("MSFT AAPL TSLA")
    .download(period="5d")
    .loc[:, "Close"]
)

print(df_prices)
Date           AAPL    MSFT    TSLA
2025-02-05  232.215  413.29  378.17
2025-02-06  232.964  415.82  374.32
2025-02-07   227.38  409.75  361.62
2025-02-10   227.65  412.22  350.73
2025-02-11   232.62  411.44   328.5

Now, what is wrong with that? First of all, the data is poorly labeled: it is not clear at first glance what kind of values this table contains. To illustrate a second issue with this format, let us consider the following exercise:

We want to compute the value of a portfolio made up of these three stocks, with quantities provided in a separate dataframe, df_quantities, as shown below.

Date        AAPL  MSFT  TSLA
2025-02-05     1     0     1
2025-02-06     1     1     1
2025-02-07     2     1     2
2025-02-10     1     1     2
2025-02-11     0     1     2

One way to do so would be to merge the two dataframes using suffixes...

df_mrg = df_prices.merge(
    df_quantities,
    left_index=True,
    right_index=True,
    suffixes=("_PX", "_QT")
)

print(df_mrg)
Date        AAPL_PX  MSFT_PX  TSLA_PX  AAPL_QT  MSFT_QT  TSLA_QT
2025-02-05  232.215   413.29   378.17        1        0        1
2025-02-06  232.964   415.82   374.32        1        1        1
2025-02-07   227.38   409.75   361.62        2        1        2
2025-02-10   227.65   412.22   350.73        1        1        2
2025-02-11   232.62   411.44    328.5        0        1        2

... and then combine the columns appropriately:

df_ptf = pd.DataFrame(
    index=df_mrg.index,
    data={
        "value": (
            df_mrg["AAPL_PX"] * df_mrg["AAPL_QT"]
            + df_mrg["MSFT_PX"] * df_mrg["MSFT_QT"]
            + df_mrg["TSLA_PX"] * df_mrg["TSLA_QT"]
        )
    }
)

print(df_ptf)
Date          value
2025-02-05  610.385
2025-02-06   1023.1
2025-02-07  1587.75
2025-02-10  1341.33
2025-02-11  1068.44

But this approach would not generalize well to portfolios made up of other stocks, or to portfolios with more than three positions. To generalize it, one must think about the process used to build this new value column:

For each date/stock pair, multiply the quantity and the price to obtain the position value. Then, for each date, sum these values to obtain the portfolio value.

Clearly, in this context, the variables of our input table are quantity and price, while each date/stock pair forms an observation. To make the original dataset tidy, we use the .melt method:

df_prices_tidy = (
    pd.melt(
        df_prices.reset_index(),
        id_vars=["Date"],
        var_name="stock",
        value_name="price",
    )
    .set_index(["Date", "stock"])
)

print(df_prices_tidy)
Date        stock    price
2025-02-05  AAPL   232.215
2025-02-06  AAPL   232.964
2025-02-07  AAPL    227.38
2025-02-10  AAPL    227.65
2025-02-11  AAPL    232.62
2025-02-05  MSFT    413.29
2025-02-06  MSFT    415.82
2025-02-07  MSFT    409.75
2025-02-10  MSFT    412.22
2025-02-11  MSFT    411.44
2025-02-05  TSLA    378.17
2025-02-06  TSLA    374.32
2025-02-07  TSLA    361.62
2025-02-10  TSLA    350.73
2025-02-11  TSLA     328.5

The same melt can be applied to the table of quantities, and the two tidy tables can then be merged on their (Date, stock) index to produce df_mrg_tidy (a sketch of these steps is shown after the table). The merged result looks like this:

Date        stock  quantity    price
2025-02-05  AAPL          1  232.215
2025-02-06  AAPL          1  232.964
2025-02-07  AAPL          2   227.38
2025-02-10  AAPL          1   227.65
2025-02-11  AAPL          0   232.62
2025-02-05  MSFT          0   413.29
2025-02-06  MSFT          1   415.82
2025-02-07  MSFT          1   409.75
2025-02-10  MSFT          1   412.22
2025-02-11  MSFT          1   411.44
2025-02-05  TSLA          1   378.17
2025-02-06  TSLA          1   374.32
2025-02-07  TSLA          2   361.62
2025-02-10  TSLA          2   350.73
2025-02-11  TSLA          2    328.5
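
Here is a minimal sketch of those two steps, assuming df_quantities shares the same wide layout as df_prices (one column per ticker, indexed by Date):

df_quantities_tidy = (
    pd.melt(
        df_quantities.reset_index(),
        id_vars=["Date"],
        var_name="stock",
        value_name="quantity",
    )
    .set_index(["Date", "stock"])
)

# Merge the two tidy tables on their common (Date, stock) index
df_mrg_tidy = df_quantities_tidy.merge(
    df_prices_tidy,
    left_index=True,
    right_index=True,
)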

Then, the value can be calculated for each observation, and an aggregation per date can be performed via the groupby and agg methods.

df_values = (
    df_mrg_tidy

    # We create a new column to store the position values
    .assign(value=lambda df: df["price"] * df["quantity"])

    # For each date, sum the "value" column and store into "ptf_value"
    .groupby("Date")
    .agg(ptf_value=("value", "sum"))
)

print(df_values)
Date        ptf_value
2025-02-05    610.385
2025-02-06     1023.1
2025-02-07    1587.75
2025-02-10    1341.33
2025-02-11    1068.44

With tidy data, the data manipulation steps make sense from a semantic point of view. To obtain the portfolio value for each date, we first compute the value of each position at each date, using prices and quantities, and then we aggregate these values via a sum for each date in our table.

But how do we transform data to make it tidy? The paper presents the techniques in detail, but mostly it comes down to "pivoting" (i.e. moving from a "long" table to a "wide" table) and "melting" (the opposite operation, which we used earlier). In pandas, the relevant methods are .pivot, .pivot_table, and .melt.
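
As a quick illustration, the original wide table of prices can be recovered from its tidy version with .pivot; a minimal sketch, reusing df_prices_tidy from above:

df_prices_wide = (
    df_prices_tidy
    .reset_index()
    .pivot(index="Date", columns="stock", values="price")
)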

Chaining

If you have worked with pandas code before, you may have seen data manipulation procedures that are a tad difficult to read because they are written in a procedural style, assigning values to temporary variables along the way.

Consider our example above. In many codebases, you will see that procedure written as follows:

df_mrg_tidy["value"] = df_mrg_tidy["price"] * df_mrg_tidy["quantity"]
df_grp = df_mrg_tidy.groupby("Date")
df_values = df_grp.agg(ptf_value=("value", "sum"))

Here, we are tampering with df_mrg_tidy by adding a column that is only needed to compute df_values, which is a side effect we would rather avoid.

A better approach is to perform our data manipulation in a single step, starting from an input dataset and producing the result we need. This style is close to one of the main tenets of functional programming: the absence of side effects.

To do so, we use method chaining, which takes advantage of the fact that pandas methods usually return a new, modified dataframe, on which further methods can then be called. Recall our previous example, which uses the assign method to create a column rather than the df[col] = ... pattern used above.

df_values = (
    df_mrg_tidy

    # We create a new column to store the position values
    .assign(value=lambda df: df["price"] * df["quantity"])

    # For each date, sum the "value" column and store into "ptf_value"
    .groupby("Date")
    .agg(ptf_value=("value", "sum"))
)

The flow is now pretty clear: we start from df_mrg_tidy to produce df_values. The steps to get there are the creation of a new column value, and an aggregation of that column via a sum, for each unique value in the Date column.

Ceci n'est pas une pipe

Suppose we have access to the following function, which parses the date columns of a dataframe using a given format (I have deliberately not written it as a chain, to avoid confusion).

def parse_date_columns(
    df: pd.DataFrame,
    columns: list[str],
    format: str
) -> pd.DataFrame:
    df_out = df.copy()

    for col in columns:
        df_out.loc[:, col] = pd.to_datetime(df_out[col], format=format)

    return df_out

Applying a function such as this one cannot be done directly when chaining, as... well the following does not quite look like a chain.

df_clean = (
    parse_date_columns(
        (
            df_start
            .rename(columns={"Date": "date", "Values": "value"})
            .dropna()
            .assign(squares=lambda df: df["value"] ** 2)
        ),
        columns=["date"],
        format="%Y-%m-%d",
    )
)

It naturally becomes more convoluted when multiple functions like this one are involved in our data manipulation process.

To integrate these functions into our data processing chains, we pass them as arguments to the ever-so-handy .pipe method. The above code is therefore equivalent to:

df_clean = (
    df_start
    .rename(columns={"Date": "date", "Values": "value"})
    .dropna()
    .assign(squares=lambda df: df["value"] ** 2)
    .pipe(parse_date_columns, columns=["date"], format="%Y-%m-%d")
)

Note that, in some situations, we may want to use .pipe even when pandas methods are available to us. Consider the simple example of filtering rows:

df_filtered = (
    df_start
    .loc[df_start["value"] > 3]
    .dropna()
)

Since we refer to df_start inside the .loc selection, we cannot easily switch the order of operations. Indeed, swapping the .loc and .dropna() steps will not behave the same way if .dropna() ends up removing rows, because the boolean mask is built from df_start, the starting state of our dataframe. I usually like to avoid this situation, both to ensure that my pipelines actually do what I expect them to do, and to make them somewhat reusable.

df_filtered = (
    df_start
    .pipe(lambda df: df.loc[df["value"] > 3])
    .dropna()
)

Now, we can flip the two operations and get the same result, and we could hypothetically copy that .pipe line and paste it into another pipeline, where it would work all the same, as long as a value column is present in that new dataframe. An alternative would have been the .query method, as shown below.

df_filtered = (
    df_start
    .query("value > 3")
    .dropna()
)

Overall, chaining allows for cleaner data operations that read like a recipe: the sequential order makes the code much more readable, and ultimately more reusable. As shown in one of the previous examples, you can also add comments inside your chain to improve readability even further.

One last advantage of chaining is that it makes the dreaded SettingWithCopyWarning basically disappear, since we are no longer modifying existing dataframes on every line of code.

Take advantage of vectorization

One of the main reasons for the popularity of libraries like pandas and numpy is the combination of a convenient Python API and a powerful calculation engine. However, there are anti-patterns when working with pandas, some of which I have seen time and time again.

First of all, while you can technically loop over pandas dataframes via methods such as .iterrows and .itertuples, doing so is incompatible with chaining, and it is slow and verbose on top of that. I have yet to find a situation where I had to resort to looping over a dataframe with these methods, especially given the existence of the .apply method.

.apply is convenient because it lets you apply python functions to elements in your dataframe, be it rows or individual values, while retaining the ability to use chaining. However, it should be a last resort and specific methods should be preferred, when those exist.

Let us look at a simple example, where we try to determine which of the random integers stored in a dataframe are even.

import timeit

setup_str = """
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={
        "value": np.random.random_integers(
            low=0,
            high=5000,
            size=(100_000,)
        )
    }
)
"""

with_apply = "df['value'].apply(lambda x: x % 2 == 0)"
with_method = "df['value'].mod(2) == 0"

print(timeit.timeit(with_apply, setup=setup_str, number=100))
print(timeit.timeit(with_method, setup=setup_str, number=100))
Timing (in seconds)
Using apply     0.9603517500218004
Using method    0.023110125097446144

The approach using .apply is significantly slower for a simple reason: the function is called in a Python loop, once per value in the series. In pandas, looping should be a last resort, and it is generally not required.

Let us consider a slightly less contrived example, where we want to create a column based on some criteria. Here, our alternative is np.select, which takes a list of arrays of booleans (i.e. True/False) and a list of outputs, and recombines them into a single array.

setup_str = """
import pandas as pd
import numpy as np

def greatest(a, b):
    if a > b:
        return "a"
    elif b > a:
        return "b"
    return None

df = pd.DataFrame(
    data={
        "a": np.random.random_integers(
            low=0,
            high=5000,
            size=(100_000,)
        ),
        "b": np.random.random_integers(
            low=0,
            high=5000,
            size=(100_000,)
        )
    }
)
"""

with_apply = """df.assign(
  greatest=lambda df_: df_.apply(
      lambda row: greatest(**row), axis=1))
"""

with_method = """
df.assign(
    odd_even=lambda df_: np.select(
        condlist=[
            df_['a'] > df_['b'],
            df_['b'] > df_['a'],
        ],
        choicelist=[
            'a',
            'b',
        ],
        default=None,
    )
)
"""

print(timeit.timeit(with_apply, setup=setup_str, number=100))
print(timeit.timeit(with_method, setup=setup_str, number=100))
Timing (in seconds)
Using apply        38.27637587499339
Using np.select    0.375061666010879

A good question to ask yourself when reaching for .apply is whether there is another solution built around vectors/arrays rather than individual values, as such solutions are generally more performant. These solutions can be either dataframe methods or numpy functions, like np.where or np.select.
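
For instance, when a condition has only two outcomes, np.where is an even simpler fit than np.select. A minimal sketch on a small, made-up dataframe (ties are labelled "a" here, unlike the None default above):

import numpy as np
import pandas as pd

df_small = pd.DataFrame(data={"a": [1, 5, 3], "b": [4, 2, 3]})

# np.where picks between two alternatives based on a boolean array
df_labelled = df_small.assign(
    greatest=lambda df_: np.where(df_["a"] >= df_["b"], "a", "b")
)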

Some snippets

Here are a couple of real-world processing-pipeline snippets that you may find useful.

Concatenation of multiple tables stored in .csv files, with some light processing.
from datetime import datetime

df_loaded = (
    pd.concat(
        [
            (
                pd.read_csv(
                    f"./data_{date}.csv",
                    sep=";",
                    decimal=","
                )
                .dropna()
                .assign(file_date=datetime.strptime(date, "%Y%m%d"))
            )
            for date in ["20220930", "20220831", "20220731"]
        ],
        axis=0
    )
)

💡

Loading multiple tables into a single dataframe is faster with a single pd.concat call than by appending them one at a time.

Assignment using a dictionary

What if we want to replace NaN values in a certain set of columns while chaining?

df_processed = (
    df_start
    .pipe(lambda df: df.assign(
        **{
            col: df[col].fillna(value=0)
            for col in ["price", "quantity", "dividend"]
        }
    ))
)

Here, we unpack a dictionary which holds the columns to be assigned as keys, and the processed pandas series as values. This is necessary because the columns are defined programmatically.

If you are not aware of what dictionary unpacking does, it transforms the dictionary into keyword arguments when the function is executed. For example, f(**{"a": 2, "b": 3}) becomes f(a=2, b=3).
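
As an aside, a variant of this pattern avoids the .pipe wrapper altogether: since .assign calls any callable value with the dataframe being built, each column name can be bound as a default argument of a lambda. A sketch using the same columns as above:

df_processed = df_start.assign(
    **{
        # c=col binds the current column name at definition time
        col: lambda df, c=col: df[c].fillna(value=0)
        for col in ["price", "quantity", "dividend"]
    }
)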

Apply functions on groups

Previously, I mentioned that the .apply method should be avoided. One exception is when it is used right after a .groupby operation. Indeed, after grouping, .apply lets you apply a function that takes a dataframe as input and returns a dataframe to each group, which is of great use for chaining.

See the example below, where we wish to transform the value column into a percentage of each date's total.

df_processed = (
    df_start
    .groupby("date", group_keys=False)
    .apply(
        lambda df: df.assign(pct=df["value"] / df["value"].sum())
    )
)

Note that this could have been solved differently, using a merge:

df_processed = (
    df_start
    .pipe(
        lambda df: df.merge(
            (
                df
                .groupby("date")
                .agg(total=("value", "sum"))
                .reset_index()
            ),
            on="date"
        )
    )
    .assign(pct=lambda df: df["value"] / df["total"])
    .drop(columns=["total"])
)

Debugging chains

One thing you may wonder is how to go about debugging chains. A simple way of doing so is to implement your own diagnostic function and call it with .pipe along the way.

def show(df):
    print(df.head())
    return df

df_debug = (
    df_mrg_tidy

    # We create a new column to store the position values
    .assign(value=lambda df: df["price"] * df["quantity"])

    # Useful for debugging: print the intermediate state, then pass it along
    .pipe(show)

    # For each date, sum the "value" column and store into "ptf_value"
    .groupby("Date")
    .agg(ptf_value=("value", "sum"))
)

Conclusion

I hope you have found these tips useful, and that you will be able to apply them in your own processing pipelines. This is of course only scratching the surface, and I recommend reading Matt Harrison's book to go further.

As with everything, practice makes perfect, so I encourage you to build pipelines and look at them critically in light of the elements mentioned above.

Stay tuned for more!