1

I want to write a script/function to process a fairly generic pandas dataframe. The data frame is the result of a pd.merge() operation between two data frames, one of which is supplied by the user in the form of a CSV file with arbitrary columns. I know all the values are text but that's about it.

I want to create a new column in the dataframe based on the values of a combination of other columns. This is a fairly easy problem when the column names are known, but in this case I know neither the names of the columns nor how they are to be combined in advance.

Suppose the dataframe df looks like this:

print(df)
     col1 col2
row1 abc  def
row2 ghi  jkl

And suppose also that the desired manipulation is "create a new column by adding suffix _x to the values in col2". It would be nice if the use could express a template for this manipulation using f-strings. In this case, the template might be ..

template = 'f"{col2}_x"'

The double-quoting is is intentional -- I want to delay the evaluation of the template until it is applied to the dataframe. Note that the user will know the names of at least some columns (the ones that they supplied via the CSV file) and so they can specify a template based on these column names.

I was hoping that I could simply use pd.eval() or more specifically df.eval() which evaluates an expression using the namespace provided by the column names.

Something like

df["new_col"] = df.eval(template)

But this returns ...

AttributeError: 'PythonExprVisitor' object has no attribute 'visit_JoinedStr'

.. and I gather that pd.eval() does not support f-strings as of March 2023: https://github.com/pandas-dev/pandas/issues/52310

I guess I could export the dataframe to a dictionary and then iterate over the rows, but I was really hoping for a neat, pandas-esque solution.

Update:

I've also been trying to achieve something simular with pd.DataFrame.apply()

In the concrete case, I can iterate over the dataframe as follows:

ser=pd.Series()
for index,row in df.iterrows():
    locals().update(row.to_dict())
    ser[index]=eval(template)
df["new_col"]=ser

This might have to do, but locals().update() feels like a hack and all the advice I have read recommends vectorising over iterating. I'm also having a tough time wrapping this in a function (named or lambda) to pass into pd.DataFrame.apply() because again I don't know the column names ahead of time.

j0hn
  • 25
  • 3

2 Answers2

0

Here is one way to do it with Python built-in function eval, which must be used with caution:

import pandas as pd

df = pd.DataFrame({"col1": ["abc", "ghi"], "col2": ["def", "jkl"]})

template = 'df[f"{name}"]+"_x"'

name = "col2"
df["new_col"] = eval(template)

Then:

print(df)
# Output

  col1 col2 new_col
0  abc  def   def_x
1  ghi  jkl   jkl_x
Laurent
  • 12,287
  • 7
  • 21
  • 37
  • 1
    Thanks. Doesn't exactly do what I wanted (the question could have been clearer) but this gets me close, so I'll accept it as the answer. – j0hn Jun 05 '23 at 10:38
0

I should have added that this is for a command line tool, and the user won't know the name of the internal database any more than they know the names of the internal columns. (Only the columns that they themselves provide in a CSV file.)

Taking inspiration from @Laurent I can ask the user to supply an "f-string like" string that is read in as a regular string. Then I can wrangle into a format that will work with Python eval()

So, if the user supplies "{col2}_x" and this input is available as a variable called template ...

import re

template=re.sub(r'\{',r'df[f"\{', template)
template=re.sub(r'\}',r'\}]"', template)
template=r'f"' + template + '"'

df['new_col'] = eval(template)

Pretty ugly, but should work.

Still hoping for a neat pandas-based solution, but not going to hold my breath.

j0hn
  • 25
  • 3