I have a dataframe in which some columns need to hold multiple values per entry. Below is a dataframe with a "runtimes" column that holds the runtimes of a program under various conditions:

import pandas
df = [{"condition": "a", "runtimes": [1,1.5,2]}, {"condition": "b", "runtimes": [0.5,0.75,1]}]
df = pandas.DataFrame(df)

This makes a dataframe:

  condition        runtimes
0         a     [1, 1.5, 2]
1         b  [0.5, 0.75, 1]

How can I work with this dataframe and get pandas to treat these values as numeric lists? For example, how can I calculate the mean of the "runtimes" column across the rows?

df["runtimes"].mean()

This gives the error: "Could not convert [1, 1.5, 2, 0.5, 0.75, 1] to numeric"

It would be useful to work with these dataframes and also to serialize them as CSV files, where a list like [1, 1.5, 2] gets converted into "1,1.5,2" so that it is still a single entry in the CSV file.

2 Answers

It feels like you're trying to make Pandas be something it is not. If you always have 3 runtimes, you could make 3 columns. However, the more Pandas-esque approach is to normalize your data (no matter how many trials you have) into something like this:

import pandas as pd

df = [{"condition": "a", "trial": 1, "runtime": 1},
      {"condition": "a", "trial": 2, "runtime": 1.5},
      {"condition": "a", "trial": 3, "runtime": 2},
      {"condition": "b", "trial": 1, "runtime": .5},
      {"condition": "b", "trial": 2, "runtime": .75},
      {"condition": "b", "trial": 3, "runtime": 1}]
df = pd.DataFrame(df)

Then you can compute the mean per condition:

print(df.groupby('condition').mean())


           runtime  trial
condition                
a             1.50      2
b             0.75      2

The concept here is to keep the data tabular, with only one value per cell. If you want to apply functions to nested lists, you should be using lists, not Pandas dataframes.
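
If the data already exists in the list-per-cell layout from the question, it can probably be converted to this normalized form rather than retyped. The following is a minimal sketch, assuming a pandas version that provides DataFrame.explode (0.25 or later); the names wide, tidy and mean_runtime are only illustrative:

```python
import pandas as pd

# The question's original layout: one list of runtimes per condition.
wide = pd.DataFrame([
    {"condition": "a", "runtimes": [1, 1.5, 2]},
    {"condition": "b", "runtimes": [0.5, 0.75, 1]},
])

# explode() emits one row per list element. The exploded column keeps
# object dtype, so it is cast to float before any aggregation.
tidy = (
    wide.explode("runtimes")
        .rename(columns={"runtimes": "runtime"})
        .reset_index(drop=True)
)
tidy["runtime"] = tidy["runtime"].astype(float)

# Number the trials within each condition, starting from 1.
tidy["trial"] = tidy.groupby("condition").cumcount() + 1

# Mean runtime per condition, as in the groupby example above.
mean_runtime = tidy.groupby("condition")["runtime"].mean()
print(mean_runtime)
```

Selecting the "runtime" column before calling mean() avoids averaging the trial numbers as well.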

JD Long

It looks like pandas is trying to add up all the lists in the series and divide by the number of rows. This results in a list concatenation, and the result fails the numeric type check. This explains the list in your error.
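
The concatenation can be reproduced with plain Python lists, without pandas. This small sketch only illustrates how "+" behaves on lists; it does not show what pandas does internally:

```python
runtimes = [[1, 1.5, 2], [0.5, 0.75, 1]]

# "+" on two lists concatenates them instead of adding element-wise.
total = runtimes[0] + runtimes[1]
print(total)

# Dividing the concatenated list by the row count is not a numeric operation.
try:
    total / len(runtimes)
    division_failed = False
except TypeError:
    division_failed = True
print(division_failed)
```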

You could compute the mean like this:

import numpy
df['runtimes'].apply(numpy.mean)

Aside from that, pandas doesn't like working with lists as values. If your data is tabular, consider breaking the list out into three separate columns.
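
One way to break the lists out into columns might look like the sketch below. It assumes every list has the same length, and the column names run_1, run_2, run_3 are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame([
    {"condition": "a", "runtimes": [1, 1.5, 2]},
    {"condition": "b", "runtimes": [0.5, 0.75, 1]},
])

# Build one numeric column per list position, keeping the original index.
runs = pd.DataFrame(df["runtimes"].tolist(), index=df.index)
runs.columns = [f"run_{i + 1}" for i in runs.columns]  # hypothetical names

# Attach the new columns to the condition labels.
expanded = df[["condition"]].join(runs)

# With plain numeric columns, row-wise statistics work directly.
row_means = runs.mean(axis=1)
print(expanded)
print(row_means)
```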

Serializing the column would work in a similar way:

df['runtimes'].apply(lambda x: '"' + str(x)[1:-1] + '"')
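
To get exactly the "1,1.5,2" form asked for in the question, a possible variation is to join the values without spaces and let to_csv add the quoting itself. This is a sketch under that assumption, with a round trip through read_csv to check that the lists can be recovered; the names out, csv_text and restored are illustrative:

```python
import io

import pandas as pd

df = pd.DataFrame([
    {"condition": "a", "runtimes": [1, 1.5, 2]},
    {"condition": "b", "runtimes": [0.5, 0.75, 1]},
])

out = df.copy()
# Join the values with commas and no spaces. The CSV writer quotes any
# field that contains the delimiter, so no manual quotes are needed.
out["runtimes"] = out["runtimes"].apply(lambda xs: ",".join(str(x) for x in xs))
csv_text = out.to_csv(index=False)
print(csv_text)

# Reading back: split each string on commas and convert to floats.
restored = pd.read_csv(io.StringIO(csv_text))
restored["runtimes"] = restored["runtimes"].apply(
    lambda s: [float(x) for x in s.split(",")]
)
print(restored)
```

Note that the values come back as floats, so an original integer such as 1 is restored as 1.0.
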
Mike