30

I have the following DataFrame, where Track ID is the row index. How can I split the string in the stats column into 5 columns of numbers?

Track ID    stats
14.0    (-0.00924175824176, 0.41, -0.742016492568, 0.0036830094242, 0.00251748449963)
28.0    (0.0411538461538, 0.318230769231, 0.758717081514, 0.00264000622468, 0.0106535783677)
42.0    (-0.0144351648352, 0.168438461538, -0.80870348637, 0.000816872566404, 0.00316572586742)
56.0    (0.0343461538462, 0.288730769231, 0.950844962874, 6.1608706775e-07, 0.00337262030771)
70.0    (0.00905164835165, 0.151030769231, 0.670257006716, 0.0121790506745, 0.00302182567957)
84.0    (-0.0047967032967, 0.171615384615, -0.552879463981, 0.0500316517755, 0.00217970256969)
Monica Heddneck
  • 2,973
  • 10
  • 55
  • 89
t_n
  • 343
  • 1
  • 3
  • 5
  • You will make things a lot easier for others by posting df.to_dict() of the dataframe instead of its string representation. – chthonicdaemon Mar 31 '15 at 14:03
  • 1
    The stats column, does it contain a string that looks like a tuple, or does it contain tuples? – joris Mar 31 '15 at 14:06
  • possible duplicate of [Convert a pandas "Series of pair arrays" to a "two-column DataFrame"?](http://stackoverflow.com/questions/29346512/convert-a-pandas-series-of-pair-arrays-to-a-two-column-dataframe) – TheBlackCat Mar 31 '15 at 14:11
  • Thank you for the input - I'm very much at the bottom of a steep learning curve. I wasn't aware of `df.to_dict()`, but it showed me that the `stats` column in my data are indeed tuples, and not strings. For example, using `[float(x[0]) for x in df['stats']]` gave me the 0th elements. – t_n Apr 02 '15 at 09:31

4 Answers4

59

And for the other case, assuming it are strings that look like tuples:

In [74]: df['stats'].str[1:-1].str.split(',', expand=True).astype(float)
Out[74]:
          0         1         2         3         4
0 -0.009242  0.410000 -0.742016  0.003683  0.002517
1  0.041154  0.318231  0.758717  0.002640  0.010654
2 -0.014435  0.168438 -0.808703  0.000817  0.003166
3  0.034346  0.288731  0.950845  0.000001  0.003373
4  0.009052  0.151031  0.670257  0.012179  0.003022
5 -0.004797  0.171615 -0.552879  0.050032  0.002180

(note: for older versions of pandas (< 0.16.1), you need to use return_type='frame' instead of the expand keyword)

By the way, if it are tuples and not strings, you can simply do the following:

pd.DataFrame(df['stats'].tolist(), index=df.index)
joris
  • 133,120
  • 36
  • 247
  • 202
  • Thanks for the input. I'm new to pandas, python, and scripting in general, so am still getting my head around the basics. The data in the `stats` column were indeed tuples. Creating a new DataFrame with `df2 = df['stats'].apply(pd.Series)` has moved me forwards greatly. Thanks again. – t_n Apr 02 '15 at 09:38
  • 6
    `return_type` has been deprecated in favor of `expand=True`. – tacaswell May 29 '16 at 19:17
  • Please don't use `apply(pd.Series)` (or recommend its usage). See [When should I ever use pandas apply() in my code?](https://stackoverflow.com/a/54432584/4909087) A better solution is to listify the column first. – cs95 Jan 30 '19 at 06:41
2

If you have a sequence of tuples and not strings, and you want them as DataFrame columns, this is the simplest approach:

df = pd.concat([df['Track ID'],pd.DataFrame(df['stats'].values.tolist())], axis=1)

If it is actually strings, you can first convert it to lists like so, then apply the above operation:

dfpart = pd.DataFrame(df['stats'].apply(lambda x: x.strip('()').split(', ')).values.tolist()).astype(float)
df = pd.concat([df['Track ID'], dfpart], axis=1)
TheBlackCat
  • 9,791
  • 3
  • 24
  • 31
0

Assuming you have a column which contains tuples (as it appears in your example) rather than strings, this will work:

df = pandas.DataFrame({'Track ID': [14, 28, 42], 'stats': [(1, 2, 3, 4, 5), (1, 2, 3, 4, 5), (1, 2, 3, 4, 5)]}).set_index("Track ID")

from operator import itemgetter
for i in range(5):
    df["Col {}".format(i)] = df.stats.apply(itemgetter(i))

If you actually have strings that look like tuples, you can parse them first and then apply the same pattern as above:

df = df2 = pandas.DataFrame({'Track ID': [14, 28, 42], 'stats': ["(1, 2, 3, 4, 5)", "(1, 2, 3, 4, 5)", "(1, 2, 3, 4, 5)"]}).set_index("Track ID")
df.stats = df2.stats.str.strip("()").str.split(", ")
chthonicdaemon
  • 19,180
  • 2
  • 52
  • 66
0

Since your stats column contains tuples of length 5, this does the job fairly succinctly

pd.concat([df['Track ID'], pd.DataFrame(df['stats'].tolist(), index = 
df.index, columns = [f'stats_{i}' for i in range(1, 6)])], axis = 1)
G-Higgins
  • 161
  • 1
  • 3