
I have a pandas DataFrame (this is only a small piece of it):

>>> d1
   y norm test  y norm train  len(y_train)  len(y_test)  \
0    64.904368    116.151232          1645          549
1    70.852681    112.639876          1645          549

                                    SVR RBF  \
0   (35.652207342877873, 22.95533537448393)
1  (39.563683797747622, 27.382483096332511)

                                        LCV  \
0  (19.365430594452338, 13.880062435173587)
1  (19.099614489458364, 14.018867136617146)

                                   RIDGE CV  \
0  (4.2907610988480362, 12.416745648065584)
1    (4.18864306788194, 12.980833914392477)

                                         RF  \
0   (9.9484841581029428, 16.46902345373697)
1  (10.139848213735391, 16.282141345406522)

                                           GB  \
0  (0.012816232716538605, 15.950164822266007)
1  (0.012814519804493328, 15.305745202851712)

                                             ET DATA
0  (0.00034337162272515505, 16.284800366214057)  j2m
1  (0.00024811554516431878, 15.556506191784194)  j2m
>>>

I want to split all the columns that contain tuples. For example, I want to replace the column LCV with the columns LCV-a and LCV-b.

How can I do that?

– Donbeo

6 Answers


You can do this by doing pd.DataFrame(col.tolist()) on that column:

In [2]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})

In [3]: df
Out[3]:
   a       b
0  1  (1, 2)
1  2  (3, 4)

In [4]: df['b'].tolist()
Out[4]: [(1, 2), (3, 4)]

In [5]: pd.DataFrame(df['b'].tolist(), index=df.index)
Out[5]:
   0  1
0  1  2
1  3  4

In [6]: df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)

In [7]: df
Out[7]:
   a       b  b1  b2
0  1  (1, 2)   1   2
1  2  (3, 4)   3   4

Note: in an earlier version, this answer recommended using df['b'].apply(pd.Series) instead of pd.DataFrame(df['b'].tolist(), index=df.index). That works as well (because it makes a Series of each tuple, which is then seen as a row of a dataframe), but it is slower and uses more memory than the tolist version, as noted by the other answers here (thanks to denfromufa).

– joris

  • is there a way to automate it due to the large number of columns? – Donbeo Apr 09 '15 at 22:56
  • Not directly, I think. But you can easily write a function for it using the above code (+ removing the original column); see the sketch after these comments. – joris Apr 09 '15 at 22:58
  • If you have a large number of columns, you may want to consider 'tidying' your data: http://vita.had.co.nz/papers/tidy-data.html You can do this using the melt function. – Axel Feb 15 '18 at 18:49
  • .apply(pd.Series) works fine, but for large datasets it consumes a lot of memory and can cause a MemoryError. – Yury Wallet Feb 02 '19 at 21:48
  • the pd.DataFrame(df['b'].tolist(), index=df.index) method doesn't work with missing values. The series method does work. – Nina van Bruggen Jul 21 '22 at 12:09
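
Following up on this comment thread, here is a minimal sketch of such a helper (the function name and the letter-suffix naming scheme are my own), assuming every value in a tuple column is either a tuple or missing:

import pandas as pd

def split_tuple_columns(df, sep='-'):
    """Replace each tuple-valued column with one column per tuple element."""
    out = df.copy()
    for name in df.columns:
        col = df[name].dropna()
        # only split columns whose non-missing values are all tuples
        if col.empty or not col.map(lambda v: isinstance(v, tuple)).all():
            continue
        parts = pd.DataFrame(col.tolist(), index=col.index)
        parts.columns = [f'{name}{sep}{chr(ord("a") + i)}' for i in parts.columns]
        # join aligns on the index, so rows with a missing tuple get NaN
        out = out.drop(columns=name).join(parts)
    return out

df = pd.DataFrame({'a': [1, 2], 'LCV': [(19.4, 13.9), (19.1, 14.0)]})
print(split_tuple_columns(df))
#    a  LCV-a  LCV-b
# 0  1   19.4   13.9
# 1  2   19.1   14.0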

The str accessor that is available to pandas.Series objects of dtype == object is actually an iterable.

Assume a pandas.DataFrame df:

df = pd.DataFrame(dict(col=[*zip('abcdefghij', range(10, 101, 10))]))

df

        col
0   (a, 10)
1   (b, 20)
2   (c, 30)
3   (d, 40)
4   (e, 50)
5   (f, 60)
6   (g, 70)
7   (h, 80)
8   (i, 90)
9  (j, 100)

We can test if it is an iterable:

from collections.abc import Iterable

isinstance(df.col.str, Iterable)

True

We can then assign from it like we do other iterables:

var0, var1 = 'xy'
print(var0, var1)

x y

Simplest solution

So in one line we can assign both columns:

df['a'], df['b'] = df.col.str

df

        col  a    b
0   (a, 10)  a   10
1   (b, 20)  b   20
2   (c, 30)  c   30
3   (d, 40)  d   40
4   (e, 50)  e   50
5   (f, 60)  f   60
6   (g, 70)  g   70
7   (h, 80)  h   80
8   (i, 90)  i   90
9  (j, 100)  j  100

Faster solution

Only slightly more complicated, we can use zip to create a similar iterable:

df['c'], df['d'] = zip(*df.col)

df

        col  a    b  c    d
0   (a, 10)  a   10  a   10
1   (b, 20)  b   20  b   20
2   (c, 30)  c   30  c   30
3   (d, 40)  d   40  d   40
4   (e, 50)  e   50  e   50
5   (f, 60)  f   60  f   60
6   (g, 70)  g   70  g   70
7   (h, 80)  h   80  h   80
8   (i, 90)  i   90  i   90
9  (j, 100)  j  100  j  100

Inline

Meaning, don't mutate the existing df.

This works because assign takes keyword arguments where the keywords are the new (or existing) column names and the values will be the values of the new column. You can use a dictionary and unpack it with ** and have it act as the keyword arguments.

So this is a clever way of assigning a new column named 'g' that is the first item in the df.col.str iterable and 'h' that is the second item in the df.col.str iterable:

df.assign(**dict(zip('gh', df.col.str)))

        col  g    h
0   (a, 10)  a   10
1   (b, 20)  b   20
2   (c, 30)  c   30
3   (d, 40)  d   40
4   (e, 50)  e   50
5   (f, 60)  f   60
6   (g, 70)  g   70
7   (h, 80)  h   80
8   (i, 90)  i   90
9  (j, 100)  j  100

My version of the list approach

Using modern iterable unpacking into list literals ([*df.col] and [*'ef']). Note: this is also inline, using join.

df.join(pd.DataFrame([*df.col], df.index, [*'ef']))

        col  e    f
0   (a, 10)  a   10
1   (b, 20)  b   20
2   (c, 30)  c   30
3   (d, 40)  d   40
4   (e, 50)  e   50
5   (f, 60)  f   60
6   (g, 70)  g   70
7   (h, 80)  h   80
8   (i, 90)  i   90
9  (j, 100)  j  100

The mutating version would be

df[['e', 'f']] = pd.DataFrame([*df.col], df.index)

Naive Time Test

Short DataFrame

Use the one defined above:

%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))

1.16 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
635 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
795 µs ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Long DataFrame

10^3 times bigger

df = pd.concat([df] * 1000, ignore_index=True)

%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))

11.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.33 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

– piRSquared

On much larger datasets, I found that .apply() is a few orders of magnitude slower than pd.DataFrame(df['b'].values.tolist(), index=df.index).

This performance issue was closed on GitHub, although I do not agree with the decision:

performance issue - apply with pd.Series vs tuple #11615

It is based on this answer.
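
A minimal way to reproduce the comparison yourself (the 100,000-row frame is just an illustration; absolute timings will vary by machine and pandas version):

import pandas as pd

df = pd.DataFrame({'b': [(1, 2)] * 100000})

# builds one pd.Series object per row -- slow
%timeit df['b'].apply(pd.Series)

# one list and one DataFrame constructor call -- fast
%timeit pd.DataFrame(df['b'].values.tolist(), index=df.index)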

– denfromufa
    `pd.DataFrame(df['b'].tolist())` without the `.values` seems to work just fine too. (And thanks, your solution is _much_ faster than `.apply()`) – Swier Sep 19 '16 at 07:41
  • I was worried about capturing the index, hence the explicit usage of .values. – denfromufa Sep 20 '16 at 03:17
  • solution by @denfromufa works super fast: df[['b1', 'b2']] = pd.DataFrame(df['b'].values.tolist(), index=df.index), and causes no MemoryError (as compared to .apply(pd.Series)) – Yury Wallet Feb 02 '19 at 21:56

I think a simpler way is:

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
>>> df
   a       b
0  1  (1, 2)
1  2  (3, 4)
>>> df['b_a'] = df['b'].str[0]
>>> df['b_b'] = df['b'].str[1]
>>> df
   a       b  b_a  b_b
0  1  (1, 2)    1    2
1  2  (3, 4)    3    4

– Jinhua Wang
  • This solution is indeed much simpler. – ApplePie Apr 09 '19 at 14:26
  • @jinhuawang it appears this is a hack on top of the `str` representation of a `pd.Series` object. Can you explain how this even works?! – denfromufa Apr 14 '19 at 18:30
  • I think it is just how the str object works? You can access the array object with str. – Jinhua Wang Apr 15 '19 at 09:47
  • What if some of the rows have tuples with a different number of values? (See the sketch after these comments.) – mammykins May 27 '19 at 19:35
  • I think this should be the accepted one. It's more 'pandas-onic'...if that's a thing. – Natacha Jan 16 '20 at 19:57
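
Regarding the question about tuples of different lengths: .str indexing returns NaN where an element is missing, so shorter tuples do not raise an error. A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'b': [(1, 2), (3,)]})
print(df['b'].str[1])  # the second tuple has no element at index 1
# 0      2
# 1    NaN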

A caveat of the second solution,

pd.DataFrame(df['b'].values.tolist())

is that it will explicitly discard the index, and add in a default sequential index, whereas the accepted answer

apply(pd.Series)

will not, since the result of apply will retain the row index. While the order is initially retained from the original array, Pandas will try to match the indices from the two dataframes.

This can be very important if you are trying to set the rows into a numerically indexed array, because pandas will automatically try to match the index of the new array to the old, and cause some distortion in the ordering.

A better hybrid solution is to set the index of the original dataframe onto the new one, i.e.,

pd.DataFrame(df['b'].values.tolist(), index=df.index)

which will retain the speed of the second method while ensuring the order and indexing are retained in the result.
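
A small sketch of the distortion described above, using a made-up non-default index:

import pandas as pd

df = pd.DataFrame({'b': [(1, 2), (3, 4)]}, index=[10, 20])

# Without index=..., the new frame is labelled 0, 1. Nothing matches
# df's index, so the assignment aligns on labels and fills NaN:
df[['b1', 'b2']] = pd.DataFrame(df['b'].values.tolist())
print(df)
#          b  b1  b2
# 10  (1, 2) NaN NaN
# 20  (3, 4) NaN NaN

# With index=df.index, the values land on the right rows:
df[['b1', 'b2']] = pd.DataFrame(df['b'].values.tolist(), index=df.index)
print(df)
#          b  b1  b2
# 10  (1, 2)   1   2
# 20  (3, 4)   3   4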

– Mike

pandas.Series.str.extract is another option. This example uses the data from https://opendataportal-lasvegas.opendata.arcgis.com/datasets/restaurant-inspections-open-data/explore:

import pandas as pd
df = pd.read_csv('raw_data.csv', low_memory=False)
df[['latitude', 'longitude']] = df['Location_1'].str.extract(pat=r'(-?\d+\.\d+),\s*(-?\d+\.\d+)')
df.to_csv('result.csv')
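
If you do not want to download that dataset, here is a self-contained sketch of the same pattern on made-up coordinates (str.extract returns one column per capture group; note that the extracted values are strings, not floats):

import pandas as pd

df = pd.DataFrame({'Location_1': ['(36.17, -115.14)', '(36.11, -115.17)']})
df[['latitude', 'longitude']] = df['Location_1'].str.extract(r'(-?\d+\.\d+),\s*(-?\d+\.\d+)')
print(df)
#          Location_1 latitude longitude
# 0  (36.17, -115.14)    36.17   -115.14
# 1  (36.11, -115.17)    36.11   -115.17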

– LingYan Meng