358

I have a Pandas DataFrame with one column:

import pandas as pd

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})

       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

How can split this column of lists into two columns?

Desired result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
blackraven
  • 5,284
  • 7
  • 19
  • 45
bgame2498
  • 4,467
  • 5
  • 15
  • 19

13 Answers13

505

You can use the DataFrame constructor with lists created by to_list:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print (df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
print (df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for a new DataFrame:

df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
print (df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

A solution with apply(pd.Series) is very slow:

#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [121]: %timeit df2['teams'].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212915/discussion-between-amc-and-jezrael). – AMC May 01 '20 at 11:40
  • List method will reset index. If you want to keep index, copy index after construct df2 with `df1.index = d2.index`. – AMA Sep 09 '22 at 18:10
  • 1
    what about : `df1.apply(lambda x: x["teams"], result_type="expand",axis=1) ` – Roy Assis Dec 01 '22 at 09:16
  • @RoyAssis - Can you test performance? I think it is slow – jezrael Dec 01 '22 at 09:20
  • @jezrael Not that I think about it using apply is not performant. Ill check. – Roy Assis Dec 01 '22 at 09:27
  • Seems like the "tolist" method only works when all lists are the same length. – Paul Coccoli Jan 19 '23 at 18:31
  • Can I alert that this doesn't work for the general case where loading from, say a real csv, results in data the equivalent of `df = pd.DataFrame({"teams": [["SF, NYG"] for _ in range(7)]})`, nor also if the data has an unknown number of list items. – yeliabsalohcin Mar 09 '23 at 12:59
113

Much simpler solution:

pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])

Yields,

  team1 team2
-------------
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
7    SF   NYG

If you wanted to split a column of delimited strings rather than lists, you could similarly do:

pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
             columns=['team1', 'team2'])
AMC
  • 2,642
  • 7
  • 13
  • 35
Joe Davison
  • 1,738
  • 2
  • 13
  • 11
  • 37
    what if each list has uneven number of elements? – ikel Nov 03 '19 at 17:00
  • 7
    _If you wanted to split a column of delimited strings rather than lists, you could similarly do:_ `df["teams"].str.split('', expand=True)` already returns a DataFrame, so it would probably be simpler to just rename the columns. – AMC May 01 '20 at 11:11
  • Thanks @AMC that did it for me, but as my "list" of UUID's was actually a `str` pretending to be a list, first I had to apply a lambda to remove square brackets. – yeliabsalohcin Mar 09 '23 at 15:06
74

This solution preserves the index of the df2 DataFrame, unlike any solution that uses tolist():

df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']

Here's the result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
Kevin Markham
  • 5,778
  • 1
  • 28
  • 36
  • 1
    `.apply(pd.Series)` is easy to remember and type. Unfortunately, as stated in other answers, it is also very slow for large numbers of observations. _If_ the index to be preserved is easily accessible, preservation using the DataFrame constructor approach is as simple as passing the `index` argument to the constructor, as seen in other answers. In the middle of a method chain, one workaround is to store an intermediate Series or DataFrame using an assignment expression (Python 3.8+) and then access the index from there. – Attila the Fun Aug 30 '22 at 15:16
23

There seems to be a syntactically simpler way, and therefore easier to remember, as opposed to the proposed solutions. I'm assuming that the column is called 'meta' in a dataframe df:

df2 = pd.DataFrame(df['meta'].str.split().values.tolist())
mikkokotila
  • 1,403
  • 12
  • 16
  • This answer was helpful, and could be paired with this visual demonstration of the technique: https://www.youtube.com/watch?v=vPKwm1XZjp8 – yeliabsalohcin Mar 10 '23 at 10:47
14

I would like to recommend a more efficient and Pythonic way.

First define the DataFrame as original post:

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})

My solution:

%%timeit
df['team1'], df['team2'] = zip(*list(df['teams'].values))
>> 761 µs ± 8.35 µs per loop

In comparison, the most upvoted solution:

%%timeit
df[['team1','team2']] = pd.DataFrame(df.teams.tolist(), index=df.index)
df = pd.DataFrame(df['teams'].to_list(), columns=['team1','team2'])
>> 1.31 ms ± 11.2 µs per loop

My solution saves 40% time and is much shorter. The only thing you need to remember is how to unpack and reshape a two-dimension list by using zip(*list).

W. Ding
  • 587
  • 4
  • 15
12

List comprehension

A simple implementation with list comprehension (my favorite)

df = pd.DataFrame([pd.Series(x) for x in df.teams])
df.columns = ['team_{}'.format(x+1) for x in df.columns]

Timing on output:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.71 ms

Output:

team_1    team_2
0    SF    NYG
1    SF    NYG
2    SF    NYG
3    SF    NYG
4    SF    NYG
5    SF    NYG
6    SF    NYG
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Talis
  • 283
  • 3
  • 13
11

The previous solutions didn't work for me since I have nan observations in my dataframe. In my case df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index) yields:

object of type 'float' has no len()

I solve this using a list comprehension. Here is the replicable example:

import pandas as pd
import numpy as np
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
            ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2.loc[2,'teams'] = np.nan
df2.loc[4,'teams'] = np.nan
df2

Output:

        teams
0   [SF, NYG]
1   [SF, NYG]
2   NaN
3   [SF, NYG]
4   NaN
5   [SF, NYG]
6   [SF, NYG]

df2['team1']=np.nan
df2['team2']=np.nan

Solving with a list comprehension,

for i in [0,1]:
    df2['team{}'.format(str(i+1))]=[k[i] if isinstance(k,list) else k for k in df2['teams']]

df2

yields:

    teams   team1   team2
0   [SF, NYG]   SF  NYG
1   [SF, NYG]   SF  NYG
2   NaN        NaN  NaN
3   [SF, NYG]   SF  NYG
4   NaN        NaN  NaN
5   [SF, NYG]   SF  NYG
6   [SF, NYG]   SF  NYG
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Lucas
  • 1,166
  • 2
  • 14
  • 34
7

Here's another solution using df.transform and df.set_index:

>>> from operator import itemgetter
>>> df['teams'].transform({'item1': itemgetter(0), 'item2': itemgetter(1)})

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

Which of course can be generalized as:

>>> indices = range(len(df['teams'][0]))

>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

This approach has the added benefit of extracting the desired indices:

>>> df
                 teams
0  [SF, NYG, XYZ, ABC]
1  [SF, NYG, XYZ, ABC]
2  [SF, NYG, XYZ, ABC]
3  [SF, NYG, XYZ, ABC]
4  [SF, NYG, XYZ, ABC]
5  [SF, NYG, XYZ, ABC]
6  [SF, NYG, XYZ, ABC]

>>> indices = [0, 2]
>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})

  team1 team3
0    SF   XYZ
1    SF   XYZ
2    SF   XYZ
3    SF   XYZ
4    SF   XYZ
5    SF   XYZ
6    SF   XYZ
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Sayandip Dutta
  • 15,602
  • 4
  • 23
  • 52
5

Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:

pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

Timings:

In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [2]: %timeit df2['teams'].apply(pd.Series)

8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
ailurid
  • 51
  • 1
  • 2
1

If someone comes here to find a ready-made function, I wrote one.

  • it finds all columns with lists and unfolds them, if columns are not specified;
  • added columns are named as column_name_0, column_name_1, etc.;
  • the columns order is preserved in final dataframe;
  • if strict=True, it checks whether lists in a given column are of equal size.

Improvements and comments are appreciated.

def unfold_columns(df, columns=[], strict=False):
    assert isinstance(columns, list), "Columns should be a list of column names"
    if len(columns) == 0:
        columns = [
            column for column in df.columns 
            if df.applymap(lambda x: isinstance(x, list)).all()[column]
        ]
    else:
        assert(all([(column in df.columns) for column in columns])), \
            "Not all given columns are found in df"
    columns_order = df.columns
    for column_name in columns:
        if df[column_name].apply(lambda x: isinstance(x, list)).all():
            if strict:
                assert len(set(df[column_name].apply(lambda x: len(x)))) == 1, \
                    f"Lists in df['{column_name}'] are not of equal length"
            unfolded = pd.DataFrame(df[column_name].tolist())
            unfolded.columns = [f'{column_name}_{x}' for x in unfolded.columns]
            columns_order = [
                *columns_order[:list(columns_order).index(column_name)], 
                *unfolded.columns, 
                *columns_order[list(columns_order).index(column_name)+1:]
            ]
            df = df.join(unfolded).drop([column_name], axis=1)
    return df[columns_order]
praegustator
  • 365
  • 4
  • 12
1

To append two new columns to the existing DataFrame:

df[['team1', 'team2']] = df["teams"].to_list()
dougissi
  • 31
  • 2
0

you can try to use two times of apply to create new column 'team1' and 'team2' in your df

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
df["team1"]=df['teams'].apply(lambda x: x[0]  )
df["team2"]=df['teams'].apply(lambda x: x[1]  )
df

enter image description here

lu di
  • 11
  • On dataframes that come from a groupby aggregation, this is the only method that worked for me.. – Akantor Dec 02 '22 at 12:15
0

Summarizing all the answers. If you need just create new DataFrame with 2 columns

pd.DataFrame(df['teams'].tolist(), columns=['team1', 'team2'], index=df.index)

If you want to assign to the same df, you have several options.

  1. The shortest

    df[['team1', 'team2']] = df['teams'].tolist()

  2. The slowest (don't recommend, it might be 10x time slower or more without any advantages)

    df[['team1', 'team2']] = df['teams'].apply(pd.Series)

  3. And the fastest for some reason (almost 2x faster than first).

    df['team1'], df['team2'] = zip(*df['teams'].tolist())

So I would recommend 1. If you really need speed, you can try 3, but that looks strange and the advantage may disappear in future versions.

Leonid Mednikov
  • 943
  • 4
  • 13