11

I have a pandas dataframe as follows:

ticker    account      value         date
aa       assets       100,200       20121231, 20131231
bb       liabilities  50, 150       20141231, 20131231

I would like to split df['value'] and df['date'] so that the dataframe looks like this:

ticker    account      value         date
aa       assets       100           20121231
aa       assets       200           20131231 
bb       liabilities  50            20141231
bb       liabilities  150           20131231

Would greatly appreciate any help.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
tan
  • 465
  • 2
  • 5
  • 7
  • Duplicate Question : https://stackoverflow.com/q/45846765/6660373 – Pygirl Jan 23 '21 at 13:19
  • Does this answer your question? [Efficient way to unnest (explode) multiple list columns in a pandas DataFrame](https://stackoverflow.com/questions/45846765/efficient-way-to-unnest-explode-multiple-list-columns-in-a-pandas-dataframe) – Pygirl Jan 23 '21 at 13:20

5 Answers5

16

You can first split columns, create Series by stack and remove whitespaces by strip:

s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)

Then concat both Series to df1:

df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])

Remove old columns value and date and join:

print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
  ticker      account value      date
0     aa       assets   100  20121231
1     aa       assets   200  20131231
2     bb  liabilities    50  20141231
3     bb  liabilities   150  20131231
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you jezrael and piRSquared for both answers!! jezrael, your approach worked efficiently. – tan Jul 29 '16 at 12:20
9

I'm noticing this question a lot. That is, how do I split this column that has a list into multiple rows? I've seen it called exploding. Here are some links:

So I wrote a function that will do it.

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)

But before we can use it, we need lists (or iterable) in a column.

Setup

df = pd.DataFrame([['aa', 'assets',      '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50',   '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])

df

enter image description here

split value and date columns:

df.value = df.value.str.split(',')
df.date = df.date.str.split(',')

df

enter image description here

Now we could explode on either column or both, one after the other.

Solution

explode(df, ['value','date'])

enter image description here


Timing

I removed strip from @jezrael's timing because I could not effectively add it to mine. This is a necessary step for this question as OP has spaces in strings after commas. I was aiming at providing a generic way to explode a column given it already has iterables in it and I think I've accomplished that.

code

def get_df(n=1):
    return pd.DataFrame([['aa', 'assets',      '100,200,200', '20121231,20131231,20131231'],
                         ['bb', 'liabilities', '50,50',   '20141231,20131231']] * n,
                        columns=['ticker', 'account', 'value', 'date'])

small 2 row sample

enter image description here

medium 200 row sample

enter image description here

large 2,000,000 row sample

enter image description here

Community
  • 1
  • 1
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I am very curious for timings ;) `iterritems` is slow, but on the other hand there is a lot operation like `stack`, `concat` and `join`, so maybe this can be comparable. – jezrael Jul 29 '16 at 07:07
  • I see difference in solution - I use `strip`. Can you add it to your solution too and then try timings again? I think you forget for it. – jezrael Jul 29 '16 at 15:00
  • @jezrael updated. Notice what I wrote just below ###Timing – piRSquared Jul 29 '16 at 15:40
2

I wrote explode function based on previous answers. It might be useful for anyone who want to grab and use it quickly.

def explode(df, cols, split_on=','):
    """
    Explode dataframe on the given column, split on given delimeter
    """
    cols_sep = list(set(df.columns) - set(cols))
    df_cols = df[cols_sep]
    explode_len = df[cols[0]].str.split(split_on).map(len)
    repeat_list = []
    for r, e in zip(df_cols.as_matrix(), explode_len):
        repeat_list.extend([list(r)]*e)
    df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
    df_explode = pd.concat([df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
                            for col in cols], axis=1)
    df_explode.columns = cols
    return pd.concat((df_repeat, df_explode), axis=1)

example given from @piRSquared:

df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])
explode(df, ['value', 'date'])

output

+-----------+------+-----+--------+
|    account|ticker|value|    date|
+-----------+------+-----+--------+
|     assets|    aa|  100|20121231|
|     assets|    aa|  200|20131231|
|liabilities|    bb|   50|20141231|
|liabilities|    bb|   50|20131231|
+-----------+------+-----+--------+
titipata
  • 5,321
  • 3
  • 35
  • 59
2

Pandas >= 0.25

df.value = df.value.str.split(',')
df.date = df.date.str.split(',')
df = df.explode('value').explode("date").reset_index(drop=True)

df:

    ticker  account      value  date
0   aa      assets       100    20121231
1   aa      assets       100    20131231
2   aa      assets       200    20121231
3   aa      assets       200    20131231
4   bb      liabilities  50     20141231
5   bb      liabilities  50     20131231
6   bb      liabilities  50     20141231
7   bb      liabilities  50     20131231
Pygirl
  • 12,969
  • 5
  • 30
  • 43
0

Because I'm too new, I'm not allowed to write a comment, so I write an "answer".

@titipata your answer worked really good, but in my opinion there is a small "mistake" in your code I'm not able to find for my self.

I work with the example from this question and changed just the values.

df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
               ['title2', 'publisher2', '2', '2.1,2.2']],
              columns=['titel', 'publisher', 'print', 'electronic'])

explode(df, ['print', 'electronic'])

    publisher   titel   print   electronic
0   publisher1  title1  1.1     1
1   publisher1  title1  1.2     2.1
2   publisher2  title2  2       2.2

As you see, in the column 'electronic' should be in row '1' the value '1' and not '2.1'.

Because of that, the hole DataSet would change. I hope someone could help me to find a solution for this.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Caro
  • 1
  • 1