pandas: Split dataframe columns into separate rows

Question

Based on this SO question, I would like split my dataframe based on column var1. However, I have no delimiter between the letters.

import pandas as pd
a = pd.DataFrame([{'var1': 'abc', 'var2': 1},
                  {'var1': 'def', 'var2': 2}])
b = pd.DataFrame([{'var1': 'a', 'var2': 1},
                  {'var1': 'b', 'var2': 1},
                  {'var1': 'c', 'var2': 1},
                  {'var1': 'd', 'var2': 2},
                  {'var1': 'e', 'var2': 2},
                  {'var1': 'f', 'var2': 2}])

This is what I want to achieve.

>>> a
  var1  var2
0  abc     1
1  def     2
>>> b
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

.split() does not work on empty characters ("").

pd.concat([Series(row['var2'], row['var1'].split(','))              
                  for _, row in a.iterrows()]).reset_index()

therefore, this above does not work. Any idea how I can achieve that?

jezrael · Accepted Answer · 2018-11-12T12:02:44.793

If performance is important, use list comprehension:

df = pd.DataFrame([[x, j] for i, j in zip(a['var1'], a['var2']) for x in list(i)], 
                   columns=a.columns)
print (df)
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

Perfromance in small DataFrame:

In [215]: %timeit pd.DataFrame([[x, j] for i, j in zip(a['var1'], a['var2']) for x in list(i)], columns=a.columns)
355 µs ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [216]: %timeit pd.concat([pd.Series(row['var2'], list(row['var1'])) for _, row in a.iterrows()]).reset_index()
2.93 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#2k rows
a = pd.concat([a] * 1000, ignore_index=True)

In [217]: %timeit pd.DataFrame([[x, j] for i, j in zip(a['var1'], a['var2']) for x in list(i)], columns=a.columns)
2.82 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [218]: %timeit pd.concat([pd.Series(row['var2'], list(row['var1'])) for _, row in a.iterrows()]).reset_index()
1.8 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

EDIT:

General solution for multiple columns:

a = pd.DataFrame([{'var1': 'abc', 'var2': 1, 'var3':7},
                  {'var1': 'def', 'var2': 2, 'var3':5}])


b = pd.DataFrame([(y, *x[1:]) for x in a.values.tolist() for y in list(x[0])], 
                  columns=a.columns)
print (b)
  var1  var2  var3
0    a     1     7
1    b     1     7
2    c     1     7
3    d     2     5
4    e     2     5
5    f     2     5

#lower python versions
b = pd.DataFrame([(y,) + tuple(x[1:]) for x in a.values.tolist() for y in list(x[0])], 
                  columns=a.columns)
print (b)
  var1  var2  var3
0    a     1     7
1    b     1     7
2    c     1     7
3    d     2     5
4    e     2     5
5    f     2     5

That is great, thank you. How would i adjust your command if I my data frame has more than two columns (e.g. `var1`, `var2`, `var3`, `var4`)? — Revan, Nov 12 '18 at 11:57
@littlebird This is a duplicate, see https://stackoverflow.com/a/48197395/4909087 for the answer to exactly the question you just asked — cs95, Nov 12 '18 at 11:58

score 2 · Answer 2 · answered Nov 12 '18 at 11:44

2

Make the string a list:

pd.concat([pd.Series(row['var2'], list(row['var1'])) for _, row in a.iterrows()]).reset_index()

answered Nov 12 '18 at 11:44

Alex

6,610
3
20
38

pandas: Split dataframe columns into separate rows

2 Answers2