
I have a dataset in which one of the columns contains a list in every element. I would like to flatten it, such that every list element gets a row of its own.

I managed to solve it with iterrows, dict and append (see below), but it is too slow on my real DataFrame, which is large. Is there a way to make things faster?

I can consider replacing the column of lists with another format (maybe a hierarchical df?) if that would make more sense.

EDIT: I have many columns, and some might change in the future. The only thing I know for sure is that I have the fields column. That's why I used dict in my solution.

A minimal example, creating a df to play with:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""
id|name|fields
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]
"""), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
print(df)

resulting df:

   id name        fields
0   1  abc  [qq, ww, rr]
1   2  efg  [zz, xx, rr]

my (slow) solution:

new_df = pd.DataFrame(index=[], columns=df.columns)

for _, i in df.iterrows():
    flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
    new_df = new_df.append(flattened_d)

Resulting with

    id name fields
0  1.0  abc     qq
1  1.0  abc     ww
2  1.0  abc     rr
0  2.0  efg     zz
1  2.0  efg     xx
2  2.0  efg     rr
– Yuval Atzmon

3 Answers


You can use numpy for better performance:

Both solutions use mainly numpy.repeat.

from itertools import chain
import numpy as np
import pandas as pd

vals = df.fields.str.len()
df1 = pd.DataFrame({
        "id": np.repeat(df.id.values, vals),
        "name": np.repeat(df.name.values, vals),
        "fields": list(chain.from_iterable(df.fields))})
df1 = df1.reindex(columns=df.columns)
print (df1)
   id name fields
0   1  abc     qq
1   1  abc     ww
2   1  abc     rr
3   2  efg     zz
4   2  efg     xx
5   2  efg     rr
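
To see what the repeat counts are doing: vals holds the length of each row's list, numpy.repeat duplicates each scalar value that many times, and chain.from_iterable flattens the lists so they line up with the repeated values. A minimal sketch on the example data:

import numpy as np
from itertools import chain

vals = [3, 3]  # lengths of [qq, ww, rr] and [zz, xx, rr]
print(np.repeat([1, 2], vals))
# [1 1 1 2 2 2]
print(list(chain.from_iterable([['qq', 'ww', 'rr'], ['zz', 'xx', 'rr']])))
# ['qq', 'ww', 'rr', 'zz', 'xx', 'rr']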

Another solution:

df[['id','name']].values converts the columns to a numpy array, numpy.repeat duplicates each row once per element of its list, numpy.hstack flattens the lists into a single array, and numpy.column_stack glues the two pieces back together:

df1 = pd.DataFrame(np.column_stack((df[['id','name']].values.
                   repeat(list(map(len,df.fields)),axis=0),np.hstack(df.fields))),
                   columns=df.columns)

print (df1)
  id name fields
0  1  abc     qq
1  1  abc     ww
2  1  abc     rr
3  2  efg     zz
4  2  efg     xx
5  2  efg     rr
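
For intuition, here are the same building blocks applied directly to the example data (a minimal sketch):

import numpy as np

repeated = np.array([[1, 'abc'], [2, 'efg']], dtype=object).repeat([3, 3], axis=0)
flat = np.hstack([['qq', 'ww', 'rr'], ['zz', 'xx', 'rr']])
print(np.column_stack((repeated, flat)))
# [[1 'abc' 'qq']
#  [1 'abc' 'ww']
#  [1 'abc' 'rr']
#  [2 'efg' 'zz']
#  [2 'efg' 'xx']
#  [2 'efg' 'rr']]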

A more general solution is to filter out the fields column and then append it in the DataFrame constructor, since it is always the last column:

cols = df.columns[df.columns != 'fields'].tolist()
print (cols)
['id', 'name']

df1 = pd.DataFrame(np.column_stack((df[cols].values.
                   repeat(list(map(len,df.fields)),axis=0),np.hstack(df.fields))), 
                   columns=cols + ['fields'])

print (df1)
  id name fields
0  1  abc     qq
1  1  abc     ww
2  1  abc     rr
3  2  efg     zz
4  2  efg     xx
5  2  efg     rr
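
As a side note: pandas 0.25 added DataFrame.explode, which performs exactly this flattening in one call. If your pandas version is recent enough, it may be the simplest option, though not necessarily the fastest:

# 'fields' already holds lists, as in the question's setup
df1 = df.explode('fields').reset_index(drop=True)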
– jezrael
  • Thanks. I have many columns, and some might change in the future. The only thing i know for sure is that I have the fields column. Is there a way to refactor your solution, s.t. I won't have to manually type 'id', 'name'? That's why in my solution I used dict() – Yuval Atzmon Feb 10 '17 at 20:50
  • Yes, in that case I think the second solution is better. Give me a minute. – jezrael Feb 10 '17 at 20:51
  • It works and is fast. Could you explain in the body the input to the constructor? – Yuval Atzmon Feb 10 '17 at 21:21
  • Note that the list is redundant. map(len, df.fields) already returns a list – Yuval Atzmon Feb 10 '17 at 22:26
  • Sorry, it is necessary for Python 3; in Python 2 you can omit it. – jezrael Feb 10 '17 at 22:59

If your CSV is many thousands of lines long, then using_string_methods (below) may be faster than using_iterrows or using_repeat:

With

csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)

In [210]: %timeit using_string_methods(csv)
10 loops, best of 3: 100 ms per loop

In [211]: %timeit using_itertuples(csv)
10 loops, best of 3: 119 ms per loop

In [212]: %timeit using_repeat(csv)
10 loops, best of 3: 126 ms per loop

In [213]: %timeit using_iterrows(csv)
1 loop, best of 3: 1min 7s per loop

So for a 10000-line CSV, using_string_methods is over 600x faster than using_iterrows, and marginally faster than using_repeat.


import numpy as np
import pandas as pd
try: from cStringIO import StringIO         # for Python2
except ImportError: from io import StringIO # for Python3

def using_string_methods(csv):
    df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
    other_columns = df.columns.difference(['fields']).tolist()
    fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
              .str.split(r',', expand=True))
    df = pd.concat([df.drop('fields', axis=1), fields], axis=1)
    result = (pd.melt(df, id_vars=other_columns, value_name='field')
              .drop('variable', axis=1))
    result = result.dropna(subset=['field'])
    return result


def using_iterrows(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    new_df = pd.DataFrame(index=[], columns=df.columns)

    for _, i in df.iterrows():
        flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
        new_df = new_df.append(flattened_d)
    return new_df

def using_repeat(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    cols = df.columns[df.columns != 'fields'].tolist()
    df1 = pd.DataFrame(np.column_stack(
        (df[cols].values.repeat(list(map(len,df.fields)),axis=0),
         np.hstack(df.fields))), columns=cols + ['fields'])
    return df1

def using_itertuples(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    other_columns = df.columns.difference(['fields']).tolist()
    data = []
    for tup in df.itertuples():
        data.extend([[getattr(tup, col) for col in other_columns]+[field] 
                     for field in tup.fields])
    return pd.DataFrame(data, columns=other_columns+['field'])

csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)

Generally, fast NumPy/Pandas operations are possible only when the data is in a native NumPy dtype (such as int64, float64, or strings). Once you place lists (a non-native NumPy dtype) in a DataFrame, the jig is up -- you are forced to use Python-speed loops to process the lists.

So to improve performance, you need to avoid placing lists in a DataFrame.
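
You can see this on the example frame: once the lists are placed in the column, its dtype becomes object, pandas' catch-all for arbitrary Python objects, which is the signal that vectorized operations no longer apply:

df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
print(df.fields.dtype)   # object -- each cell holds a Python list
print(df.id.dtype)       # int64  -- a native NumPy dtype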

using_string_methods loads the fields data as strings:

df = pd.read_csv(StringIO(csv), sep='|', dtype=None)

and avoids using the apply method (which is generally as slow as a plain Python loop):

df.fields = df.fields.apply(lambda s: s[1:-1].split(','))

Instead, it uses faster vectorized string methods to break the strings up into separate columns:

fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
          .str.split(r',', expand=True))
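
With the example data, fields at this point is an ordinary string DataFrame, one list element per column (a sketch of the intermediate result):

print(fields)
#     0   1   2
# 0  qq  ww  rr
# 1  zz  xx  rr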

Once you have the fields in separate columns, you can use pd.melt to reshape the DataFrame into the desired format.

pd.melt(df, id_vars=['id', 'name'], value_name='field')

By the way, you might be interested to see that with a slight modification using_iterrows can be just as fast as using_repeat. I show the changes in using_itertuples. df.itertuples tends to be slightly faster than df.iterrows, but the difference is minor. The majority of the speed gain is achieved by avoiding calling df.append in a for-loop since that leads to quadratic copying.
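
The difference in miniature (a sketch; note that DataFrame.append itself was removed in pandas 2.0):

# Quadratic: every append copies all the rows accumulated so far
new_df = pd.DataFrame(columns=df.columns)
for _, row in df.iterrows():
    new_df = new_df.append(row)

# Linear: accumulate plain Python rows, construct the DataFrame once
rows = [row.to_dict() for _, row in df.iterrows()]
new_df = pd.DataFrame(rows, columns=df.columns)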

– unutbu
  • Thanks. I like your approach, but in my case the original data doesn't really come from CSV, so that's not a problem. – Yuval Atzmon Feb 10 '17 at 21:25

You can break the lists in the fields column into multiple columns by applying pandas.Series to fields and then joining the result back to id and name, like so:

cols = df.columns[df.columns != 'fields'].tolist() # adapted from @jezrael 
df = df[cols].join(df.fields.apply(pandas.Series))

Then you can reshape the resulting new columns into rows using set_index and stack, and then reset the index:

df = df.set_index(cols).stack().reset_index()

Finally, drop the extraneous column generated by reset_index and rename the value column to "field":

df = df.drop(df.columns[-2], axis=1).rename(columns={0: 'field'})
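
Putting the three steps together in one chain (a sketch, assuming fields already contains lists as in the question's setup):

import pandas as pd

cols = df.columns[df.columns != 'fields'].tolist()
result = (df[cols].join(df.fields.apply(pd.Series))
          .set_index(cols).stack()
          .reset_index()
          .drop('level_2', axis=1)          # the stacked column number
          .rename(columns={0: 'field'}))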
– cmaher
  • The first command fails. error is `MergeError: No common columns to perform merge on` – Yuval Atzmon Feb 10 '17 at 20:48
  • Yep sorry, I meant to use `join`, which works based off the index values. I corrected my answer. – cmaher Feb 10 '17 at 20:51
  • Still doesn't work. Here is the result (flattened to one line): `id name level_2 0 0 1 abc fields [qq, ww, rr] 1 2 efg fields [zz, xx, rr]` – Yuval Atzmon Feb 10 '17 at 20:54
  • Also, it looks like the `reset_index` at the end results in an extraneous `level_2` column, which can simply be dropped (i.e. `df.drop('level_2', axis=1, inplace=True)`) – cmaher Feb 10 '17 at 20:55
  • But that doesn't solve the main issue, which is that the DF is not expanded – Yuval Atzmon Feb 10 '17 at 20:57
  • Ah, my answer depends on the fact that `fields` elements are actually lists; when reading in the dataframe with StringIO, each element is actually a string, like `'[qq,ww,rr]'`. Will the 'list' elements always exist as strings that need to be parsed? – cmaher Feb 10 '17 at 21:03
  • Sorry, I had a typo during copy pasting. Your solution works. – Yuval Atzmon Feb 10 '17 at 22:46