1

I have avro data with the following keys: 'id, label, features'. id and label are string while features is a buffer of floats.

import dask.bag as db
avros = db.read_avro('data.avro')
df = avros.to_dataframe()
convert = partial(np.frombuffer, dtype='float64')
X = df.assign(features=lambda x: x.features.apply(convert, meta='float64'))

I eventually end up with this MCVE

  label id         features
0  good  a  [1.0, 0.0, 0.0]
1   bad  b  [1.0, 0.0, 0.0]
2  good  c  [0.0, 0.0, 0.0]
3   bad  d  [1.0, 0.0, 1.0]
4  good  e  [0.0, 0.0, 0.0]

my desired output would be:

  label id   f1   f2   f3
0  good  a  1.0  0.0  0.0
1   bad  b  1.0  0.0  0.0
2  good  c  0.0  0.0  0.0
3   bad  d  1.0  0.0  1.0
4  good  e  0.0  0.0  0.0

I tried some ways that are like pandas, namely df[['f1','f2','f3']] = df.features.apply(pd.Series) did not work like in pandas.

I can traverse with a loop like

for i in range(len(features)):
df[f'f{i}'] = df.features.map(lambda x: x[i])

but in the real use-case I have thousand of features and this traverses the dataset thousands of times.

What would be the best way to achieve the desired outcome?

DeanLa
  • 1,871
  • 3
  • 21
  • 37
  • 2
    Possible duplicate of [Dask Dataframe split column of list into multiple columns](https://stackoverflow.com/questions/45246716/dask-dataframe-split-column-of-list-into-multiple-columns) – rpanai Oct 15 '19 at 12:45
  • in the suggested solution, it seems like it parses the series for every feature. This is not too bad in the MCVE, but in the real world I have thousands of features. this sounds expensive computationally. – DeanLa Oct 15 '19 at 13:15
  • actually there is a newer answer on the topic [link](https://stackoverflow.com/a/54636224/4819376) – rpanai Oct 15 '19 at 13:21
  • This is close, but works on string. My object is already a list. – DeanLa Oct 15 '19 at 14:46

1 Answers1

0
In [68]: import string
    ...: import numpy as np
    ...: import pandas as pd

In [69]: M, N = 100, 100
    ...: labels = np.random.choice(['good', 'bad'], size=M)
    ...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
    ...: features = np.empty((M,), dtype=object)
    ...: features[:] = list(map(list, np.random.randn(M, N)))
    ...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
    ...: df1 = df.copy()

In [70]: %%time
    ...: columns = [f"f{i:04d}" for i in range(N)]
    ...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 13.9 ms

In [71]: M, N = 1000, 1000
    ...: labels = np.random.choice(['good', 'bad'], size=M)
    ...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
    ...: features = np.empty((M,), dtype=object)
    ...: features[:] = list(map(list, np.random.randn(M, N)))
    ...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
    ...: df1 = df.copy()

In [72]: %%time
    ...: columns = [f"f{i:04d}" for i in range(N)]
    ...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 627 ms

In [73]: df1.shape
Out[73]: (1000, 1002)

Edit: About 2x faster than the original

In [79]: df2 = df.copy()

In [80]: %%time
    ...: features = df2.pop('features')
    ...: for i in range(N):
    ...:     df2[f'f{i:04d}'] = features.map(lambda x: x[i])
    ...:     
Wall time: 1.46 s

In [81]: df1.equals(df2)
Out[81]: True

Edit: Edit: A faster way of constructing the DataFrame gives an 8x improvement over the original:

In [22]: df1 = df.copy()

In [23]: %%time
    ...: features = pd.DataFrame({f"f{i:04d}": np.asarray(row) for i, row in enumerate(df1.pop('features').to_numpy())})
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 165 ms
Dave Hirschfeld
  • 768
  • 2
  • 6
  • 15