I have Avro data with the following keys: 'id', 'label', 'features'. id and label are strings, while features is a buffer of floats.
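from functools import partial
import numpy as np
import dask.bag as db
avros = db.read_avro('data.avro')
df = avros.to_dataframe()
# decode each raw buffer into a float64 array
convert = partial(np.frombuffer, dtype='float64')
# each cell holds an array, so the resulting column has object dtype
X = df.assign(features=lambda x: x.features.apply(convert, meta=('features', 'object')))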
I eventually end up with this MCVE:
  label id         features
0  good  a  [1.0, 0.0, 0.0]
1   bad  b  [1.0, 0.0, 0.0]
2  good  c  [0.0, 0.0, 0.0]
3   bad  d  [1.0, 0.0, 1.0]
4  good  e  [0.0, 0.0, 0.0]
My desired output would be:
  label id   f1   f2   f3
0  good  a  1.0  0.0  0.0
1   bad  b  1.0  0.0  0.0
2  good  c  0.0  0.0  0.0
3   bad  d  1.0  0.0  1.0
4  good  e  0.0  0.0  0.0
I tried some approaches that work in pandas, namely df[['f1','f2','f3']] = df.features.apply(pd.Series), but this did not work in dask the way it does in pandas.
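For reference, a minimal pandas-only sketch of the expansion I am after, on the sample data above. The names pdf, expanded and wide are only illustrative, and it builds the new columns with pd.DataFrame(...tolist()) rather than apply(pd.Series), which should give the same result on this sample.

```python
import numpy as np
import pandas as pd

# Hypothetical pandas copy of the sample frame.
pdf = pd.DataFrame({
    "label": ["good", "bad", "good", "bad", "good"],
    "id": ["a", "b", "c", "d", "e"],
    "features": [
        np.array([1.0, 0.0, 0.0]),
        np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 0.0, 0.0]),
        np.array([1.0, 0.0, 1.0]),
        np.array([0.0, 0.0, 0.0]),
    ],
})

# Turn each array into one row of a new frame, one column per feature.
expanded = pd.DataFrame(
    pdf["features"].tolist(),
    index=pdf.index,
    columns=["f1", "f2", "f3"],
)

# Replace the array column with the expanded feature columns.
wide = pdf.drop(columns="features").join(expanded)
print(wide)
```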
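I can traverse with a loop like
for i in range(len(features)):
    # bind i as a default argument: dask evaluates lazily, so a plain closure may only see the last i
    df[f'f{i}'] = df.features.map(lambda x, i=i: x[i], meta=(f'f{i}', 'float64'))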
but in the real use case I have thousands of features, and this traverses the dataset thousands of times.
What would be the best way to achieve the desired outcome?