
Given a dataset as follows:

   id          words           tags
0   1  ['Φ', '20mm']  ['xc', 'PER']
1   2  ['Φ', '80mm']    ['xc', 'm']
2   3        ['EVA']         ['nz']
3   4       ['Q345']         ['nz']

The same data as a list of dicts (the output of `df.to_dict('records')`):

[{'id': 1, 'words': ['Φ', '20mm'], 'tags': ['xc', 'PER']},
 {'id': 2, 'words': ['Φ', '80mm'], 'tags': ['xc', 'm']},
 {'id': 3, 'words': ['EVA'], 'tags': ['nz']},
 {'id': 4, 'words': ['Q345'], 'tags': ['nz']}]

Each element in the `words` column has a corresponding part-of-speech (POS) tag at the same position in the `tags` column.

I would like to convert the dataframe to the following format:

   id words tags
0   1     Φ   xc
1   1  20mm  PER
2   2     Φ   xc
3   2  80mm    m
4   3   EVA   nz
5   4  Q345   nz

How could I achieve that in pandas? Thanks.

sammywemmy
ah bon
  • It would be easier if you shared the data in a reproducible form: `df.to_dict('records')`. Meanwhile, try [explode](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html): `df.explode(['words', 'tags'])` – sammywemmy Sep 01 '21 at 01:53
  • I updated the example data, please check. By the way, I tried `df.explode(['words', 'tags'])`, and it seems to generate the same result as the original dataframe. – ah bon Sep 01 '21 at 01:57
  • @sammywemmy, this will not work: explode accepts only scalars, so it can explode only a single column – Andreas Sep 01 '21 at 01:59
  • As a note: the accepted answer of the linked duplicate (`df.set_index(['id']).apply(pd.Series.explode).reset_index()`) appears about 3x faster than the accepted answer here in my testing. – Henry Ecker Sep 01 '21 at 02:05
  • If you are on pandas 1.3 or later, `explode` accepts a list/tuple of columns – sammywemmy Sep 01 '21 at 02:06
  • If the data has other columns besides `id` that do not need to be exploded, `df.set_index(['id']).apply(pd.Series.explode).reset_index()` will not work. – ah bon Sep 01 '21 at 02:10
  • You can add as many columns in the index as needed. `['id', 'col1', 'col2']` etc. – Henry Ecker Sep 01 '21 at 02:12
  • It raises an error: `TypeError: unhashable type: 'list'` – ah bon Sep 01 '21 at 02:14

1 Answer


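One option is to explode the `words` and `tags` columns separately, keeping `id`, and then concatenate the results column-wise. This relies on the two lists in each row having the same length, so that the exploded rows line up.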

import pandas as pd

df = pd.DataFrame(
    {"id":[1,2,3,4],
     "words":[['Φ', '20mm'],['Φ', '80mm'], ['EVA'], ['Q345']],
     "tags": [['xc', 'PER'],  ['xc', 'm'], ['nz'], ['nz']]})

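# Explode each list column on its own; both results share the same repeated index
a = df[["id", "words"]].explode("words")
b = df["tags"].explode()  # a Series, so `id` is not duplicated in the result
# Join side by side, then rebuild a clean 0..n-1 index
out = pd.concat([a, b], axis=1).reset_index(drop=True)
print(out)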
rpanai