
I have a dataframe like this:

text                     text2           category 
sfsd sgvv                sfsdfdf         abc,xyz
zydf sefs sdfsd          drdg            yyy
dfsd dsrgd dggr          dgd             xyz
eter vxg wfe             fs              abc
dfvf ertet               dggdss          abc,xyz,bbb

I want an output like this:

text                     text2           category 
sfsd sgvv                sfsdfdf         abc
sfsd sgvv                sfsdfdf         xyz
zydf sefs sdfsd          drdg            yyy
dfsd dsrgd dggr          dgd             xyz
eter vxg wfe             fs              abc
dfvf ertet               dggdss          abc
dfvf ertet               dggdss          xyz
dfvf ertet               dggdss          bbb

Basically, I want to create a new row for each category whenever the category column holds two or more comma-separated values.

I tried this:

df1 = (df.assign(category = df['category'].str.split(','))
         .explode('category')
         .reset_index(drop=True))

But it seems to create far more rows than expected. My original df has many columns, not just text, text2 and category.

[Screenshot of my original dataframe. Here category = NER_Category.]

[Screenshot of the output of the code.]

john doe
  • It seems to be a data-related issue; when I test the code with sample data it works fine. :( – jezrael Jan 08 '20 at 13:15
  • Give me a few minutes, I will try to reproduce – john doe Jan 08 '20 at 13:15
  • I attached screenshots of my original dataframe, before and after. Look at the first row of the original df, then at the whole output df. As you can see in `NER_Category`, there is no `GPE` or `FAC` in the first row of the original df, but the code created many rows with them. – john doe Jan 08 '20 at 13:19
  • @jezrael Sorry if I failed to explain. I really appreciate you helping us out, man! – john doe Jan 08 '20 at 13:19
  • Hmm, it is hard to test without data. Is the data confidential? – jezrael Jan 08 '20 at 13:23
  • Does this answer your question? [Split (explode) pandas dataframe string entry to separate rows](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows) – Minions Jan 08 '20 at 13:25
  • I need to ask permission for sharing data. I will get back to you soon – john doe Jan 08 '20 at 13:27
  • zipa's answer solved this. jezrael, did you ever consider creating a Patreon and putting it in your profile? If I ever save some money, I would love to donate some to you for your great work, as a token of appreciation. – john doe Jan 08 '20 at 13:29

2 Answers


This should do it:

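# Index by every column except 'category', then split the category strings.
(df.set_index(df.columns.drop('category').tolist())['category']
   .str.split(',', expand=True)   # one column per comma-separated value
   .stack()                       # long format: one row per value
   .dropna()                      # discard padding from expand=True, if any remains
   .reset_index()
   .rename(columns={0:'category'})
   .loc[:, df.columns]            # restore the original column order
)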

              text    text2 category
0        sfsd sgvv  sfsdfdf      abc
1        sfsd sgvv  sfsdfdf      xyz
2  zydf sefs sdfsd     drdg      yyy
3  dfsd dsrgd dggr      dgd      xyz
4     eter vxg wfe       fs      abc
5       dfvf ertet   dggdss      abc
6       dfvf ertet   dggdss      xyz
7       dfvf ertet   dggdss      bbb
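
For anyone who wants to run this end to end, here is a minimal self-contained sketch on the sample data from the question. The frame construction and the names `other_cols`, `result` and `expected_rows` are illustrative and not part of the original post. It also compares the number of rows produced with the number of comma-separated categories in the input, which is a quick way to tell whether extra rows come from the data rather than from the code.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "text": ["sfsd sgvv", "zydf sefs sdfsd", "dfsd dsrgd dggr",
             "eter vxg wfe", "dfvf ertet"],
    "text2": ["sfsdfdf", "drdg", "dgd", "fs", "dggdss"],
    "category": ["abc,xyz", "yyy", "xyz", "abc", "abc,xyz,bbb"],
})

# Every column except the one being split (illustrative helper name).
other_cols = df.columns.drop("category").tolist()

result = (
    df.set_index(other_cols)["category"]
      .str.split(",", expand=True)   # one column per comma-separated value
      .stack()                       # one row per value
      .dropna()                      # discard padding cells, if any remain
      .reset_index()
      .rename(columns={0: "category"})
      .loc[:, df.columns]            # original column order
)

# Each row should contribute (number of commas + 1) output rows.
expected_rows = int(df["category"].str.count(",").add(1).sum())
print(len(result), expected_rows)
```

If the two numbers differ on real data, the category strings or the other columns are worth inspecting before the code itself.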
zipa

You can still use explode to do this.

(
    df.assign(category=df.category.str.split(','))
    .explode('category')
)

        text            text2   category
0       sfsd sgvv       sfsdfdf abc
0       sfsd sgvv       sfsdfdf xyz
1       zydf sefs sdfsd drdg    yyy
2       dfsd dsrgd dggr dgd     xyz
3       eter vxg wfe    fs      abc
4       dfvf ertet      dggdss  abc
4       dfvf ertet      dggdss  xyz
4       dfvf ertet      dggdss  bbb
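
The question reports more rows than expected on the real data, which is not shown here. One possible cause, offered only as an assumption, is an index with repeated labels (for example after a concat or a filter). Resetting the index before exploding rules that out. The sketch below uses a hypothetical two-row frame with a duplicated index; the frame and the name `exploded` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose index labels repeat (assumption, not the asker's data).
df = pd.DataFrame(
    {"text": ["sfsd sgvv", "zydf sefs sdfsd"],
     "category": ["abc,xyz", "yyy"]},
    index=[0, 0],
)

exploded = (
    df.reset_index(drop=True)        # make index labels unique first
      .assign(category=lambda d: d["category"].str.split(","))
      .explode("category")           # one row per list element
      .reset_index(drop=True)        # clean 0..n-1 index on the result
)
print(exploded)
```

Doing the first `reset_index` costs little and means each input row is exploded exactly once, regardless of how the frame was assembled.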
Allen Qin