I have a df:

   name    sample
 1  a      Category 1: qwe, asd (line break) Category 2: sdf, erg
 2  b      Category 2: sdf, erg(line break) Category 5: zxc, eru
...
30  p      Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err 

I need to end up with:

   name    qwe   asd   sdf   erg   zxc   eru 2134  EFDgh  Pdr tke  err
 1  a       1     1     1     1    0     0    0     0       0       0
 2  b       0     0     1     1    1     1    0     0       0       0
...
30  p       0     1     0     0    0     0    0     1       1       0

I'm honestly not even sure where to begin with this one. My first thought is to split it at the line break, but I get lost after that.

M Arroyo

1 Answer

IIUC, you could use str.findall with a regex pattern that matches every word of exactly three characters; the negative lookbehind and lookahead make sure the match is not part of a longer word. Then you could join the resulting lists with str.join, get your dummies with str.get_dummies, and concatenate them to the original frame:

import pandas as pd

# raw string so that \w is passed to the regex engine unchanged
df['new'] = df['sample'].str.findall(r'(?<!\w)\w{3}(?!\w)')
df_dummies = df['new'].str.join('_').str.get_dummies(sep='_')
df = pd.concat([df, df_dummies], axis=1)

In [215]: df['new']
Out[215]:
1    [qwe, asd, sdf, erg]
2    [sdf, erg, zxc, eru]
Name: new, dtype: object

In [216]: df
Out[216]:
  name                                             sample                    new  asd  erg  eru  qwe  sdf  zxc 
1    a  Category 1: qwe, asd (line break) Category 2: ...   [qwe, asd, sdf, erg]    1    1    0    1    1    0
2    b  Category 2: sdf, erg(line break) Category 5: z...   [sdf, erg, zxc, eru]    0    1    1    0    1    1

After dropping extra columns you'll get your result:

df = df.drop(['sample', 'new'], axis=1)

In [218]: df
Out[218]:
  name  asd  erg  eru  qwe  sdf  zxc
1    a    1    1    0    1    1    0
2    b    0    1    1    0    1    1
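
If the items are not all three-character words, one possible generalization is to strip the "Category ...:" prefixes and split on commas and line breaks instead of matching by word length. This is only a minimal sketch under some assumptions: the sample frame below is made up from the rows shown in the question, a real '\n' stands in for "(line break)", every prefix has the form "Category <one word>:", and extract_items is a hypothetical helper name, not a pandas function.

```python
import re

import pandas as pd

# Hypothetical sample data modelled on the question; '\n' stands in for "(line break)".
df = pd.DataFrame(
    {
        "name": ["a", "b", "p"],
        "sample": [
            "Category 1: qwe, asd\nCategory 2: sdf, erg",
            "Category 2: sdf, erg\nCategory 5: zxc, eru",
            "Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err",
        ],
    },
    index=[1, 2, 30],
)


def extract_items(text):
    """Return the items of one cell as a list (hypothetical helper)."""
    # Turn every "Category <label>:" prefix into a plain separator.
    cleaned = re.sub(r"Category\s+\w+\s*:", ",", text)
    # Split on commas and line breaks, then drop surrounding spaces and empty pieces.
    parts = re.split(r"[,\n]", cleaned)
    return [p.strip() for p in parts if p.strip()]


items = df["sample"].apply(extract_items)
# '|' is assumed not to occur inside any item; pick another separator if it does.
dummies = items.str.join("|").str.get_dummies(sep="|")
result = pd.concat([df[["name"]], dummies], axis=1)
print(result)
```

Because the split happens on commas rather than on word boundaries, multi-word items such as "Pdr tke" stay together as a single column.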
Anton Protopopov
  • Works perfectly; however, I found that deep in the data (I'm not very familiar with it) there are entries like "Category 12: werr, nm eetgd", so simply matching 3-letter words doesn't work for all of it. Any ideas on how to generalize? – M Arroyo Apr 01 '16 at 16:25
  • @MArroyo could you add that example to your question, along with the expected output? – Anton Protopopov Apr 01 '16 at 19:38