
I have a DataFrame loaded from a .csv file (I use Python 3.5). The df['category'] column contains only strings. I want to check each value in this column and, if it contains any of a set of specific substrings (anywhere in the string), replace the whole value with a category label. I am using this script:

import pandas as pd

df=pd.read_csv('lastfile.csv')


df.dropna(inplace=True)

g='Drugs'
z='Weapons'
c='Flowers'


df.category = df.category.str.lower().apply(lambda x: g if ('mdma' or 'xanax' or 'kamagra' or 'weed' or 'tabs' or 'lsd' or 'heroin' or 'morphine' or 'hci' or 'cap' or 'mda' or 'hash' or 'kush' or 'wax'or 'klonop'or\
                                                            'dextro'or'zepam'or'amphetamine'or'ketamine'or 'speed' or 'xtc' or 'XTC' or 'SPEED' or 'crystal' or 'meth' or 'marijuana' or 'powder' or 'afghan'or'cocaine'or'haze'or'pollen'or\
                                                            'sativa'or'indica'or'valium'or'diazepam'or'tablet'or'codeine'or \
                                                            'mg' or 'dmt'or'diclazepam'or'zepam'or 'heroin' ) in x else(z if ('weapon'or'milit'or'gun'or'grenades'or'submachine'or'rifle'or'ak47')in x else c) )






print(df['category'])

My problem is that some records do not get replaced even though they contain some of the substrings I defined. Is this a regex-related problem? Thank you in advance.

Gerasimos
  • Related: [Pandas filtering for multiple substrings in series](https://stackoverflow.com/questions/48541444/pandas-filtering-for-multiple-substrings-in-series) – jpp Jan 15 '19 at 12:45

1 Answer


Create a dictionary that maps each replacement string to its list of substrings. Loop over the dictionary and join each list with | (regex OR) to build a pattern, then test the column with str.contains and set the matching rows with loc:

df = pd.DataFrame({'category':['sss mdma df','milit ss aa','aa ss']})

a = ['mdma', 'xanax' , 'kamagra']
b = ['weapon','milit','gun']

g='Drugs'
z='Weapons'

c='Flowers'

d = {g:a, z:b}

df['new_category'] = c

for k, v in d.items():
    pat = '|'.join(v)
    mask = df.category.str.contains(pat, case=False)

    df.loc[mask, 'new_category'] = k

print (df)
      category new_category
0  sss mdma df        Drugs
1  milit ss aa      Weapons
2        aa ss      Flowers
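
As for why the original lambda misses rows: it is not a regex issue. In Python, `or` returns its first truthy operand, so `('mdma' or 'xanax' or ...)` evaluates to just `'mdma'`, and the test becomes `'mdma' in x`. The following is a minimal sketch of that behavior and of a plain-Python alternative using `any()`. The helper names `contains_any` and `categorize` are hypothetical, and the substring lists are shortened for illustration.

```python
# `or` returns its first truthy operand, so the whole chain collapses to one string.
needle = ('mdma' or 'xanax' or 'kamagra')
print(needle)  # mdma

# Shortened lists for illustration; the full lists from the question would go here.
drugs = ['mdma', 'xanax', 'kamagra']
weapons = ['weapon', 'milit', 'gun']


def contains_any(text, substrings):
    """Return True if any substring occurs in text (case-insensitive)."""
    lowered = text.lower()
    return any(s in lowered for s in substrings)


def categorize(text):
    """Map a raw category string to 'Drugs', 'Weapons', or 'Flowers'."""
    if contains_any(text, drugs):
        return 'Drugs'
    if contains_any(text, weapons):
        return 'Weapons'
    return 'Flowers'


samples = ['sss mdma df', 'Xanax 2mg', 'milit ss aa', 'aa ss']
labels = [categorize(t) for t in samples]
print(labels)  # ['Drugs', 'Drugs', 'Weapons', 'Flowers']
```

With a DataFrame, the same function could be used as `df['category'].apply(categorize)`, though the `str.contains` approach above is likely faster on large columns.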
jezrael