
I have a column "methods_discussed" in a CSV file (link: https://github.com/pandas-dev/pandas/files/3496001/multiple_responses.zip) whose values are the names of family planning methods, for example:

methods_discussed

emergency
female_sterilization
male_sterilization
iud
NaN
injectables male_condoms
male_condoms
female_sterilization male_sterilization
injectables
iud male_condoms

I used df1["methods_discussed"].str.contains(pat = method), but the output does not match what I expected. Probably because male_sterilization is a substring of female_sterilization, the male_sterilization column shows TRUE for rows that only contain female_sterilization. This is shown below in the actual output at id 2: it should show FALSE, since the methods_discussed value at id 2 is female_sterilization.

I created a list of the 8 family planning methods:

method_names = ['female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization']

for method in method_names:
    df1[method]=df1["methods_discussed"].str.contains(pat = method)
df1.head(2)

Expected Output

id | methods_discussed | female_condoms | emergency | male_condoms | pill | injectables | iud | male_sterilization | female_sterilization
1 | emergency | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
2 | female_sterilization | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | ***FALSE*** | TRUE

Actual output

id | methods_discussed | female_condoms | emergency | male_condoms | pill | injectables | iud | male_sterilization | female_sterilization
1 | emergency | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
2 | female_sterilization | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | ***TRUE*** | TRUE

The code raises no error; only the output is wrong.
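
The suspected substring effect can be reproduced in isolation with a small sketch. The Series below is a made-up sample that mirrors a few values of the column; it is not read from the actual CSV, and the name `sample` is only illustrative.

```python
import pandas as pd

# Hypothetical sample mirroring a few values of methods_discussed.
sample = pd.Series(["emergency", "female_sterilization", "male_sterilization"])

# str.contains searches for the pattern anywhere in each string,
# so "male_sterilization" is also found inside "female_sterilization".
plain = sample.str.contains("male_sterilization")
print(plain.tolist())
```

The second element comes out True even though that row only mentions female_sterilization, which is the mismatch described above.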

anky
  • Please don't post links to data; instead post a sample of it as text, together with the code that is not working and the expected output. The question is unclear at the moment – anky Aug 13 '19 at 11:48
  • I will take care with posting data in future. However, I got the solution to my question from @jezrael – Ashish Bandhu Aug 14 '19 at 04:26
  • Okay, Please accept the answer by clicking on the grey tick mark to the left of the answer – anky Aug 14 '19 at 04:40

1 Answer


Use word boundaries around the pattern (a \b on each side) to avoid this. The parameter na=False is also useful for avoiding NaNs in the output; here they are replaced by False:

for method in method_names:
    df1[method]=df1["methods_discussed"].str.contains(pat = r"\b{}\b".format(method), na=False)

print (df1)
                         methods_discussed  female_condoms  emergency  \
0                                emergency           False       True   
1                     female_sterilization           False      False   
2                       male_sterilization           False      False   
3                                      iud           False      False   
4                                      NaN           False      False   
5                 injectables male_condoms           False      False   
6                             male_condoms           False      False   
7  female_sterilization male_sterilization           False      False   
8                              injectables           False      False   
9                         iud male_condoms           False      False   

   male_condoms   pill  injectables    iud  male_sterilization  \
0         False  False        False  False               False   
1         False  False        False  False               False   
2         False  False        False  False                True   
3         False  False        False   True               False   
4         False  False        False  False               False   
5          True  False         True  False               False   
6          True  False        False  False               False   
7         False  False        False  False                True   
8         False  False         True  False               False   
9          True  False        False   True               False   

   female_sterilization  
0                 False  
1                  True  
2                 False  
3                 False  
4                 False  
5                 False  
6                 False  
7                  True  
8                 False  
9                 False  
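
To see what the word-boundary pattern changes, here is a minimal self-contained sketch on made-up data (not the original CSV). Two things in it are my own assumptions rather than part of the answer above: the method name is wrapped in re.escape as a guard against names containing regex metacharacters (the names here contain none, so it changes nothing), and a str.match column is added for comparison. The boundary works because an underscore counts as a word character, so there is no \b between "fe" and "male_sterilization".

```python
import re
import pandas as pd

# Made-up sample values; NaN stands in for a missing response.
df = pd.DataFrame({"methods_discussed": [
    "female_sterilization",
    "male_sterilization",
    "female_sterilization male_sterilization",
    float("nan"),
]})

method = "male_sterilization"

# r"\b{}\b".format(method) just builds the regex string \bmale_sterilization\b;
# re.escape is an extra precaution that is a no-op for this name.
pattern = r"\b{}\b".format(re.escape(method))

col = df["methods_discussed"]
# Plain substring search: also hits "female_sterilization".
df["substring"] = col.str.contains(method, na=False)
# Whole-word search: only hits the method as a separate token.
df["whole_word"] = col.str.contains(pattern, na=False)
# str.match only tests the start of each string, so it misses a
# method that appears second in a space-separated value.
df["at_start"] = col.str.match(method, na=False)

print(df[["substring", "whole_word", "at_start"]])
```

In this sample the first row is True only in the substring column, and the third row is False in the at_start column, which suggests why neither plain str.contains nor str.match gives the expected result on its own.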
jezrael
  • I want to understand what r"\b{}\b".format(method) is doing in str.contains(pat = r"\b{}\b".format(method), na=False). Also why str.match(pat=method) is not resolving the issue. – Ashish Bandhu Aug 14 '19 at 06:57
  • @AshishBandhu - Maybe the best thing is to check [this](https://stackoverflow.com/q/1324676). – jezrael Aug 14 '19 at 06:58
  • Hi Jezrael, how can the same problem be solved in R? I tried but the result is not matching. https://stackoverflow.com/questions/69172450/how-to-correct-the-output-generated-through-str-detect-str-contains-in-r – Ashish Bandhu Sep 14 '21 at 11:20
  • @AshishBandhu - Unfortunately I am not familiar with R, so no idea :( – jezrael Sep 14 '21 at 11:21
  • Hi Jezrael, I have a single column called random_variable in a data frame containing random numbers, say from 1 to 100, but not necessarily in sequence (maybe 1, 7, 2). I want to create 15 new variables in the data set, each containing the next 7 entries from random_variable (7 is arbitrary; the example is for weekdays), except the 15th, which will contain only 2 entries (14*7 = 98, plus 2 in the last column). Anticipating your help, Ashish – Ashish Bandhu Feb 18 '22 at 16:10