
I have a CSV file with a single column, "methods_discussed" (download link: https://github.com/pandas-dev/pandas/files/3496001/multiple_responses.zip).

multi<- read.csv("multiple_responses.csv", header = T)

The column contains the names of family planning methods, for example:

methods_discussed

emergency female_sterilization male_sterilization iud NaN injectables male_condoms -77 male_condoms female_sterilization male_sterilization injectables iud male_condoms

I have created a vector of the 8 family planning methods, leaving out -77 and NaN:

method_names = c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')

I want to create a new indicator variable for each name in the vector method_names in the existing data frame multi2. For this I tried:

(I)

    for (abc in method_names) {
      multi2[abc] <- as.integer(str_detect(multi2$methods_discussed, fixed(abc)))
    }

(II)

    for (abc in method_names) {
      multi2[abc] <- as.integer(str_contains(abc, multi2$methods_discussed))
    }

(III) I also tried

    for (abc in method_names) {
      multi2[abc] <- as.integer(stri_detect_fixed(multi2$methods_discussed, abc))
    }

but the output does not match what I expect. Probably because male_sterilization is a substring of female_sterilization, the male_sterilization column shows 1 (TRUE) for female_sterilization as well. This can be seen in row 2 of the Actual Output below: it should show 0 (FALSE), since row 2 of methods_discussed contains female_sterilization. I also don't want any 0/1 (FALSE/TRUE) value generated for rows where methods_discussed is -77 or blank; those cells should stay blank. (All of these are highlighted in the Expected Output.)
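A minimal reproduction of the suspected substring problem, using base R only. The vector `x` holds made-up values standing in for the methods_discussed column; it is not taken from the actual file.

```r
# Toy values standing in for methods_discussed (assumption, not the real data).
x <- c("female_sterilization", "male_sterilization", "iud male_condoms")

# A plain substring search also finds "male_sterilization" inside
# "female_sterilization", which produces the unwanted 1 for the first value.
substring_hit <- as.integer(grepl("male_sterilization", x, fixed = TRUE))
print(substring_hit)  # 1 1 0

# Position of the match inside "female_sterilization": it starts at
# character 3, right after the leading "fe".
match_start <- as.integer(regexpr("male_sterilization", x[1], fixed = TRUE))
print(match_start)    # 3
```

The same thing would presumably happen with male_condoms and female_condoms, since one name is again contained in the other.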

Actual Output

Expected Output

The code runs without errors; only the output is wrong.

  • Why not just add a number in front of every method to create an index? It would look something like this: `method_index = c('1_female_condoms', '2_emergency', '3_male_condoms', '4_pill, 5_...')`. That way string detection may be hardcoded and more reliable. – D.J Sep 14 '21 at 06:00

1 Answer


You can add word boundaries to fix that issue.

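# Read the data and list the eight method names to flag
multi <- read.csv("multiple_responses.csv", header = TRUE)
method_names <- c('female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization')

# \\b marks a word boundary, so 'male_sterilization' no longer matches inside 'female_sterilization'
for (abc in method_names) {
  multi[abc] <- as.integer(grepl(paste0('\\b', abc, '\\b'), multi$methods_discussed))
}

# Blank out the indicators for rows where methods_discussed is empty or -77
multi[multi$methods_discussed %in% c('', -77), method_names] <- ''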
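
Since the CSV is not reproduced here, the following is a self-contained sketch of the same idea on invented data. The data frame `toy` and its values are assumptions for illustration. It also swaps in `NA` for `''` in the last step, a variation on the line above that keeps the indicator columns as integers, and it includes one comma-separated row.

```r
# Hypothetical stand-in for the CSV; all values are invented for illustration.
toy <- data.frame(
  methods_discussed = c("emergency", "female_sterilization",
                        "iud , pill", "-77", "", "male_condoms iud"),
  stringsAsFactors = FALSE
)
method_names <- c("female_condoms", "emergency", "male_condoms", "pill",
                  "injectables", "iud", "male_sterilization",
                  "female_sterilization")

# One 0/1 indicator per method; \\b keeps "male_..." from matching
# inside "female_...", and also works around commas and spaces.
for (m in method_names) {
  toy[[m]] <- as.integer(grepl(paste0("\\b", m, "\\b"), toy$methods_discussed))
}

# Rows without a usable answer get NA (assumption: NA instead of '' so the
# columns stay integer; write.csv(..., na = "") would export them as blanks).
skip <- is.na(toy$methods_discussed) | toy$methods_discussed %in% c("", "-77")
toy[skip, method_names] <- NA

print(toy[, c("methods_discussed", "male_sterilization",
              "female_sterilization", "iud", "pill")])
```

Here the female_sterilization row gets 0 for male_sterilization, the "iud , pill" row gets 1 for both iud and pill, and the "-77" and empty rows are NA in every indicator column.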
Ronak Shah
  • In this case, it generates 0 where methods_discussed is -77 or blank (NaN), which I don't want. The generated variables should be left blank for those rows. – Ashish Bandhu Sep 14 '21 at 06:49
  • Thanks! I checked the code: it works well when methods_discussed contains a single string, but not when it contains two or more, for example rows 5 and 7. Can you please refer to the code used in Python at https://stackoverflow.com/questions/57476760/how-to-correct-the-output-generated-through-str-contains-in-python – Ashish Bandhu Sep 14 '21 at 11:00
  • It is working fine, actually: the two strings in methods_discussed were separated by a , (comma) rather than a space, so it was like iud , pill instead of iud pill. How can it be solved when a comma is there instead of a space? – Ashish Bandhu Sep 14 '21 at 11:50
  • Have you tried my updated answer with `\\b` ? I think it should work with commas as well. – Ronak Shah Sep 14 '21 at 11:52
  • I have a single column called random_variable in the data frame containing random numbers, say, from 1 to 100 but may not be sequentially ( maybe 1, 7, 2). I want to create 15 new variables in the data set each one containing the first 7 (7 may be arbitrary, example is for weekdays) entries from random_variable except the 15th one which will contain only 2 entries. (14*7=98 + 2 in last column) Anticipating your help Ashish – Ashish Bandhu Feb 18 '22 at 16:13