I would like to make my code more efficient and elegant. I have a data frame (df) with a list of drugs, and I want to identify the drugs whose codes start with C09 or C10. If a person has one of these drugs, I want to give them a binary indicator (1 = yes, 0 = no). The indicator will go in a new column called "statins" in the same data frame. I used this post as a guide: What's the R equivalent of SQL's LIKE 'description%' statement?

Here is what I have done:

names<-c("tom", "mary", "mary", "john", "tom", "john", "mary", "tom", "mary", "tom", "john")
drugs<-c("C10AA05", "C09AA03", "C10AA07", "A02BC01", "C10AA05", "C09AA03", "A02BC01", "C10AA05", "C10AA07", "C07AB03", "N02AA01")
df<-data.frame(names, drugs)
df

  names   drugs
1    tom C10AA05
2   mary C09AA03
3   mary C10AA07
4   john A02BC01
5    tom C10AA05
6   john C09AA03
7   mary A02BC01
8    tom C10AA05
9   mary C10AA07
10   tom C07AB03
11  john N02AA01

ptn = '^C10.*?'
get_statin = grep(ptn, df$drugs, perl=T)
stats<-df[get_statin,]
stats

  names   drugs
1   tom C10AA05
3  mary C10AA07
5   tom C10AA05
8   tom C10AA05
9  mary C10AA07


ptn2='^C09.*?'
get_other=grep(ptn2, df$drugs, perl=T)
other<-df[get_other,]
other

  names   drugs
2  mary C09AA03
6  john C09AA03

df$statins=ifelse(df$drugs %in% stats$drugs,1,0)
df

   names   drugs statins
1    tom C10AA05       1
2   mary C09AA03       0
3   mary C10AA07       1
4   john A02BC01       0
5    tom C10AA05       1
6   john C09AA03       0
7   mary A02BC01       0
8    tom C10AA05       1
9   mary C10AA07       1
10   tom C07AB03       0
11  john N02AA01       0


df$statins=ifelse(df$drugs %in% other$drugs,1,df$statins)
df

   names   drugs statins
1    tom C10AA05       1
2   mary C09AA03       1
3   mary C10AA07       1
4   john A02BC01       0
5    tom C10AA05       1
6   john C09AA03       1
7   mary A02BC01       0
8    tom C10AA05       1
9   mary C10AA07       1
10   tom C07AB03       0
11  john N02AA01       0

So I can get what I want, but I feel there is probably a better, nicer way to do it, and I would appreciate any guidance. An obvious solution, which I can feel you all shouting at your screens, is to just use '^C' as the pattern and so catch all the drugs beginning with C. I can't do this in my main analysis, because '^C' would catch things I don't want in some instances, so I need to keep the pattern as narrow as possible.

oguz ismail
user2363642

1 Answer

Here you go:

transform(df, statins=as.numeric(grepl('^C(10|09)', drugs)))
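
For anyone who wants the one-liner unpacked, here is a sketch that rebuilds the question's data and does the same thing in two steps. `grepl` returns one logical per element, the alternation `(10|09)` inside the pattern covers both prefixes, and `as.numeric` turns TRUE/FALSE into 1/0. The names `is_statin` and `df2` are made up for illustration and are not part of the original code.

```r
# Rebuild the example data from the question
names <- c("tom", "mary", "mary", "john", "tom", "john",
           "mary", "tom", "mary", "tom", "john")
drugs <- c("C10AA05", "C09AA03", "C10AA07", "A02BC01", "C10AA05", "C09AA03",
           "A02BC01", "C10AA05", "C10AA07", "C07AB03", "N02AA01")
df <- data.frame(names, drugs)

# TRUE where the code starts with C10 or C09, FALSE otherwise
# (is_statin is a hypothetical helper name)
is_statin <- grepl("^C(10|09)", df$drugs)

# Logical -> 1/0; transform() returns a modified copy, so assign the result
df2 <- transform(df, statins = as.numeric(is_statin))
df2
```

Note that `transform` does not change `df` in place, so the result has to be assigned if you want to keep the new column.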
Matthew Plourde
  • Is there added value by using transform rather than `data.frame`? I am asking for my own understanding. – dayne Jun 28 '13 at 19:46
  • nope, I'm just in the habit of using it. You could also do the same thing with `within` (save replacing the `=` with `<-`). Of course, `transform` and `within` allow you to modify existing columns, while `data.frame` wouldn't. – Matthew Plourde Jun 28 '13 at 19:50
  • Brilliant! Thank you Matthew! Do you mind if I ask you something else please? If I wanted to combine the drugs beginning with N and A with your code, how could I do it? I tried `transform(df, statins=as.numeric(grepl('^N02.*?'|'A02B.*?', drugs)))` but it gives me an error: `Error in "^N02.*?" & "A02B.*?" : operations are possible only for numeric, logical or complex types` – user2363642 Jun 28 '13 at 19:56
  • the `|` should be inside the pattern string; drop the quotes around it, i.e. `'^N02.*?|A02B.*?'`. – Matthew Plourde Jun 28 '13 at 20:03
  • huzzah! thanks Matthew. Also, apologies; I should have checked here http://stackoverflow.com/questions/7597559/grep-in-r-with-a-list-of-patterns before asking the second question. Thank you again. – user2363642 Jun 28 '13 at 20:27
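
To put the comment thread in one place, here is a sketch of the follow-up, assuming the goal is to flag codes that start with N02 or A02B. The alternation goes inside a single pattern string, and each alternative gets its own `^` so that it only matches at the start of the code (the pattern tried in the comment left `A02B` unanchored). The column name `n_or_a`, the result names, and the small `drugs` vector are made up for illustration; the `within` form is the variant mentioned in the comments.

```r
# Small illustrative data set (hypothetical, not the full data from the question)
drugs <- c("C10AA05", "A02BC01", "N02AA01", "C07AB03")
df <- data.frame(drugs)

# One pattern string; "|" separates the alternatives, each anchored with "^"
ptn <- "^N02|^A02B"

# transform() version: new column given as name = value
res_transform <- transform(df, n_or_a = as.numeric(grepl(ptn, drugs)))

# within() version: same idea, but the column is created with `<-`
res_within <- within(df, {
  n_or_a <- as.numeric(grepl(ptn, drugs))
})

res_transform
```

Without the second `^`, a pattern such as `'^N02|A02B'` would also match codes that merely contain A02B somewhere in the middle, which may or may not matter for real drug codes.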