I am new to R, and could not find specific help for my question on this site.

I have (among others) ten character variables in my data frame grant_database, country_1 through country_10. Each contains either a country code, for example E20, F27 or G10, or an NA. Each case is a grant to a project. The ten country variables specify which country or countries a grant benefits. In my data frame, most, but not all, cases have at least one country code, starting in country_1; many have one in country_2 as well, and some even in country_3 to country_10. All empty fields are marked with an NA.

id  country_1  country_2  country_3  country_4  country_5  country_6 ...new_binaryvar
1   F20        NA         NA         NA         NA         NA           0        
2   E12        E17        E52        NA         NA         NA           0
3   O62        O33        NA         NA         NA         NA           0
4   E21        E20        NA         NA         NA         NA           1
5   NA         NA         NA         NA         NA         NA           0
...

I wish to create a new factor flagging grants which benefit a defined subset of countries. This binary "dummy" variable should be "1" for each case where at least one of the ten country variables contains a code from a given list of country codes. It should be "0" for each case/grant where none of the ten country variables contains such a code. Let the subset of country codes to be flagged be E20, F27 and G10 (in reality, about 40 of the 150+ codes are to be flagged).

Would you help me out by suggesting a way to program this? Thank you very much for your help!

veyesor
  • Please provide some example data – akrun Jan 05 '15 at 15:01
  • See [how to create a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for suggestions on improving your question. – MrFlick Jan 05 '15 at 15:14
  • @veyesor In the example showed you have `E12`, `F20`, `O62` etc, which I guess should not be counted?? – akrun Jan 05 '15 at 17:05
  • @akrun Exactly. Only `E20`, `F27` and `G10` should be "counted", i.e. lead towards a "1" on the factor `new_binaryvar`. Everything else, including `E12`, `F20`, `O62`, `NA`, etc., should lead towards a "0" on `new_binaryvar`. – veyesor Jan 05 '15 at 17:24

1 Answer

This assumes you want to check whether any of a subset of country codes appears in the "country" variables of each row: a row gets "1" if at least one of the codes is present, and "0" otherwise. The idea is to create a vector (v1) of the country codes to check. Convert the dataset (df) to a matrix after removing the "id" column (as.matrix(df[,-1])), then compare it with "v1" using %in%, which returns a plain logical vector. Turn that vector back into a matrix by assigning dimensions (dim<-) equal to those of df[,-1], i.e. c(5,7), rows first and then columns. Take the rowSums, double negate (!!) to get TRUE for any nonzero count, and finally add 0 to obtain the binary dummy variable.

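v1 <- c('E20', 'F27', 'G10')  # country codes to flag
# compare every cell with v1, restore the 5 x 7 (rows x columns) shape, flag rows with any match
(!!rowSums(`dim<-`(as.matrix(df[,-1]) %in% v1, c(5,7))))+0
#[1] 0 0 0 1 0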
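
If hardcoding the dimensions is inconvenient, one alternative is to skip the matrix step altogether. The following is only a sketch and not part of the approach above: it rebuilds the first three country columns of the example data, adds a hypothetical `amount` column, selects the country columns by name with grep, and combines the per-column `%in%` results with Reduce. The `country_` prefix pattern and the `amount` column are assumptions made for illustration.

```r
# Example data: a trimmed copy of the example df plus a hypothetical
# non-country column ('amount') to show that other variables are untouched.
df <- data.frame(
  id        = 1:5,
  country_1 = c("F20", "E12", "O62", "E21", NA),
  country_2 = c(NA, "E17", "O33", "E20", NA),
  country_3 = c(NA, "E52", NA, NA, NA),
  amount    = c(10.5, 20, 7.25, 3, 99),
  stringsAsFactors = FALSE
)

v1 <- c("E20", "F27", "G10")  # country codes to flag

# Select the country columns by name (assumes they all start with "country_").
country_cols <- grep("^country_", names(df), value = TRUE)

# One logical vector per column (TRUE where the cell is in v1),
# combined element-wise with OR so a row is TRUE if any column matched.
hit <- Reduce(`|`, lapply(df[country_cols], `%in%`, v1))

# Store the flag as a 0/1 column; all other columns keep their type.
df$new_binaryvar <- as.integer(hit)
df$new_binaryvar
#[1] 0 0 0 1 0
```

Because the flag is computed from a selection of columns and then assigned back, the data frame itself is never converted to a matrix, so character and numeric variables elsewhere in it stay as they are.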

newdata

df <- structure(list(id = 1:5, country_1 = c("F20", "E12", "O62", "E21", 
NA), country_2 = c(NA, "E17", "O33", "E20", NA), country_3 = c(NA, 
 "E52", NA, NA, NA), country_4 = c(NA, NA, NA, NA, NA), country_5 = c(NA, 
NA, NA, NA, NA), country_6 = c(NA, NA, NA, NA, NA), country_7 = c(NA, 
NA, NA, NA, NA)), .Names = c("id", "country_1", "country_2", 
"country_3", "country_4", "country_5", "country_6", "country_7"
 ), class = "data.frame", row.names = c(NA, -5L))
akrun
  • Thanks @akrun for your response, but it seems I was not specific enough in my description. What I want to do is create a binary variable based on whether each row has the values `'E20'`, `'F27'` or `'G10'` in one of the variables country_1, country_2, ..., country_9 or country_10. Thanks! – veyesor Jan 05 '15 at 16:59
  • @veyesor It would be better if you update the post with the expected result also for the five rows. – akrun Jan 05 '15 at 17:01
  • The example data was not my entire dataset which has many other variables, but only an excerpt. Can I apply your suggestion by altering `c(5,7)` to `c(861,79)`? Also, will converting my dataset `(df)` to matrix keep my other data, character and double variables, and variable names intact? – veyesor Jan 05 '15 at 17:54
  • @veyesor Here, when we compare against the countrycodes, the resulting vector is a `logical` one, which is then converted to matrix. I am not sure how this will affect your dataset. Could you be more specific? If you want to apply this to specific columns, subset the dataset and do it. Here, also I subset the `df` by removing the first column. You can also do something similar. – akrun Jan 05 '15 at 17:57
  • Thank you so much, @akrun. I tried your proposal after subsetting the dataset. This was the code I used: v1 <- c('S31', 'F41', 'F70') (!!rowSums(`dim<-`(as.matrix(subset_ccnr) %in% v1, c(10,861))))+0. This was the output I got: `[1] 1 1 1 1 1 1 1 1 1 1`. Do you have an idea what went wrong? – veyesor Jan 05 '15 at 19:15
  • @veyesor If you look at your own comment, can you see anything wrong? – akrun Jan 05 '15 at 19:16
  • Sorry, @akrun, I was having trouble with the comment function. Now I've edited the code to what I used. I had deleted the [,-1] because I had already deleted the id variable, altered the list of "countrycodes" to what I needed, altered the c(5,7) to the actual size of the matrix and deleted the dashes around `dim<-` because it gave me the following error otherwise: `Error: unexpected ',' in "(!!rowSums(dim<-(as.matrix(subset_ccnr) %in% v1,"`. Apart from this, I cannot see anything wrong. – veyesor Jan 05 '15 at 19:22
  • @veyesor Could you update it in your post rather than in comments? It is messy here. It seems to me that you got an output of all `1s`. A small reproducible example that shows the problem would be helpful. – akrun Jan 05 '15 at 19:36
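
Regarding the all-1s output reported in the comments: one likely cause (an assumption, since the real data is not shown) is the order of the dimensions. `dim<-` expects rows first and then columns, so a subset with 861 rows and 10 country columns would need c(861,10). With c(10,861), rowSums returns only 10 values, which matches the length of the reported output. Note also that the backticks around `dim<-` are required when it is called as an ordinary function. The sketch below sidesteps both points by taking the dimensions from the matrix itself; `country_sub` is a hypothetical stand-in for the country-only subset.

```r
# Hypothetical stand-in for the country-only subset (5 rows, 3 columns).
country_sub <- data.frame(
  country_1 = c("F20", "E12", "O62", "E21", NA),
  country_2 = c(NA, "E17", "O33", "E20", NA),
  country_3 = c(NA, "E52", NA, NA, NA),
  stringsAsFactors = FALSE
)
v1 <- c("E20", "F27", "G10")

m    <- as.matrix(country_sub)
hits <- m %in% v1          # plain logical vector of length nrow * ncol

# Swapped dimensions (columns first): the result has one entry per
# column of the data, not one per row.
wrong <- rowSums(`dim<-`(hits, rev(dim(m))))
length(wrong)
#[1] 3

# Dimensions taken from the matrix itself: one entry per row.
dim(hits) <- dim(m)
flag <- (rowSums(hits) > 0) + 0
flag
#[1] 0 0 0 1 0
```

Using dim(m) instead of typed-in numbers means the same lines work unchanged for a subset of any size.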