counting occurrence of strings across multiple columns in R

Question

I have a dataset in R which looks like the following (only relevant columns shown). It has sex disaggregated data on what crops respondents wanted more information about and how much of a priority this crop for them.

sex     wantcropinfo1    priority1  wantcropinfo2     priority2 
m       wheat            high       eggplant          medium
m       rice             low        cabbage           high
m       rice             high
f       eggplant         medium
f       cotton           low
...

I want to be able to (a) count the total occurrences of each crop across all the wantcropinfoX columns; and (b) get the same count but sort them by priority; and (c) do the same thing but disaggregated by sex.

(a) output should look like this:

 crop     count
 wheat    1
 eggplant 2
 rice     2
 ...

(b) output should look like this:

 crop     countm   countf
 wheat    1        0
 eggplant 1        1
 rice     2        0
 ...

(c) should look like this:

 crop     high_m   med_m  low_m  high_f   med_f   low_f
 wheat    1        0      0      0        0       0
 eggplant 0        1      0      0        1       0
 rice     1        0      1      0        0       0
 ...

I'm a bit of an R newbie and the manuals are slightly bewildering. I've googled a lot but couldn't find anything that was quite like this even though it seems like a fairly common thing one might want to do. Similar questions on stackoverflow seemed to be asking something a bit different.

One question per post would be nice. – mtoto Apr 28 '16 at 10:44 — mtoto, Apr 28 '16 at 10:44

score 2 · Accepted Answer · answered Apr 28 '16 at 10:44

2

We can use melt from data.table to convert from 'wide' to 'long' format. It can take multiple measure columns.

library(data.table)
dM <- melt(setDT(df1), measure = patterns("^want", "priority"),
       value.name = c("crop", "priority"))[crop!='']

In the 'long' format, we get the 3 expected results by either grouping by 'crop' and get the number of rows or convert to 'wide' with dcast specifying the fun.aggregate as length.

dM[,.(count= .N) , crop]
#       crop count
#1:    wheat     1
#2:     rice     2
#3: eggplant     2
#4:   cotton     1
#5:  cabbage     1

dcast(dM, crop~sex, value.var='sex', length)
#       crop f m
#1:  cabbage 0 1
#2:   cotton 1 0
#3: eggplant 1 1
#4:     rice 0 2
#5:    wheat 0 1

dcast(dM, crop~priority+sex, value.var='priority', length)
#       crop high_m low_f low_m medium_f medium_m
#1:  cabbage      1     0     0        0        0
#2:   cotton      0     1     0        0        0
#3: eggplant      0     0     0        1        1
#4:     rice      1     0     1        0        0
#5:    wheat      1     0     0        0        0

answered Apr 28 '16 at 10:44

akrun

874,273
37
540
662

Do you happen to know any good tutorials on data.tables? – Diego Aguado Apr 28 '16 at 10:50
@DiegoAgher You can look at the vignettes. It is a good source – akrun Apr 28 '16 at 10:52
1

Thanks! I'll just add it here for future references https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf – Diego Aguado Apr 28 '16 at 10:55
@akrun - many thanks - this looks perfect... But when I do this (to keep it simple): `dM <- melt(setDT(GPRFS2), measure=patterns("^want"), value.name = c("crop"))[crop!='']` it tells me `Error in eval(expr, envir, enclos) : object 'crop' not found`. Any idea what I'm missing here? I'm also assuming the `^want` will return all columns with 'want' in them? – Andre Ling Apr 28 '16 at 11:17
1

@akrun - OK I got it! it had renamed the variable to crop1 :) Not sure if this needs to be mentioned in your answer but I'm accepting your answer. Thanks again! – Andre Ling Apr 28 '16 at 12:14

score 1 · Answer 2 · answered Apr 28 '16 at 10:44

Use ddply function in the plyr package.

The structure of how you use this function is the following:

ddply(dataframe,.(var1,var2,...), summarize, function)

In this case you might want to do the follow for:

a) ddply(df,.(wantcropinfo1),summarize,count=length(wantcropinfo1))
b)ddply(df,.(wantcropinfo1,priority),summarize,count=length(wantcropinfo1))
c) ddply(df,.(wantcropinfo1,priority,sex),summarize,count=length(wantcropinfo1))

Note that the output will not have the same structure you mention in your question but the information will be the same. For the mentioned structure use the table function

Thanks for this. However, in (a) I want the totals from both columns wantcropinfo1 and wantcropinfo2 and not the count of all the different combinations of wantcropinfo1 and wantcropinfo 2 which is what I'm getting. — Andre Ling, Apr 28 '16 at 11:02

counting occurrence of strings across multiple columns in R

2 Answers2