Elementary: COUNTIF on DataFrame using SUM function

Question

After trying what was recommended previously for a similar challenge, I'm still lost and perhaps am missing something simple.

I have two data frames Uniques and Uniques2.

In Uniques, I have a column with 49,999 rows of a variable.

In Uniques2, I have separated out the unique variables and come up with a total of 403.

Now I would like to count how many times each variable in Uniques2$aa.IndustryGroup appears in a certain column in Uniques$aa.IndustryGroup. I would like it to display in a new column $Count in the Uniques2 data frame.

A previous Stack question recommended using == and SUM to find out the answer, which I thought was straightforward enough.

So I've tried this,

Uniques2$Count = data.frame(sum(Uniques$aa.IndustryGroup == Uniques2$aa.IndustryGroup))

And it returns errors about length which I know means that I'm not asking it to do what I want correctly.

Error in `$<-.data.frame`(`*tmp*`, "Count", value = list(sum.Uniques.aa.IndustryGroup....Uniques2.aa.IndustryGroup. = 138L)) : 
replacement has 1 row, data has 403
In addition: Warning messages:
1: In is.na(e1) | is.na(e2) :
longer object length is not a multiple of shorter object length
2: In `==.default`(Uniques$aa.IndustryGroup, Uniques2$aa.IndustryGroup) :
longer object length is not a multiple of shorter object length

Thanks for being a stellar community and leaving a trail of breadcrumbs. The success of this adventure would be improbable without you.

I think you need something like `Uniques2$Count = lapply(Uniques2$aa.IndustryGroup,FUN=function(i)(sum(Uniques$aa.IndustryGroup == i)))` – Batanichek, Jul 02 '15 at 13:55
You could probably use `ddply` and `summarize` on the larger data set, grouping by the variable of interest and counting its rows. Something like: `ddply(Uniques, .(aa.IndustryGroup), summarize, val = length(aa.IndustryGroup))`? – Alexey Ferapontov, Jul 02 '15 at 13:57
[How to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — C8H10N4O2, Jul 02 '15 at 15:48

score 1 · Accepted Answer · edited Jul 02 '15 at 14:29

Now I would like to count how many times each variable in Uniques2$aa.IndustryGroup appears in a certain column in Uniques$aa.IndustryGroup. I would like it to display in a new column $Count in the Uniques2 data frame.

# reproducible example!
set.seed(123)
Uniques <- data.frame(aa.IndustryGroup=sample(LETTERS,49999,replace=TRUE))
Uniques2 <- data.frame(aa.IndustryGroup=LETTERS)

Uniques2$Count <- sapply(Uniques2$aa.IndustryGroup, 
                         function(x) sum(Uniques$aa.IndustryGroup==x))

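Explanation: What you tried has two problems. First, `sum(...)` returns a single number, and wrapping it in `data.frame()` gives a one-row data frame, which cannot be assigned to the 403-row column Uniques2$Count. Second, the vector comparison v1 == v2 does not do what you want: as you know, these vectors are of different lengths, so R compares them element by element and recycles the shorter one (hence the warnings). What you are really asking is, for each element of v2, how many times it appears in v1. The apply family is a good way to do that.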
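As an alternative sketch (not part of the original answer), base R's `table()` counts every distinct value in one pass, and the counts can then be looked up by name. The block recreates the example data frames from above so it runs on its own; `counts` and `check` are names introduced here for illustration.

```r
# Same example data as in the answer
set.seed(123)
Uniques <- data.frame(aa.IndustryGroup = sample(LETTERS, 49999, replace = TRUE),
                      stringsAsFactors = FALSE)
Uniques2 <- data.frame(aa.IndustryGroup = LETTERS, stringsAsFactors = FALSE)

# Count every distinct value in a single pass
counts <- table(Uniques$aa.IndustryGroup)

# Look up each group's count by name; as.character() guards against
# indexing by factor codes if the column happens to be a factor
Uniques2$Count <- as.integer(counts[as.character(Uniques2$aa.IndustryGroup)])

# Cross-check against the sapply approach from the answer
check <- sapply(Uniques2$aa.IndustryGroup,
                function(x) sum(Uniques$aa.IndustryGroup == x))
stopifnot(all(Uniques2$Count == check))
```

One difference to be aware of: a value that is in Uniques2 but never appears in Uniques comes back as `NA` from the name lookup, whereas the `sapply` version gives 0.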

edited Jul 02 '15 at 14:29 · Hong Ooi

answered Jul 02 '15 at 13:51 · C8H10N4O2

This is the sample result: `aa.IndustryGroup....LETTERS Count`, `1 A 1939`, `2 B 1912`, `3 C 1947`. The problem is that I need to preserve the variables, and I need a count for each unique value, where some appear 5 times and some 1,200. I do like your explanation of what I was screwing up, thanks for that, but the implemented code doesn't give the read I need. – Christopher Hastings Jul 02 '15 at 14:58
Would you please edit your question to reflect what you mean by "preserve the variables"? I believe I answered the question as written. For example, 1939 is the number of times "A" appears in `Uniques$aa.IndustryGroup`. – C8H10N4O2 Jul 02 '15 at 15:06
Let's say one of the variables is Barbers. Rather than having it change to a letter, I need it to continue to say Barbers, since that is what it is looking for. When I run a similar check in Excel, I have 403 different values, with a range of 1 to 1782. When I run your example, I have 26 different values with a range of 1865 to 2017. I may also have just not understood how to implement your code recommendation. – Christopher Hastings Jul 02 '15 at 15:33
It would continue to say Barbers. Since you did not provide example data sets, I created them. In mine the industry groups are represented by the letters A-Z. They are repeated in table 1, unique in table 2. In yours it could be Barbers, Tailors, whatever. You do not need to create my example data set. You just need to use the `sapply` function as I wrote it for you but on *your* data set. The example data set is just to show that it works. – C8H10N4O2 Jul 02 '15 at 15:41
`Uniques$Establishment_Count <- sapply(Uniques$aa.IndustryGroup, function(x) sum(aa$IndustryGroup==x))` gives ``Error in `$<-.data.frame`(`*tmp*`, "Establishment_Count", value = list()) : replacement has 0 rows, data has 403``. I think it may also be an issue that it is a data frame, and elsewhere I ran into trouble because it was a data frame and not a list. I've been reading about plyr, apply and sapply, watched the related videos from Coursera and still can't make this work. – Christopher Hastings Jul 02 '15 at 19:27
There is at least one error in that, since `aa$IndustryGroup` needs a data.frame `aa` to work – C8H10N4O2 Jul 02 '15 at 19:34
This exists. Has 49,999 rows with 403 unique values. – Christopher Hastings Jul 02 '15 at 19:35
You said your data.frames were called `Uniques` and `Uniques2` – C8H10N4O2 Jul 02 '15 at 19:39
Sorry, the data started with aa$IndustryGroup and then I used Uniques to create the 403, and removed Uniques2 entirely. Both Uniques and aa are data frames. Trying to simplify the steps to eliminate errors. – Christopher Hastings Jul 02 '15 at 19:46
Check out @Batanichek 's answer above. I ended up getting that one to work. Similar, if not almost identical. Thanks for your long winded perseverance. – Christopher Hastings Jul 05 '15 at 14:34
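
To illustrate the point made in the comments that the group labels are preserved, here is a small sketch with made-up industry names. The data is hypothetical, not the asker's, and `vapply` is swapped in for `sapply` only because it checks that each result is a single integer.

```r
# Hypothetical data standing in for the asker's two data frames
Uniques <- data.frame(
  aa.IndustryGroup = c("Barbers", "Tailors", "Barbers",
                       "Bakers", "Barbers", "Tailors"),
  stringsAsFactors = FALSE
)
Uniques2 <- data.frame(
  aa.IndustryGroup = unique(Uniques$aa.IndustryGroup),
  stringsAsFactors = FALSE
)

# For each unique group, count the matching rows in the full data
Uniques2$Count <- vapply(
  Uniques2$aa.IndustryGroup,
  function(x) sum(Uniques$aa.IndustryGroup == x),
  integer(1),
  USE.NAMES = FALSE
)

print(Uniques2)
# Barbers has Count 3, Tailors 2, Bakers 1
```

The labels in `Uniques2$aa.IndustryGroup` are left untouched; only the new `Count` column is added.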
