
Suppose I have a data frame as:

id   value
1    "hi"
1    "hi"
1    "hi again"
1    "hi again"
2    "hello"
2    "hi"

Now I want to get the count of each value for each of the distinct values in the id column. The output would look like:

id    value       Freq
1     "hi"        2
1     "hi again"  2
2     "hello"     1
2     "hi"        1   

I tried splitting the first data frame by each distinct id, getting the frequency with the table() function on the value column, and appending the id column afterwards. But that leaves me with a lot of data frames in memory. I just want to know if I can get the above data frame without chewing up my memory with lots of intermediate data frames (I have almost 5 million rows).

Shiva
  • `as.data.frame(table(DF))` Use `table` on both columns. Alternately, use `data.table` (which will prove more efficient) as below. – Frank Jun 19 '15 at 15:16

1 Answer


Assuming your data.frame is called df, using data.table:

library(data.table)
setDT(df)[ , .(Freq = .N), by = .(id, value)]

Using dplyr:

library(dplyr)
group_by(df, id, value) %>% summarise(Freq = n())
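For completeness, the same counts can also be had in base R with table(), as Frank suggests in the comments; no extra packages needed, though it will be slower than data.table on 5 million rows. A minimal sketch, assuming the data frame is called df:

```r
# Base R: cross-tabulate id and value, then drop the zero-count combinations
freq <- as.data.frame(table(id = df$id, value = df$value))
names(freq) <- c("id", "value", "Freq")
freq <- freq[freq$Freq > 0, ]
```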

You should choose one of those two packages (dplyr or data.table) and learn it thoroughly. In the long run you will likely use both, but starting with one and really understanding it will help you tremendously. I use both pretty much every time I use R.

dplyr tends to be easier for beginners, so I would start by reading a tutorial on it. This will help you forever. There is also a great video tutorial, which can be found on this site under The grammar and graphics of data science.

I personally prefer data.table because it is faster and more flexible. Check out the new HTML vignettes and the PDF vignettes here.

Arun
grrgrrbla
  • Thank you @grrgrrbla for the great explanation and the resources. I tried using dplyr and somehow hadn't achieved the result I needed. – Shiva Jun 19 '15 at 15:22
  • glad it helped, please accept the answer if it helped by clicking the arrow and upvote it – grrgrrbla Jun 19 '15 at 15:26
  • By the way, these two approaches run at the same speed if you use dplyr syntax on a data.table. (I just tried it on my computer.) `DT <- data.table(id=1:1e6)[,.(value=sample(letters,sample(5,1))),by=id]; DF <- setDF(copy(DT)); system.time(group_by(DT, id, value) %>% mutate(Freq = n())); system.time(DT[ , .(Freq = .N), by = .(id, value)])` Working with a data.frame is 5x slower, though: `system.time(group_by(DF, id, value) %>% mutate(Freq = n()))` – Frank Jun 19 '15 at 15:30
  • interesting and strange at the same time – grrgrrbla Jun 19 '15 at 15:31
  • By "dplyr syntax with a data.table", I mean `group_by(DT,...)` where DT is a data.table. The author of dplyr, hadley, likes to point out that dplyr can be fast when using data.table as a "backend" in this way. – Frank Jun 19 '15 at 15:32
  • I didn't know that. Do you know why? And where can I find more information on this? – grrgrrbla Jun 19 '15 at 15:33
  • Sure, here's where I saw him mention it: http://stackoverflow.com/a/27840349/1191259 The other answers there are also worth reading. – Frank Jun 19 '15 at 15:35