0

I have two data sets. One has 2 million cases (individual donations to various causes), the other has about 38,000 (all zip codes in the U.S.).

I want to sort through the first data set and tally up the total number of donations by zip code. (Additionally, the total for each zip code will be broken down by cause.) Each case in the first data set includes the zip code of the corresponding donation and information about what kind of cause it went to.

Is there an efficient way to do this? The only approach that I (very much a novice) can think of is to use a for ... if loop to go through each case and count them up one by one. This seems like it would be really slow, though, for data sets of this size.

edit: thanks, @josilber. This gets me a step closer to what I'm looking for.

One more question, though. table seems to generate frequencies, correct? What if I'm actually looking for the sum for each cause by zip code? For example, if the data frame looks like this:

dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
        cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE), 
        amt = sample(250:2500, 2000000, replace=TRUE))

Suppose that instead of frequencies, I want to end up with output that looks like this?

#         Cause 1(amt) Cause 2(amt) Cause 3(amt)
#   Zip 1  (sum)        (sum)       (sum)
#   Zip 2  (sum)        (sum)       (sum)
#   Zip 3  (sum)        (sum)       (sum)
#   etc.    ...          ...         ...

Does that make sense?

  • 2
    welcome to stackoverflow, it would help to have some sample input data and some sample output data, see here: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example ; besides that I would look into the packages data.table (see here for an intro: http://blog.yhathq.com/posts/fast-summary-statistics-with-data-dot-table.html) and or dplyr (still fast but not quite as fast as data.table but easier to use) see here for an intro: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html ; work hard to understand one of them and it will help you forever!!! – grrgrrbla Jun 05 '15 at 19:04

2 Answers2

0

Sure, you can accomplish what you're looking for with the table command in R. First, let's start with a reproducible example (I'll create an example with 2 million cases, 3 zip codes, and 3 causes; I know you have more zip codes and more causes but that won't cause the code to take too much longer to run):

# Data
set.seed(144)
dat <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
                  cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE))

Please note that it's a good idea to include a reproducible example with all your questions on Stack Overflow because it helps make sure we understand what you're asking! Basically you should include a sample dataset (like the one I've just included) along with your desired output for that dataset.

Now you can use the table function to count the number of donations in each zip code, broken down by cause:

table(dat$zip, dat$cause)
#         Cause 1 Cause 2 Cause 3
#   Zip 1  222276  222004  222744
#   Zip 2  222068  222791  222363
#   Zip 3  221015  221930  222809

This took about 0.3 seconds on my computer.

josliber
  • 43,891
  • 12
  • 98
  • 133
  • Thanks very much for your help! :) One more question, though, if you've got the time, edited back into my original post. – Richard S. Jun 07 '15 at 15:28
0

could this work?-

aggregate(amt~cause+zip,data=dat3,FUN=sum)


    cause   zip       amt
1 Cause 1 Zip 1 306231179
2 Cause 2 Zip 1 306600943
3 Cause 3 Zip 1 305964165
4 Cause 1 Zip 2 305788668
5 Cause 2 Zip 2 306306940
6 Cause 3 Zip 2 305559305
7 Cause 1 Zip 3 304898918
8 Cause 2 Zip 3 304281568
9 Cause 3 Zip 3 303939326
Zahiro Mor
  • 1,708
  • 1
  • 16
  • 30