3

I am new to R and trying to learn on my own. I have data in csv format with 1,048,575 rows and 73 columns. I am looking at three columns - year, country, aid_amount. I want to get the sum of aid_amount by country for i) all years, and ii) for years 1991-2010. I tried the following to get for all years BUT the result I get is different from when I sort/sum in Excel. What is wrong here. Also, what change should I make for ii) years 1991-2010. Thanks.

aiddata <- read.csv("aiddata_research.csv")
sum_by_country <- tapply(aiddata$aid_amount, aiddata$country, sum, na.rm=TRUE) # There are missing data on aid_amount
write.csv(sum_by_country, "sum_by_country.csv")

I have also tried:

sum_by_country <- aggregate(aid_amount ~ country, data = aiddata, sum) instead of tapply.

The first few rows for a few columns look like this:

aiddata_id  year    country                  aid_amount
23229017    2004    Bangladesh               685899.2666
14582630    2000    Bilateral, unspecified   15772.77174
28085216    2006    Bilateral, unspecified   38926.82898
28702455    2006    Bilateral, unspecified   12633.85659
29928104    2006    Cambodia                 955412.9884
27783934    2006    Cambodia                 11773.77268
37418683    2008    Guatemala                40150.7331
94726192    2010    Guatemala                151206.3096
SiriN
  • 33
  • 1
  • 5
  • 1
    You have to also include `year` as the grouping variable for the second question. Can you show few lines of your data (with few columns including the `year`)? Regarding the code, it looks okay for the first part. If you have included some reproducible example, it would have been great. – akrun Dec 23 '14 at 17:04
  • As @Akrun said, code looks ok. Does data look ok too? This is when that part of a "reproducible example" becomes most valuable... – PavoDive Dec 23 '14 at 20:27
  • Thanks @akrun and PavoDive. I have added a few rows of data above. The original dataset could be downloaded at http://aiddata.org/aiddata-research-releases. It is the first bullet: • AidData 2.1 "Full" version – SiriN Dec 24 '14 at 17:37
  • @PavoDive Please see the sample data above and original data in my comment. Thanks. – SiriN Dec 24 '14 at 17:38
  • Here is what I did in Excel. Filter "country"; Sum "aid_amount" for one particular country I am interested in. However, now I realize that Excel gave me an error message "File not loaded completely" and "Excel cannot exceed the limit of 1,048,576 rows and 16,384 columns". I guess there is more data than Excel could handle. However, the sum I got from R was lower than I got from Excel. Therefore, I assumed a problem in my R code. – SiriN Dec 24 '14 at 17:42
  • @SiriN Have you crosschecked the results on a very small dataset? – akrun Dec 24 '14 at 17:43
  • 1
    @akrun, I tried it on a smaller dataset and the results seem to be pretty close (last few positions differ - may be due to rounding differences). I am still not able to figure out the sum by country for YEAR 1991-2010 part. I want a sum for all those years, not by each year. – SiriN Dec 24 '14 at 18:33

2 Answers2

5

You could use data.table for the big dataset. If you want to get the sum of aid_amount for each country by year

library(data.table)
setkey(setDT(aiddata), country,year)[, 
         list(aid_amount=sum(aid_amount)), by=list(country, year)]

To get the sum of aid_amount for each country

setkey(setDT(aiddata), country)[, 
          list(aid_amount=sum(aid_amount)), by=list(country)]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks. This is really helpful. Above code will give me the sum by country by year. I want it for a range of years, not by year. Sorry if that was not clear. – SiriN Dec 24 '14 at 19:08
  • @SiriN The second one should give you that. – akrun Dec 24 '14 at 19:09
0
yy=aggregate(df$Column1,by=list(df$Column2),FUN=mean)

Column 2- Categories on which you want to sum. If you want to know the maximum value(sum) among all categories? Use the below code:

which.max(yy$x)
ouflak
  • 2,458
  • 10
  • 44
  • 49