
I am using a Kaggle dataset. Because the dataset is large, it is hard to include dput output. I am trying to sum yearly food production by region, and I am using aggregate for that purpose. For some reason it raises the error below:

Aggregation R code:

years<-colnames(p[,11:63])
agg<-aggregate(years~area, data=p, sum)

Error:

Error in model.frame.default(formula = years ~ area, data = p) : 
  variable lengths differ (found for 'area')

I tried below link, but it seems not to be very useful to me:

Not very useful link

Note: The dataset contained NA values. They were removed using the na.omit function.

Update after bk18 comment

> p[, lapply(.SD, class)]
   area_abb area_code   area item_code   item element_code element   Unit latitude longitude
1:   factor   integer factor   integer factor      integer  factor factor  numeric   numeric
     Y1961   Y1962   Y1963   Y1964   Y1965   Y1966   Y1967   Y1968   Y1969   Y1970   Y1971
1: integer integer integer integer integer integer integer integer integer integer integer
     Y1972   Y1973   Y1974   Y1975   Y1976   Y1977   Y1978   Y1979   Y1980   Y1981   Y1982
1: integer integer integer integer integer integer integer integer integer integer integer
     Y1983   Y1984   Y1985   Y1986   Y1987   Y1988   Y1989   Y1990   Y1991   Y1992   Y1993
1: integer integer integer integer integer integer integer integer integer integer integer
     Y1994   Y1995   Y1996   Y1997   Y1998   Y1999   Y2000   Y2001   Y2002   Y2003   Y2004
1: integer integer integer integer integer integer integer integer integer integer integer
     Y2005   Y2006   Y2007   Y2008   Y2009   Y2010   Y2011   Y2012   Y2013
1: integer integer integer integer integer integer integer integer integer

[Screenshot: output needed]

Any help is appreciated!

Thanks in advance,

Data_is_Power

1 Answer


Not sure what's going on with the different lengths, but you can try a different solution using data.table to see if the error is reproducible:

library(data.table)
setDT(mydata)
mydata[, sum(p, na.rm = T), .(years, area)]

See if this achieves the result you're after.
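For what it's worth, one plausible reading of the original error is that `years` in the question is a character vector of column names (length 53), not a column of `p`, so the formula compares it against the 17938 rows of `area`. The sketch below reproduces that on a tiny made-up data frame (`toy` and its values are hypothetical, not from the Kaggle data) and shows one base R way to pass the year columns themselves instead.

```r
# Toy stand-in for the real table (hypothetical values)
toy <- data.frame(
  area  = c("A", "A", "B"),
  Y1961 = c(1L, 2L, 3L),
  Y1962 = c(10L, 20L, 30L)
)

# A character vector of column NAMES (length 2), as in the question
years <- c("Y1961", "Y1962")

# aggregate(years ~ area, data = toy, sum) fails here too: `years` is not a
# column of `toy`, so R finds the length-2 vector above and compares it with
# the 3 rows of `area` ("variable lengths differ").

# Passing the columns themselves, grouped by area, works:
agg <- aggregate(toy[years], by = list(area = toy$area), FUN = sum)
print(agg)
```

This assumes a plain data.frame; column selection with `toy[years]` may behave differently once the table has been converted with setDT().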

UPDATE:

Assuming your data is in the format:

year  area  value
...   ...   ...

In other words, if it has been melted so that it's "long" over the years, you should just have to do:

p[, area := as.character(area)]
p[, sum(value, na.rm = T), .(year, area)]

If it's not melted first, then melt it with melt() to get it in long form where the columns match what I've written above.
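As a rough sketch of that melt step, assuming the year columns are named like Y1961 as in the class listing above (the table `wide` and its values are made up for illustration):

```r
library(data.table)

# Toy wide table (hypothetical values), one column per year
wide <- data.table(
  area  = c("A", "A", "B"),
  item  = c("wheat", "rice", "wheat"),
  Y1961 = c(1L, 2L, 3L),
  Y1962 = c(10L, NA, 30L)
)

# Year columns: names made of "Y" followed by four digits
year_cols <- grep("^Y[0-9]{4}$", names(wide), value = TRUE)

# Reshape to long form: one row per (area, item, year)
long <- melt(
  wide,
  id.vars       = c("area", "item"),
  measure.vars  = year_cols,
  variable.name = "year",
  value.name    = "value"
)

# Sum by year and area, ignoring missing values
totals <- long[, .(total = sum(value, na.rm = TRUE)), by = .(year, area)]
print(totals)
```

The `year` column comes out as a factor holding the original column names (for example "Y1961"); converting it to a number is left out here.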

If, however, you'd like to keep things wide as in the screenshot you posted, just use lapply:

p[, area := as.character(area)]
p[, lapply(.SD, sum, na.rm = T), area, .SDcols = colnames(p)[grep("Y", colnames(p))]]

What you're doing here is applying sum() to each column (that's the lapply(.SD, sum, na.rm = T) bit). Then, you're doing it by area (that's the third argument). The .SD piece (controlled by .SDcols) lets you subset the table you're working on, so that you sum only the columns listed in .SDcols. We define those columns with a simple grep statement that finds column names which include the letter "Y": in your case, the year columns.
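To see the pieces in isolation, here is a small sketch on made-up data (`p_toy` and its values are hypothetical). It uses a slightly stricter pattern than the one above, "Y" followed by four digits, which is an assumption about how the year columns are named:

```r
library(data.table)

# Toy wide table (hypothetical values)
p_toy <- data.table(
  area  = c("A", "A", "B"),
  Y1961 = c(1L, 2L, 3L),
  Y1962 = c(10L, NA, 30L)
)

# The columns that .SD should contain
year_cols <- grep("^Y[0-9]{4}$", names(p_toy), value = TRUE)

# For each area, .SD holds only year_cols; sum() is applied to each of them
by_area <- p_toy[, lapply(.SD, sum, na.rm = TRUE), by = area, .SDcols = year_cols]
print(by_area)
```

The result has one row per area and one column per year, which matches the wide layout described in the question.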

C-x C-c
  • > library(data.table) > setDT(p) > p[, sum(p, na.rm = T), .(years, area)] Error in `[.data.table`(p, , sum(p, na.rm = T), .(years, area)) : The items in the 'by' or 'keyby' list are length (53,17938). Each must be same length as rows in x or number of rows returned by i (17938). – Data_is_Power Jun 05 '18 at 18:13
  • Thanks for reply. I used above code and it produced above commented error – Data_is_Power Jun 05 '18 at 18:13
  • Interesting, could you run `p[, lapply(.SD, class)]`. I bet one of them is a list, or a factor or something. – C-x C-c Jun 05 '18 at 18:15
  • looks like my area is factor and my years are integer. Please see above update – Data_is_Power Jun 05 '18 at 18:19
  • Okay, your data is in a different format than I thought. That's the issue, so bear with me for a second while I revise my answer. – C-x C-c Jun 05 '18 at 18:21
  • Thanks for your time and effort! – Data_is_Power Jun 05 '18 at 18:23
  • I am taking the sum of each year column (e.g. 1961) by area or country. There are 253 countries in the dataset, and I would like to display something like the above screenshot – Data_is_Power Jun 05 '18 at 18:29
  • See the last part of my answer. – C-x C-c Jun 05 '18 at 18:36
  • Thanks, worked great. Would it be possible for you to explain the last part of the code? Also, in my output I noticed that it displays only the first and last 5-6 areas. Does that have anything to do with the code? Sorry if it's a beginner question, but I just started exploring R – Data_is_Power Jun 05 '18 at 18:46
  • To address the latter question first, data.table abbreviates the output automatically, to make it easier to read. This is really useful with huge datasets that might take an eternity to print otherwise. As for the first question, I'll update my answer. – C-x C-c Jun 05 '18 at 19:07
  • Great. Thanks! Now it's very clear to me! Thank you again for your time and effort! – Data_is_Power Jun 05 '18 at 19:21
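
On the abbreviated printing mentioned in the comments: if all rows are wanted, the print method for data.table takes `nrows` and `topn` arguments. A minimal sketch, using a made-up 253-row table (`big` is hypothetical) to stand in for the per-area result:

```r
library(data.table)

# Hypothetical result with one row per area
big <- data.table(area = sprintf("area_%03d", 1:253), Y1961 = 1:253)

# Default printing shows only the first and last few rows of a long table
print(big)

# Raise the row limit for this one call to print every row
print(big, nrows = Inf)

# Or keep the abbreviation but show more rows at each end
print(big, topn = 10)
```

The result can also be inspected with View(big) in an interactive session; none of this changes the data itself.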