Efficient conditional summing by multiple conditions in R

Question

I'm struggling with finding an efficient solution for the following problem:

I have a large manipulated data frame with around 8 columns and 80000 rows that generally includes multiple data types. I want to create a new data frame that includes the sum of one column if conditions from the large data frame are met.

Imagine the head of the original data frame looks like this. The column $years.raw indicates that the company measured data for x years.

> cbind(company.raw,years.raw,source,amount.inkg)
     company.raw years.raw source      amount.inkg
[1,] "C1"        "1"       "Ink"       "5"        
[2,] "C1"        "1"       "Recycling" "2"        
[3,] "C2"        "1"       "Coffee"    "10"       
[4,] "C2"        "1"       "Combusted" "15"       
[5,] "C2"        "2"       "Printer"   "14"       
[6,] "C2"        "2"       "Tea"       "18"

What I need to do now is to create a new data frame that sums up the values of column $amount.inkg for every company and every year based on certain string elements. I saved the string elements in three vectors below. The string elements are part of the column $source from the original data frame.

> vector1 <- c("Tea","Coffee")
> vector2 <- c("Ink","Printer")
> vector3 <- c("Recycling","Combusted")

The preferred data frame would then look like this:

Company Year              amount.vector1    amount.vector 2 amount.vector 3
C1           1                 0             5                 2
C2           1                 10            0                15        
C2           2                 18            14                0

The general approach for $amount.vector1 would be: Sum up the values of column $amount.inkg for every company and every year where string elements of the original data frame column $source== string elements of vector1. The same for column $amount.vector2 except that the elements are different of course.

If there are no values available, a "0" should be added instead a NA error. This needs to be done for the whole raw data frame that includes around 250 companies with data for every company for the years 1:8 (differs quite a lot).

Edit: For the data frame I need one-row-per-company-per-year.

C1 Year 1  
C1 Year 2
C1 Year 3
C2 Year 1
C2 Year 2

I tried to write a function that combines these conditions but I failed. I'm quite new to R and didn't know how to link these conditions and apply them on the whole data frame.

Please post a small [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) if you want executable answers. Use `dput` on the dataframe and chop it down to small size, etc. — smci, Mar 10 '14 at 02:44
Other example of [**aggregate with conditionals**](http://stackoverflow.com/questions/14650370/exclude-row-with-state-having-less-than-some-value-on-aggregation) — smci, Mar 10 '14 at 03:02
You didn't need to create your intermediate `amount.vector*` dataframe, that's just an `aggregate/ddply(summarize)` operation on the original. — smci, Mar 10 '14 at 03:14
In your title, these are not 'cross references'. It's simply an aggregate over multiple columns (company, year) with conditionals on source. That's all. You don't need to manually aggregate first by year for each company, save that intermediate result, then aggregate by company with a conditional on source. — smci, Mar 10 '14 at 03:23

smci · Accepted Answer · 2014-03-12T12:01:18.810

2

Your data is in 'long form' (multiple rows of company, source, year, ...)

You want to aggregate amount.inkg over each company and year, for multiple values of source. Specifically you want to aggregate with conditionals on 'source' field.

Again, please give us reproducible example. (Thanks josilber). This is a four-liner with either Split-Apply-Combine(ddply) or logical indexing:

df = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                years.raw = c(1, 1, 1, 1, 2, 2),
                source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                amount.inkg = c(5, 2, 10, 15, 14, 18))

# OPTION 1. Split-Apply-Combine: ddply(...summarize) with a conditional on the data
require(plyr) # dplyr if performance on large d.f. becomes an issue
ddply(df, .(company.raw,years.raw), summarize,
    amount.vector1=sum(amount.inkg[source %in% c('Tea','Coffee')]),
    amount.vector2=sum(amount.inkg[source %in% c('Ink','Printer')]),
    amount.vector3=sum(amount.inkg[source %in% c('Recycling','Combusted')])
)


# OPTION 2. sum with logical indexing on the df:
# (This is from before you modified the question to one-row-per-company-and-per-year)
df$amount.vector1 <- sum( df[(df$source %in% c('Tea','Coffee')),]$amount.inkg )
# josilber clarifies you want one-row-per-company
...

Option 3. You could also use aggregate(manpage here) with subset(...), although aggregate for a sum is overkill.

aggregate(df, source %in% c('Tea','Coffee'), FUN = sum)

The by argument to aggregate is where the action is (selecting, subsetting by criteria).

Note: %in% performs a scan operation, so if your vector and d.f. get large, or for scalability, you'd need to break it into boolean operations which can be vectorized: (source=='Tea' | source=='Coffee')

As to preventing NA sums if the subset was empty, sum(c()) = 0 so don't worry about that. But if you do, either use na.omit, or do ifelse(is.na(x),0,x) on the final result.

edited Mar 12 '14 at 12:01

answered Mar 10 '14 at 02:54

smci

32,567
20
113
146

None of these options appear to be returning a data frame with one row for each company/year pair. I believe the OP is looking for one row for each pair (aka split on a combination of `company.raw` and `years.raw`). – josliber Mar 10 '14 at 13:19
@josilber: he never said anything like that. All he said was 'create a new data frame that sums up the values of column $amount.inkg for every company and every year based on certain string elements.' So both our solutions match his criteria. – smci Mar 10 '14 at 17:01
I was looking at the part of the question where it says "The preferred data frame would look like this" -- the example has a row for each company/year pair. Later, "It is quite important for me that the final data frame only has one row for every year per company" – josliber Mar 10 '14 at 17:03
@josilber: my # OPTION 1. ddply code does precisely that. One output line for totals, with three columns. Do you mean OP wants one line for every company? That's easily done with `ddply(df, .(Company) ...` – smci Mar 10 '14 at 17:08
Yeah, the OP wants a line for every company/year pair. I recognize this is a simple modification of your `ddply` call, and I was suggesting you edit your answer with this modification. – josliber Mar 10 '14 at 17:10
Ok, done, but OP also needs to edit their question for clarity. – smci Mar 10 '14 at 17:13
1

@smci Thanks for editing, correcting and answering my questions. I'm sorry that I didn't make it clear enough. I used your Options 1 and 2 to work on my problem. Great valuable input for my R learning and solving this problem. I haven't looked into aggregate yet, but it's on my to do list. Thanks for your time and help on this topic. I will edit the question now and try to make it clear. – Joe K. Mar 12 '14 at 00:24

score 1 · Answer 2 · answered Mar 10 '14 at 03:20

This is a good task for the split-apply-combine paradigm. First, you split your data frame by company/year pair:

data = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                  years.raw = c(1, 1, 1, 1, 2, 2),
                  source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                  amount.inkg = c(5, 2, 10, 15, 14, 18))
spl = split(data, paste(data$company.raw, data$years.raw))

Now, you compute the rolled-up data frame for each element in the split-up data:

spl2 = lapply(spl, function(x) {
  data.frame(Company=x$company.raw[1],
             Year=x$years.raw[1],
             amount.vector1 = sum(x$amount.inkg[x$source %in% vector1]),
             amount.vector2 = sum(x$amount.inkg[x$source %in% vector2]),
             amount.vector3 = sum(x$amount.inkg[x$source %in% vector3]))
})

And finally, combine everything together:

do.call(rbind, spl2)
#      Company Year amount.vector1 amount.vector2 amount.vector3
# C1 1      C1    1              0              5              2
# C2 1      C2    1             10              0             15
# C2 2      C2    2             18             14              0

Efficient conditional summing by multiple conditions in R

2 Answers2

Linked

Related