R - Comparing differently sized data frames

Question

I have two data frames of volume-by-date. They're both the same data, but one is filtered. I'd like to plot a trendline of the ratio between filtered data and non-filtered data on any given day—but am having a hugely hard time molding the data frames so that they're comparable. Here's an example:

unFiltered <- data.frame(date = c("01-01-2015", "01-01-2015", "01-02-2015"), item = c("item1", "item2", "item1"), volume = c(100, 100, 50))

filtered <- data.frame(date = c("01-01-2015", "01-03-2015"), item = c("item1", "item1"), volume = c(10, 40))

From these data sets, I'd like to construct a third data set that is "The percentage of unfiltered item-volume that is being filtered". That is, I want a data frame that will look like this:

    date          item    percentage
1 "01-01-2015"    item1   .1
2 "01-01-2015"    item2    0
3 "01-02-2015"    item1    0
4 "01-02-2015"    item2    0
5 "01-03-2015"    item1   .8
6 "01-03-2015"    item2    0

(Note: Neither data frame has 6 entries—but the resulting data frame has unique values of item and unique values of date.)

Anyone have any ideas? I've been stuck on this for ~2 hours, fumbling around with for loops, merging, joins, manually creating data frames, etc. If anyone has a solution, would you mind explaining what's going on in said solution, too? (I still kind of suck at R, and often times I read code that someone writes without having any idea why it actually works).

See `merge` to get your data frames together. Then calculate your percentage. — Gregor Thomas, Feb 02 '16 at 20:42
@Gregor I've tried ``merge`` to no avail, though it might be that I don't actually know how to utilize it. Can you be more specific? — AmagicalFishy, Feb 02 '16 at 20:45
Try `names(unFiltered)[3]="unf_vol"; names(filtered)[3]="f_vol"; merge(unFiltered,filtered,all=T)` — slamballais, Feb 02 '16 at 20:51

Gregor Thomas · Accepted Answer · 2016-02-02T22:27:37.077

By default, merge will only keep rows that are in both data frames, so we set all.x = T to make sure it keep all rows from the x data frame. It will also, by default, try to match all columns with the same name; since we don't want to match the volume column, we specify the columns that we do want to match in the by argument:

both = merge(x = unFiltered, y = filtered,
             all.x = TRUE, by = c("date", "item"))

This gives us variants of the volume column from each source. (We also could have renamed the volume columns in the original data frames to get the same result, as in Laterow's comment.)

both  # just checking out what's there
#         date  item volume.x volume.y
# 1 01-01-2015 item1      100       10
# 2 01-01-2015 item2      100       NA
# 3 01-02-2015 item1       50       NA

# fill in missing values with 0
both$volume.y[is.na(both$volume.y)] = 0

# calculate the percentage
both$percentage = both$volume.y / both$volume.x

both  # demonstrate the result
#         date  item volume.x volume.y percentage
# 1 01-01-2015 item1      100       10        0.1
# 2 01-01-2015 item2      100        0        0.0
# 3 01-02-2015 item1       50        0        0.0

# drop unwanted columns
both = both[c("date", "item", "percentage")]

I commented and showed results above, but I want to make sure it's clear how simple this is. The only commands that need running are:

both = merge(x = unFiltered, y = filtered,
             all.x = TRUE, by = c("date", "item"))
both$volume.y[is.na(both$volume.y)] = 0
both$percentage = both$volume.y / both$volume.x
both = both[c("date", "item", "percentage")]

Some people (like me!) find dplyr more readable. Here is a dplyr version of the same:

library(dplyr)
unFiltered %>%
    rename(all_volume = volume) %>%
    left_join(filtered) %>%
    mutate(volume = ifelse(is.na(volume), 0, volume),
           percentage = volume / all_volume) %>%
    select(-all_volume, -volume)

#         date  item percentage
# 1 01-01-2015 item1        0.1
# 2 01-01-2015 item2        0.0
# 3 01-02-2015 item1        0.0

score 0 · Answer 2 · answered Feb 02 '16 at 21:01

So either your example code is insufficient, or your problem description is. In particular, if one dataset is filtered from another then you'd never expect entries in filtered that weren't in unfiltered.

In any case, here's a solution to at least one of your problems:

itemsAndDate = unique(rbind(unFiltered[,c("date", "item")],
                            filtered[,c("date", "item")]))

## Here is how you would expand the concept to unobserved things.
combos = expand.grid(itemsAndDate[,1], itemsAndDate[,2])
head(combos)

combined = merge(merge(itemsAndDate, unFiltered, by = c("date", "item"), all.x = TRUE), filtered, by = c("date", "item"), all.x = TRUE)
head(combined)

I'm not sure what you mean by your second sentence—if I filter ``Dataset A`` and call the filtered data ``Dataset B``, then ``Dataset B`` is different than ``Dataset A``; while it may be a subset, it can't be said that its being a subset makes the problem trivial. — AmagicalFishy, Feb 02 '16 at 21:06

R - Comparing differently sized data frames

2 Answers2