Two dataframes in R: How to match multiple columns by row to find another row value

Question

Rather than doing a nested for loop like this:

    for (rowAll in 1:nrow(groupDataUnadjusted)) {
      year <- groupDataUnadjusted[rowAll, "year"]
      income  <- groupDataUnadjusted[rowAll, "income_group"]
      joint  <- groupDataUnadjusted[rowAll, "Joint"]
      child  <- groupDataUnadjusted[rowAll, "children"]

      for (rowPuf in 1:nrow(nationalPuf)) {
        yearPuf <- nationalPuf[rowPuf, "year"]
        incomePuf  <- nationalPuf[rowPuf, "income_group"]
        jointPuf  <- nationalPuf[rowPuf, "Joint"]
        childPuf  <- nationalPuf[rowPuf, "children"]

        if ((year == yearPuf) && (income == incomePuf) && (joint == jointPuf) && (child == childPuf)) {
          groupDataUnadjusted[rowAll, 'tax_difference_pct'] <- groupDataUnadjusted[rowAll, 'tax_difference_pct']   + nationalPuf[rowPuf, 'diff']
          break
        }
      }
    }
    groupDataAdjusted <- groupDataUnadjusted

I feel like there must be a faster way to find the corresponding rows between two dataframes. I am taking two dataframes of different lengths and looking for rows where four columns (year, income_group, Joint and children) are the same. If they are the same, I know that row is a match between them. Then I take one value from that row and add it to a value in the other dataframe.

But there must be a better way in R.

You cannot really expect an answer without an example, an answer is usually very contextually dependent. — zacdav, Dec 01 '17 at 04:29
@zacdav I disagree. This is such a simple query pattern. It's obvious from my example code. I'm matching on a few columns to try to find another column. — Union find, Dec 01 '17 at 04:30
Obvious to you as the person with all the information sure. You've taken something that contextually makes sense to you and abstracted it to a general question. A specific code example with a non-specific question. To be honest you can google " way to find the corresponding rows between two dataframes" and find the answer as the first link. — zacdav, Dec 01 '17 at 04:33
@zacdav You are completely wrong and I'll leave it at that. And from your "to be honest" you make it clear it's a simple question -- like I said. — Union find, Dec 01 '17 at 04:34
Actually, an example would help here, as in addition to joining, you are also aggregating the `diff` variable over groups that match in both data frames. — Alex, Dec 01 '17 at 04:34
@Alex It wouldn't help but isn't necessary. The downvotes are uncalled for. — Union find, Dec 01 '17 at 04:35
@incodeveritas exactly my point. A simple question should not have an abstracted example especially if this is something you could have just googled to solve. Particularly in data manipulation, a clear example is everything. Don't sit here and think that every person voting here is wrong. — zacdav, Dec 01 '17 at 04:38
or this: https://stackoverflow.com/questions/5031116/joining-aggregated-values-back-to-the-original-data-frame — Alex, Dec 01 '17 at 04:40
Anyway, joining data frames has been extensively covered on this site, as well as aggregation by groups. With small datasets the order should not matter. — Alex, Dec 01 '17 at 04:41
@Alex. Great feel free to close. In the past you could delete a question like this.. even with an answer. You no longer can. — Union find, Dec 01 '17 at 04:43

Sarah · Answer 1 · 2017-12-01T04:27:17.303

1

You can use the join functions from dplyr.

Which one to use depends on whether you want to keep all rows or only the ones with a match.

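    library(dplyr)
    # Left join on the four key columns, then add the matched diff.
    # Rows without a match end up with NA, because diff is NA for them.
    groupDataAdjusted <- left_join(groupDataUnadjusted, nationalPuf,
                                   by = c("year", "income_group", "Joint", "children")) %>%
      mutate(tax_difference_pct = tax_difference_pct + diff)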

Note this is untested as you did not provide reproducible data, but should give you the idea.
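Since the question has no reproducible data, here is a minimal sketch on made-up data frames. Only the column names come from the question; every value below is invented for illustration. The sketch also wraps diff in coalesce() so that rows without a match keep their original value, which mirrors what the loop in the question does. That choice is an assumption about the desired behaviour.

```r
library(dplyr)

# Toy data: invented values, column names taken from the question
groupDataUnadjusted <- data.frame(
  year = c(2016, 2016, 2017),
  income_group = c(1, 2, 1),
  Joint = c(0, 1, 0),
  children = c(0, 2, 1),
  tax_difference_pct = c(0.10, 0.20, 0.30)
)
nationalPuf <- data.frame(
  year = c(2016, 2016),
  income_group = c(1, 2),
  Joint = c(0, 1),
  children = c(0, 2),
  diff = c(0.01, 0.02)
)

keys <- c("year", "income_group", "Joint", "children")

groupDataAdjusted <- groupDataUnadjusted %>%
  left_join(nationalPuf, by = keys) %>%
  # After the join, unmatched rows have diff = NA; treat that as "add nothing"
  mutate(tax_difference_pct = tax_difference_pct + coalesce(diff, 0)) %>%
  # Drop the helper column so the result has the original columns only
  select(-diff)

groupDataAdjusted
```

Here the third row (year 2017) has no counterpart in nationalPuf, so it is kept unchanged. If nationalPuf can contain several rows for the same key combination, a left join repeats the matching row of groupDataUnadjusted once per match, whereas the loop in the question stops at the first match.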

If these are the only column names the two data frames have in common, you don't have to specify "by".

Alternatively, use full_join to keep all rows from both data frames.
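To make the difference between the join types concrete, the sketch below counts the rows each one returns on two small made-up data frames. The values are invented; only the column names come from the question. One key combination appears in both data frames, one only in the first and one only in the second.

```r
library(dplyr)

# Toy data: invented values, column names taken from the question
groupDataUnadjusted <- data.frame(
  year = c(2016, 2017),
  income_group = c(1, 1),
  Joint = c(0, 0),
  children = c(0, 1),
  tax_difference_pct = c(0.10, 0.30)
)
nationalPuf <- data.frame(
  year = c(2016, 2018),
  income_group = c(1, 3),
  Joint = c(0, 1),
  children = c(0, 2),
  diff = c(0.01, 0.05)
)

keys <- c("year", "income_group", "Joint", "children")

# inner_join: only key combinations present in both data frames
n_inner <- nrow(inner_join(groupDataUnadjusted, nationalPuf, by = keys))
# left_join: every row of the first data frame, matched or not
n_left <- nrow(left_join(groupDataUnadjusted, nationalPuf, by = keys))
# full_join: every row of both data frames
n_full <- nrow(full_join(groupDataUnadjusted, nationalPuf, by = keys))

c(inner = n_inner, left = n_left, full = n_full)
```

For the loop in the question, which updates every row of groupDataUnadjusted and never adds rows, left_join looks like the closest fit; full_join would also bring in rows that exist only in nationalPuf.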

See top right of 2nd page of this: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

edited Dec 01 '17 at 04:27

answered Dec 01 '17 at 04:26


The DFs are different lengths. – Union find Dec 01 '17 at 04:27
It doesn't matter if they are different lengths. – zacdav Dec 01 '17 at 04:28
Different lengths shouldn't matter in a join. If this doesn't give the desired output, then I'd need example data to better explain your desired output. – Sarah Dec 01 '17 at 04:40
You probably want: `groupDataAdjusted <- left_join(groupDataUnadjusted, nationalPuf, by = c("year", "income_group", "Joint", "children")) %>% group_by(year, income_group, Joint, children) %>% summarise(tax_difference_pct = tax_difference_pct[1] + sum(diff, na.rm = TRUE)) %>% ungroup()`. This assumes that the grouping variables define unique rows in `groupDataUnadjusted`. – Alex Dec 01 '17 at 04:44
