Merge two data frames by a max number condition in r

Question

Cheers, I have a data frame df1 with the Major City with max visitors in 2011.

df1:
Country      City             Visitors_2011
UK           London           100000
USA          Washington D.C   200000
USA          New York         100000
France       Paris            100000

The other data frame df2 consists of top visited cities in the country for 2012:

df2:
Country      City               Visitors_2012
USA          Washington D.C     200000
USA          New York           100000
USA          Las Angeles        100000
UK           London             100000
UK           Manchester         100000
France       Paris              100000
France       Nice               100000

The Output I would need is: Logic: To obtain df3, merge df1 and df2 by Country and City and if you can't find city in df1 then add that volume to biggest city in df1. Example: Los Angeles visitor count here is added to Washington D.C because Los Angeles is not present in df1 and Washington D.C has more visitors(2012) than New York.

df3:
Country      City             Visitors_2011    Visitors_2012
UK           London           100000           200000
USA          Washington D.C   200000           300000
USA          New York         100000           100000
France       Paris            100000           200000

Can anyone point me to the right direction?

Lookup `?match` and try `df3 <- df1; df3$Visitors_2012 <- df2$Visitors_2012[match(df1$City, df2$City)];` — Maurits Evers, Oct 08 '16 at 07:47
@Maurits this won't merge cities that are not in df1 as specified by the question. — mathematical.coffee, Oct 08 '16 at 07:53
@SairamReddy, you seem to be asking many very similar questions. Perhaps try concentrate on one at a time. It could be that the tools to solve one will also solve the other. It would help also to provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), as well as show us what you have tried. — mathematical.coffee, Oct 08 '16 at 07:54
@mathematical.coffee yes you're right. Sloppy reading on my part. Ignore my comment. — Maurits Evers, Oct 08 '16 at 07:59
I actually did spend some time on your previous question and managed to solve it but by the time I went to post my answer the question was gone. — Sotos, Oct 08 '16 at 08:09
@MauritsEvers I just wanted to change the question so that it's better understandable. — Sairam Reddy, Oct 08 '16 at 08:14
@Sotos Apologies for taking off the question. Just wanted to make better understandable. Nevertheless can you direct me to the approach? — Sairam Reddy, Oct 08 '16 at 08:19
Well I am not at a computer anymore but I will post an answer as soon as I get home if it is still unanswered. — Sotos, Oct 08 '16 at 08:30

Maurits Evers · Answer 1 · 2016-10-08T08:49:37.673

Assume df1.txt and df2.txt contain your space-delimited dataframes.

Here is a solution in base R:

df1 <- read.table("df1.txt", header = T, stringsAsFactors = F);
df2 <- read.table("df2.txt", header = T, stringsAsFactors = F);

# Merge with all = TRUE, see ?merge 
df <- merge(df1, df2, all = TRUE);

# Deal with missing values
tmp <- lapply(split(df, df$Country), function(x) {
    # Make sure NA's are at the bottom
    x <- x[order(x$Visitors_2011), ];
    # Select first max Visitors_2012 entry
    idx <- which.max(x$Visitors_2012);
    # Add any NA's to max entry
    x$Visitors_2012[idx] <- x$Visitors_2012[idx] + sum(x$Visitors_2012[is.na(x$Visitors_2011)]);
    # Return dataframe
    return(x[!is.na(x$Visitors_2011), ])});
# Bind list entries into dataframe
df <- do.call(rbind, tmp);

print(df);
       Country           City Visitors_2011 Visitors_2012
France  France          Paris        100000        200000
UK          UK         London        100000        200000
USA.6      USA       New_York        100000        100000
USA.7      USA Washington_D.C        200000        300000

aichao · Accepted Answer · 2016-10-08T11:11:00.457

A dplyr approach:

library(dplyr)
max.cities <- df1 %>% group_by(Country) %>% summarise(City = City[which.max(Visitors_2011)])
result <- df2 %>% mutate(City=ifelse(City %in% df1$City, City,
                                     max.cities$City[match(Country, max.cities$Country)])) %>%
                  group_by(Country,City) %>%
                  summarise(Visitors_2012=sum(Visitors_2012)) %>% 
                  left_join(df1,., by=c("Country", "City"))

Notes:

First, compute the City that has the max visitors group_by Country in df1 and set that to a separate data frame max.cities.
mutate the City column in df2 so that if the City is in df1, then the name is unchanged; otherwise, the City from max.cites that matches the Country is used.
Once the City has been suitably modified, group_by both Country and City and sum up the Visitors_2012.
Finally, left_join with df1 by c("Country", "City") to get the final result.

The result using your posted data is as expected:

print(result)
##  Country           City Visitors_2011 Visitors_2012
##1      UK         London        100000        200000
##2     USA Washington D.C        200000        300000
##3     USA       New York        100000        100000
##4  France          Paris        100000        200000

Merge two data frames by a max number condition in r

2 Answers2