0

I have two dataframes in R, one of them quite big (say 150000 observations with 160 variables) and one smaller (76 observations of 5 variables).

One of the variables in the big dataframe is country saved as a string, while the other consists of various countries and specific characteristics. Now I want to construct new variables in my new dataframe by adding columns for each of those characteristics and linking an observation to the characteristics of their corresponding country. I have however a few problems:

  1. One problem is that not all countries are represented in the smaller dataframe, I'd want to drop observations in my first frame from a country not represented in the second.
  2. The second problem is that it seems that I can't use the standard merge function, as the second dataframe is formatted in the following way:
Country Var1        Var2        Var3        Var4        Var5
NIC     -0.61252    -0.54723    -0.41597    -0.38825    -0.17819
RWA     -0.60603    -0.28969    -0.57998    -0.05933    -0.14199
GEO     -0.48543    -0.08132    0.56275     -0.25436    0.62782

While my first dataframe is formatted in the following way

CNTRY Var1 Var2 etc

Does it seem that I'll need to hardcode a function myself?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
pokemonfan
  • 209
  • 1
  • 2
  • 9

2 Answers2

0

you could use:

for 1:

library(dplyr)
bigdataframe %>% filter(Country%in%smalldataframe$CNTRY) -> newdataframe

for 2:

merge(smalldataframe, newdataframe, by = c("CNTRY","Country"))
ava
  • 840
  • 5
  • 19
  • Hi thanks for your answer! The filter part was exactly what I was looking for. However, when merging the data I get the problem ```Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column```. I do not understand this error after reading about possible causes and the function documentation. Do you know what's causing this/did you have the same problems? – pokemonfan May 11 '19 at 11:36
0

If we call the big one df1 and the small one df2, you should be able to do a join that satisfies your demands. If I understand your question correctly you want a dataframe with many rows (close to 150k) each of which has the country attributes attached. But you want to drop any of the 150k rows that do not have a corresponding country.

As suggested by @aurelianoguedes you should just do an inner join straight away, which gets rid of any rows in the big dataframe that don't have a match in the small df.

df.merged = dplyr::inner_join(df1, df2, by=c('Country' = 'CNTRY'))

Old solution with a filter + left_join:

df.merged = df1 %>%
  filter(Country %in% df2$CNTRY) %>%
  left_join(df2, by=c('Country' = 'CNTRY'))
  • I have similar issues with this solution as with the other one (see comment below on Ava's answer). Filter woks just fine, but ```left_join``` gives me the following error ```Error: `by` can't contain join column `Country` which is missing from LHS```. What's could be the cause of this error? – pokemonfan May 11 '19 at 11:39
  • I had a mistake in my original answer, updated now (should use `=` for the `by` argument). Nonetheless, are you sure `Country` is in the df1 (big dataframe)? It would be best to update your original question by adding the actual output of `head(df)` or `str(df)` so we can see what's really in your dataframe. This is why [reproducible examples](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) are important, so we can isolate the problem :) – Peter Smittenaar May 11 '19 at 13:28
  • 1
    In his case I think is better to use the `inner_join` leading to drop the missing rows – Aureliano Guedes May 11 '19 at 13:38
  • Late last night I dismissed `inner_join` for some reason, but I can't see why anymore :) I'll update answer, thanks – Peter Smittenaar May 11 '19 at 13:40