0

Let's say these are my two datasets:

# This is dataset "df"
df <- tibble(country = rep(c("US", "UK", "FRA", "SPA"), times = 3),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01"), each = 4))

   country date
 1 FRA     2020-01-01
 2 SPA     2020-01-01
 3 UK      2020-01-01
 4 US      2020-01-01
 5 FRA     2020-02-01
 6 SPA     2020-02-01
 7 UK      2020-02-01
 8 US      2020-02-01
 9 FRA     2020-03-01
10 SPA     2020-03-01
11 UK      2020-03-01
12 US      2020-03-01

# This is dataset "dd"
dd <- tibble(country = rep(c("US", "UK", "FRA"), times = 3),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01"), each = 3),
             cases = seq(1:9),
             deaths = seq(from = 2, to = 18, by = 2))

  country date       cases deaths
1 FRA     2020-01-01     1      2
2 UK      2020-01-01     2      4
3 US      2020-01-01     3      6
4 FRA     2020-02-01     4      8
5 UK      2020-02-01     5     10
6 US      2020-02-01     6     12
7 FRA     2020-03-01     7     14
8 UK      2020-03-01     8     16
9 US      2020-03-01     9     18

I would like to create a new dataset called dj which combines df and the cases column in dd by country and date. I also want NA to be printed in the cases column (in dj) where there is missing data (i.e. for country "SPA").

In essence, I am trying to get dj to have 12 rows and 3 columns (country, date, cases), with NA for cases in "SPA". Like this:

   country date       cases
1  FRA     2020-01-01     1
2  SPA     2020-01-01    NA
3  UK      2020-01-01     2
4  US      2020-01-01     3
5  FRA     2020-02-01     4
6  SPA     2020-02-01    NA
7  UK      2020-02-01     5
8  US      2020-02-01     6
9  FRA     2020-03-01     7
10 SPA     2020-03-01    NA
11 UK      2020-03-01     8
12 US      2020-03-01     9

I know you can combine two datasets using dplyr package *_join() functions, but how can I use these when I only want to merge a dataset with one particular column from another dataset and use more than one 'by' argument (i.e. country and date)??

I'm having difficulty in figuring this out myself since I'm new to R, so any help would be very much appreciated :)

kiwi
  • 565
  • 3
  • 11

2 Answers2

1

With dplyr you can use left_join and then select which fields you do or do not want. Remember ?left_join will indicate the "left" dataframe (df here) will keep all records which will maintain the SPA values.

You could also perform a select on the dataframes before the join, removing 'deaths' from df. The important piece here is making sure you're using the correct join to keep all the records within df.

library(dplyr)

dj <- left_join(df, dd, by = c('date', 'country')) %>%
  select(-deaths) # OR select(country, date, cases)

Output:

# A tibble: 12 x 3
   country date       cases
   <chr>   <chr>      <int>
 1 FRA     2020-01-01     3
 2 FRA     2020-02-01     6
 3 FRA     2020-03-01     9
 4 SPA     2020-01-01    NA
 5 SPA     2020-02-01    NA
 6 SPA     2020-03-01    NA
 7 UK      2020-01-01     2
 8 UK      2020-02-01     5
 9 UK      2020-03-01     8
10 US      2020-01-01     1
11 US      2020-02-01     4
12 US      2020-03-01     7
TTS
  • 1,818
  • 7
  • 16
1

You can use left_join(), as follows:

dj <- left_join(df, dd, by = c('country', 'date')) %>%
  select(-deaths)
dj

Here is the output:

# A tibble: 12 x 3
   country date       cases
   <chr>   <chr>      <int>
 1 FRA     2020-01-01     3
 2 FRA     2020-02-01     6
 3 FRA     2020-03-01     9
 4 SPA     2020-01-01    NA
 5 SPA     2020-02-01    NA
 6 SPA     2020-03-01    NA
 7 UK      2020-01-01     2
 8 UK      2020-02-01     5
 9 UK      2020-03-01     8
10 US      2020-01-01     1
11 US      2020-02-01     4
12 US      2020-03-01     7
rodolfoksveiga
  • 1,181
  • 4
  • 17