0

I'm trying to bind a data frame with the last column of another dataframe, but they have a different number of rows. I want to fill with NA this gap, but by group. Example:

I have df and df_B:

library(lubridate)
library(ggplot2)
library(tidyr)

date <- seq(from = as_date("2019-11-01"), to = as_date("2020-10-01"), by = "month")
            
country <- c(rep("Brazil",6), rep("UK", 6))

A <- 1:12

df <- data.frame(date, country, A)
> view(df)

         date country  A
1  2019-11-01  Brazil  1
2  2019-12-01  Brazil  2
3  2020-01-01  Brazil  3
4  2020-02-01  Brazil  4
5  2020-03-01  Brazil  5
6  2020-04-01  Brazil  6
7  2020-05-01      UK  7
8  2020-06-01      UK  8
9  2020-07-01      UK  9
10 2020-08-01      UK 10
11 2020-09-01      UK 11
12 2020-10-01      UK 12

date <- c(seq(from = as_date("2020-01-01"), to = as_date("2020-04-01"), by = "month"),
           seq(from = as_date("2020-07-01"), to = as_date("2020-10-01"), by = "month"))
            
country <- c(rep("Brazil",4), rep("UK", 4))

B <- 5:12

df_B <- data.frame(date, country, B)
> view(df_B)

        date  country B
1 2020-01-01  Brazil  5
2 2020-02-01  Brazil  6
3 2020-03-01  Brazil  7
4 2020-04-01  Brazil  8
5 2020-07-01      UK  9
6 2020-08-01      UK 10
7 2020-09-01      UK 11
8 2020-10-01      UK 12

What I want is the following result:

         date country  A  B
1  2019-11-01  Brazil  1  NA
2  2019-12-01  Brazil  2  NA
3  2020-01-01  Brazil  3  5
4  2020-02-01  Brazil  4  6
5  2020-03-01  Brazil  5  7
6  2020-04-01  Brazil  6  8
7  2020-05-01      UK  7  NA
8  2020-06-01      UK  8  NA
9  2020-07-01      UK  9  9
10 2020-08-01      UK 10  10
11 2020-09-01      UK 11  11
12 2020-10-01      UK 12  12

I appreciate it if someone can help!!

r2evans
  • 141,215
  • 6
  • 77
  • 149
Jones
  • 333
  • 1
  • 11

1 Answers1

2

This is a merge/join operation. More explanation on these concepts can be found How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272.

Using dplyr,

library(dplyr)
left_join(df, df_B, by=c("date", "country"))
#          date country  A  B
# 1  2019-11-01  Brazil  1 NA
# 2  2019-12-01  Brazil  2 NA
# 3  2020-01-01  Brazil  3  5
# 4  2020-02-01  Brazil  4  6
# 5  2020-03-01  Brazil  5  7
# 6  2020-04-01  Brazil  6  8
# 7  2020-05-01      UK  7 NA
# 8  2020-06-01      UK  8 NA
# 9  2020-07-01      UK  9  9
# 10 2020-08-01      UK 10 10
# 11 2020-09-01      UK 11 11
# 12 2020-10-01      UK 12 12

As a word of "caution", or least a simple warning: joining with numeric-like columns is fine as long as they are effectively discrete, such as integer, or POSIXct/Date that are close-enough to be integers. If you note that dput(df$date) shows integer-like numbers, but they are really numeric not integers, we are now in the realm of "R doing what one would expect to happen.

However, dates can (if mathematically calculated) be partial-days, as in

Sys.Date() + c(0, 0.5)
# [1] "2020-12-14" "2020-12-14"
dput(Sys.Date() + c(0, 0.5))
# structure(c(18610, 18610.5), class = "Date")

we realize that what looks discrete on the surface may not be. Going one step further, the concept of "equality" with continuous (not discrete) numbers is a hit-or-miss thing, in R and many/most other programming languages. Computers have limitations when it comes to floating-point numbers (aka double, numeric, float). This is a fundamental limitation of computers in general, in how they deal with non-integer numbers. There are some add-on libraries or packages that are much better at arbitrary-precision math, but I believe most main-stream languages (this is relative/subjective, I admit) do not use these by default. Refs: Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754

With this in mind, if in the future you do a search using Date objects and something is not joined correctly, take a look at whether your numbers are truly discrete or if you have some floating-point dates in there.

r2evans
  • 141,215
  • 6
  • 77
  • 149