Problems of joining datasets on R

Question

I have a dataset containing variables and a quantity of goods sold: for some days, however, there are no values.

I created a dataset with all 0 values in sales and all NA in the rest. How can I add those lines to the initial dataset?

At the moment, I have this:

sales
day    month    year    employees    holiday    sales
1      1        2018    14           0          1058
2      1        2018    25           1          2174 
4      1        2018    11           0          987

sales.NA
day    month    year    employees    holiday    sales
1      1        2018    NA           NA         0
2      1        2018    NA           NA         0
3      1        2018    NA           NA         0
4      1        2018    NA           NA         0

I would like to create a new dataset, inserting the days where I have no observations, value 0 to sales, and NA on all other variables. Like this

new.data
day    month    year    employees    holiday    sales
1      1        2018    14           0          1058
2      1        2018    25           1          2174 
3      1        2018    NA           NA         0
4      1        2018    11           0          987

I tried used something like this

merge(sales.NA,sales, all.y=T, by = c("day","month","year"))

But it does not work

Sorry for my forgetfulness: The match should be made based on the day, month and year columns — Kevin, Jan 17 '19 at 10:15
If you can make a small example of both datasets that would be great. Please refer to: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — s_baldur, Jan 17 '19 at 10:17
I would create another dataset where you are storing all the dates continously and then just merge the two sales datasets against that one. You can use `dput` to provide the datasets, making it easier to help you. — hannes101, Jan 17 '19 at 10:37
Use an ifelse statement instead after merginng - with the condition of outcome being 0 — , Jan 17 '19 at 10:44

score 1 · Answer 1 · answered Jan 17 '19 at 10:52

Using dplyr, you could use a "right_join". For example:

sales <- data.frame(day = c(1,2,4), 
                    month = c(1,1,1),
                    year = c(2018, 2018, 2018),
                    employees = c(14, 25, 11), 
                    holiday = c(0,1,0), 
                    sales = c(1058, 2174, 987)
                    )

sales.NA <- data.frame(day = c(1,2,3,4),
                       month = c(1,1,1,1), 
                       year = c(2018,2018,2018, 2018)
                       )

right_join(sales, sales.NA)

This leaves you with

  day month year employees holiday sales
1   1     1 2018        14       0  1058
2   2     1 2018        25       1  2174
3   3     1 2018        NA      NA    NA
4   4     1 2018        11       0   987

This leaves NA in sales where you want 0, but that could be fixed by including the sales data in sales.NA, or you could use "tidyr"

right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))

hannes101 · Answer 2 · 2019-01-17T12:42:44.643

That's an answer using the data.table package, since I am more familiar with the syntax, but regular data.frames should work pretty much the same. I also would switch to a proper date format, which will make life easier for you down the line. Actually, in this way you would not need the Sales.NA table, since it would automatically be solved by all days which have NAs after the first join.

library(data.table)


dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day"  ))
dt.sales <- data.table(day = c(1,2,4)
                       , month = c(1,1,1)
                       , year = c(2018,2018,2018)
                       , employees = c(14, 25, 11)
                       , holiday = c(0,1,0)
                       , sales = c(1058, 2174, 987)
                       )


dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]

merge( x = dt.dates
       , y = dt.sales
       , by.x = "Date"
       , by.y = "Date"
       , all.x = TRUE
)
>             Date day month year employees holiday sales
    1: 2018-01-01   1     1 2018        14       0  1058
    2: 2018-01-02   2     1 2018        25       1  2174
    3: 2018-01-03  NA    NA   NA        NA      NA    NA
    4: 2018-01-04   4     1 2018        11       0   987
...

s_baldur · Answer 3 · 2019-01-17T10:50:38.453

Here is another data.table solution:

jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]

   day month year employees holiday sales
1:   1     1 2018        14       0  1058
2:   2     1 2018        25       1  2174
3:   3     1 2018        NA      NA     0
4:   4     1 2018        11       0   987

Or with some neater syntax:

sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]

Reproducible data:

sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L, 
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L, 
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L, 
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA, 
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))

Problems of joining datasets on R

3 Answers3