5

I have an R data processing problem. I have a large set of data in a .csv file that I can load in using readr. The problem is that there are missing values that I need to add to the dataframe.

Example data:

X1    X2    Value
Coal  1     300
Coal  3     100
Coal  5     150
NatG  2     175
NatG  4     150

This file will load in correctly. However, I want to add processing so that the final dataframe contains an entry for every X1 type and the entire sequence 1:5 in X2 with the missing values set equal to zero. The final dataframe would like like this:

X1    X2    Value
Coal  1     300
Coal  2     0
Coal  3     100
Coal  4     0
Coal  5     150
NatG  1     0
NatG  2     175
NatG  3     0
NatG  4     150
NatG  5     0

I tried using readr to load in a file that had all entries equal to zero and then read in the real data to the same data frame, but it was a destructive overwrite and the missing rows were removed. I need to know how to add appropriate rows to the dataframe without knowing which values in the 1:5 sequence will be missing for each value under X1.

user20650
  • 24,654
  • 5
  • 56
  • 91
doriengard
  • 51
  • 1
  • 5
  • doriengard; for future editing, to add the example data, and code in the code bloacks (in grey) you can indetn each line by four spaces or highlight the code you want and click the `{}` icon. – user20650 Dec 23 '17 at 22:40
  • maybe useful: https://stackoverflow.com/questions/10438969/fastest-way-to-add-rows-for-missing-values-in-a-data-frame ; https://stackoverflow.com/questions/41007851/expanding-long-format-time-series-data-with-missing-rows ; and given www's answers https://stackoverflow.com/questions/39961798/how-to-complete-missing-factor-levels-in-data-frame ; https://stackoverflow.com/questions/32874239/how-do-i-use-tidyr-to-fill-in-completed-rows-within-each-value-of-a-grouping-var – user20650 Dec 23 '17 at 22:42

2 Answers2

7

We can use the complete function from the package. The complete function creates combination based on columns provided. The fill argument can specify the fill values.

library(tidyr)

dt2 <- dt %>%
  complete(X1, X2, fill = list(Value = 0))
dt2
# # A tibble: 10 x 3
#       X1    X2 Value
#    <chr> <int> <dbl>
#  1  Coal     1   300
#  2  Coal     2     0
#  3  Coal     3   100
#  4  Coal     4     0
#  5  Coal     5   150
#  6  NatG     1     0
#  7  NatG     2   175
#  8  NatG     3     0
#  9  NatG     4   150
# 10  NatG     5     0

We can also specify the group values for the combination if some group values are missing. For example, we can do the following to make sure 1 to 5 in column X2 all has been considered.

dt2 <- dt %>%
  complete(X1, X2 = 1:5, fill = list(Value = 0))

Or use full_seq, which will automatically create the sequence based on the minimum and maximum values in a column.

dt2 <- dt %>%
  complete(X1, X2 = full_seq(X2, period = 1), fill = list(Value = 0))

DATA

dt <- read.table(text = "X1    X2    Value
Coal  1     300
                 Coal  3     100
                 Coal  5     150
                 NatG  2     175
                 NatG  4     150",
                 header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84
3

Great solution from @www with tidyr. This is a way to do it in base R.

levels <- 1:5

type <- c("Coal", "NatG")

df2 <- expand.grid(levels,type)

df3 <- merge(dt, df2, by.x=c("X1","X2"), by.y=c("Var2","Var1"),all.y=T)

df3[is.na(df3)] <- 0

     X1 X2 Value
1  Coal  1   300
2  Coal  2     0
3  Coal  3   100
4  Coal  4     0
5  Coal  5   150
6  NatG  1     0
7  NatG  2   175
8  NatG  3     0
9  NatG  4   150
10 NatG  5     0
Mako212
  • 6,787
  • 1
  • 18
  • 37
  • Got these to work, but with a small addition. Reading in the data from a .csv file using [readr] caused the initial input to be a dataframe. I had to change it to a data.table first using [setDT] from the [data.table] package. After doing this, both solutions work. – doriengard Dec 24 '17 at 17:16
  • Nothing here should require `data.table`. You might need to set `dt` `as.data.frame`, but that's it. I just carried `dt` through as the name for your primary table because that's what you used. – Mako212 Dec 25 '17 at 00:37