1

I'm working in R and am using data.table. I have a data set that looks like this:

ID   country_id    weight
1    BGD           56
1    NA            57
1    NA            63
2    SA            12
2    NA            53
2    SA            54

If the value in country_id is NA I need to replace it with a non-na country_id value given to the same ID. I would like the data set to look like this:

ID   country_id    weight
1    BGD           56
1    BGD           57
1    BGD           63
2    SA            12
2    SA            53
2    SA            54

This data set contains millions of IDs, so fixing each ID manually is not an option.

Thanks for your help!

Edit: Solved!

I used this code: dt[, country_id := country_id[!is.na(country_id)][1], by = ID]

bziggy
  • 463
  • 5
  • 19
  • I think [tidyr::fill](https://tidyr.tidyverse.org/reference/fill.html) is what you need. – Stéphane Laurent Nov 26 '19 at 15:24
  • or with ```na.locf``` from ```library(zoo)``` – jeffverboon Nov 26 '19 at 15:26
  • 1
    Does this answer your question? [Replace missing values (NA) with most recent non-NA by group](https://stackoverflow.com/questions/23340150/replace-missing-values-na-with-most-recent-non-na-by-group) – camille Nov 26 '19 at 15:28
  • 1
    or `dt[, country_id := country_id[!is.na(country_id)][1], by = ID]` should work – Andrew Nov 26 '19 at 15:31
  • 1
    @Andrew Thank you!! This worked! – bziggy Nov 26 '19 at 15:38
  • Two people have *incorrectly* marked this post as a duplicated of another one. This post is using `data.table` while the other post asks explicitly for a *dplyr-way* – s_baldur Nov 26 '19 at 16:08
  • 3
    @sindri_baldur that's fair, although the [list of 29 questions linking](https://stackoverflow.com/questions/linked/23340150?lq=1) to the dupe I flagged include several data.table ones. I tried to find a post that a lot of others linked back to. Do you want to mark a more data.table-specific one? It's certainly a question that's been covered – camille Nov 26 '19 at 19:24
  • 1
    You might want to track this [GitHub issue](https://github.com/Rdatatable/data.table/issues/3992) on `nafill` and `setnafill` for character columns. – ismirsehregal Nov 27 '19 at 05:42

3 Answers3

2

Another option is to use a join:

DT[is.na(country_id), country_id := 
    DT[!is.na(country_id)][.SD, on=.(ID), mult="first", country_id]]

Explanation:

  1. DT[is.na(country_id) subset the dataset to those with NAs in country_id column

  2. .SD is the Subset of Data (also a data.table) from the previous step.

  3. DT[!is.na(country_id)][.SD, on=.(ID) left joins .SD with DT[!is.na(country_id)] using ID as the key.

  4. j=country_id returns the country_id column from the right table DT[!is.na(country_id)] and where there are multiple matches, mult="first" returns the first match.

  5. country_id := updates the column country_id at rows of DT where is.na(country_id) is TRUE, to the results from the join.

Timing code and similar but larger data as per Andrew's:

library(data.table)
set.seed(42)

nr <- 1e7
dt <- data.table(ID = rep(1:(nr/4), each = 4),
    country_id = rep(rep(c("BGD", "SA", "USA", "DEN", "THI"), each = 4)),
    weight = sample(10:100, nr, TRUE))
dt[sample(1:nr, nr/2), country_id := NA]
DT <- copy(dt)

microbenchmark::microbenchmark(
    first_nonmissing = dt[, country_id := country_id[!is.na(country_id)][1L], by = ID],
    use_join=DT[is.na(country_id), country_id := DT[!is.na(country_id)][.SD, on=.(ID), mult="first", country_id]],
    times = 1L
)

timings:

Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 first_nonmissing 3282.1373 3282.1373 3282.1373 3282.1373 3282.1373 3282.1373     1
         use_join  554.5314  554.5314  554.5314  554.5314  554.5314  554.5314     1
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
1

Taking the answers / suggestions in the comments, you have a few options. I simulated a dataset with 1,000,000 rows and 30% missing in your country_id column to get an idea of what scales best in your situation.

The answer that scales best in this benchmark replaces the NA with the first non-missing value with the same ID: dt[, country_id := country_id[!is.na(country_id)][1], by = ID].

Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 first_nonmissing  253.0039  267.0272  284.3988  271.4015  274.5101  405.2004    10
            tidyr  943.6658  951.9638  970.7185  960.6233  971.0660 1069.3023    10
          na.locf 7173.9556 7218.2757 7267.6968 7271.0279 7325.6820 7344.9142    10

Benchmark code:

microbenchmark::microbenchmark(
  first_nonmissing = dt[, country_id := country_id[!is.na(country_id)][1], by = ID],
  tidyr = tidyr::fill(dplyr::group_by(dt, ID), country_id),
  na.locf = dt[, country_id := zoo::na.locf(country_id, na.rm = FALSE), by = ID],
  times = 10
)

Data:

library(data.table)
set.seed(42)

dt <- data.table(ID = rep(1:250000, each = 4),
                 country_id = rep(rep(c("BGD", "SA", "USA", "DEN", "THI"), each = 4)),
                 weight = sample(10:100, 1e6, replace = T))

dt$country_id[sample(1:1e6, 3e5)] <- NA
Andrew
  • 5,028
  • 2
  • 11
  • 21
0

Hope the following code can help you to fill up NA

res <- Reduce(rbind,
       lapply(split(df,df$ID), function(v) 
         {v$country_id <- head(v$country_id[!is.na(v$country_id)],1);v}))

yielding

  ID country_id weight
1  1        BGD     56
2  1        BGD     57
3  1        BGD     63
4  2         SA     12
5  2         SA     53
6  2         SA     54
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81