1

I have a data frame that looks like this:

line = c(1, 2, NA, 4 ,5, NA, 7)
group = c("1.0 Group A", "2.0 Group B", "3.0 Group C", "4.0 Group D", "5.0  Group E", "6.0 Group F", "7.0 Group G")
df <- data.frame(line, group)

view(df)
   line    group
1    1   1.0 Group A
2    2   2.0 Group B
3   NA   3.0 Group C
4    4   4.0 Group D
5    5   5.0 Group E
6   NA   6.0 Group F
7    7   7.0 Group G

What I want to do is to find all the NA value in the "line" column and place a row underneath that row in "group" column saying "Not Applicable". So that the new data frame should look like:

view(df)
    line    group
1    1   1.0 Group A
2    2   2.0 Group B
3   NA   3.0 Group C
4   NA   Not Applicable
5    4   4.0 Group D
6    5   5.0 Group E
7   NA   6.0 Group F
8   NA  Not Applicable
9    7   7.0 Group G

I am thinking about using an ifelse statement or using case_when from dplyr. But I don't know how to work it out. Does anyone have any suggestion?

Thank you!

Connie
  • 89
  • 1
  • 1
  • 8
  • Is there a particular reason to do this? It's not a very `tidy` approach, since you are basically inserting values in your table that aren't actually values. If you want to do something at these particular boundaries, you could create a vector of indices; if you want to divide the groups for some visualisation you could make a group id variable. – Calum You Mar 15 '18 at 19:01
  • Thank you @CalumYou ! Yes, I am working with a bigger data that actually requires R to automatically identify the NA and put a line of "Not Applicable" underneath it instead of me manually doing it. And I will try looking into vector of indices. – Connie Mar 15 '18 at 19:04
  • Here's a [related question](https://stackoverflow.com/questions/11561856) for additional info, but your question is generalized to inserting multiple rows. And I don't think that Dec 2015 dplyr solution respects the insert order? – wibeasley Mar 15 '18 at 19:37

4 Answers4

4

Here's a base R method: split the data by the cumulative NA count, add in the new lines, recombine.

    df$group = as.character(df$group)
    split_df = split(df, cumsum(is.na(df$line)))
    split_df[-1] = lapply(split_df[-1], function(d) rbind(d[1, ], data.frame(line = NA, group = "Not applicable"), d[-1, ]))
    do.call(rbind, split_df)
    #     line          group
    # 0.1    1    1.0 Group A
    # 0.2    2    2.0 Group B
    # 1.3   NA    3.0 Group C
    # 1.1   NA Not applicable
    # 1.4    4    4.0 Group D
    # 1.5    5   5.0  Group E
    # 2.6   NA    6.0 Group F
    # 2.1   NA Not applicable
    # 2.7    7    7.0 Group G

Note that I converted group to character to make adding new values easy, and I put NAs in the line column - you can't just have blanks in a numeric vector, every element needs to either be a number or NA.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • 1
    Very valid explanations of limitation. – MKR Mar 15 '18 at 19:15
  • Thank you @Gregor, I think that is a good correction. I will edit my question to make sure there is no blank. And thanks for the solution! – Connie Mar 15 '18 at 19:31
3

An solution could be achieved using dplyr.

The approach is simple. Add a column representing row number. Take out rows having line value as NA. Replace group with Not Applicable and increment row number column by 0.5. Bind those two data frames.

library(dplyr)

df %>% mutate(rownum = row_number()) %>% 
  bind_rows(., filter(., is.na(line)) %>% 
  mutate(group = "Not Applicable", rownum = rownum+.5)) %>% 
  arrange(rownum) %>%
  select(-rownum)


# line          group
# 1    1    1.0 Group A
# 2    2    2.0 Group B
# 3   NA    3.0 Group C
# 4   NA Not Applicable
# 5    4    4.0 Group D
# 6    5   5.0  Group E
# 7   NA    6.0 Group F
# 8   NA Not Applicable
# 9    7    7.0 Group G

Limitations mentioned by @Gregor are valid. Numeric column can have value as NA but not blank.

MKR
  • 19,739
  • 4
  • 23
  • 33
3

Create a separate data.frame, ds_blank, and then use a union query to stack, then order it by a temp variable called index.

library(magrittr)
na_index <- which(is.na(df$line))

ds_blank <- tibble::tibble(
  index   = na_index + .5,
  line    = rep(NA_real_          , length(na_index)),
  group   = rep("Not Applicable"  , length(na_index))
)

df <- df %>% 
  tibble::rowid_to_column("index") %>% 
  dplyr::union(ds_blank) %>% 
  dplyr::arrange(index) %>% 
  dplyr::select(-index)

Result

> df
  line          group
1    1    1.0 Group A
2    2    2.0 Group B
3   NA    3.0 Group C
4   NA Not Applicable
5    4    4.0 Group D
6    5   5.0  Group E
7   NA    6.0 Group F
8   NA Not Applicable
9    7    7.0 Group G

I wanted to try tibble::add_row(), but that apparently doesn't allow multiple rows to be inserted if you specify a position.

Secondary approach

...using @Gregor's tip to use a for loop. Notice the na_index is reverse sorted now.

na_index <- sort(which(is.na(df$line)), decreasing = T)
for( i in na_index ) {
  df <- df %>% 
    tibble::add_row(
      line    = NA_integer_,
      group   = "Not Applicable",
      .after  = i
    )
}
wibeasley
  • 5,000
  • 3
  • 34
  • 62
  • 1
    You could still use `add_row` with a for loop. – Gregor Thomas Mar 15 '18 at 19:20
  • I was thinking about doing that but it seemed like it would have a lot of overhead, since you'd need to find the next NA row each loop instead of being able to start with a vector positions. – Calum You Mar 15 '18 at 19:21
  • @Gregor, thanks. I incorporated your suggestion. I was trying things like [`purrr::map_dfr()`](https://www.rdocumentation.org/packages/purrr/versions/0.2.4/topics/map) and failed to consider other approaches. – wibeasley Mar 15 '18 at 19:28
  • 1
    The going backwards is a clever solution to the row problem. `which` will be in order, so you don't need to sort you can just `rev(which(...))`. Another possible solution, since you know you are adding just one row each time, is to go forwards and use `na_index = which(is.na(df$line)); add_row_index = na_index + seq_along(na_index) - 1`. (I like yours better, just tossing out another option.) – Gregor Thomas Mar 15 '18 at 19:40
1

I feel like the tidyr::uncount function might also be what you're looking for. Just mark line==NA rows with a 2 in a new column, which we'll call n, and then uncount will duplicate each row based on the value in n. By mutating an ifelse that sets NA rows n == 2, we essentially only duplicate the NA rows directly below where they were, as opposed to at the bottom of the df and thereby needing to arrange. Finally, the dplyr::mutate_at just says to set group equal to "Not Applicable" if the line and the previous row's line (i.e. lag(line)) are both NA (which tells us to focus just on these duplicated rows). I'd like to think this method is also pretty scalable too!

library(tidyverse)
df %>%
    modify_if(is.factor, as.character) %>%
    mutate(n = ifelse(is.na(line), 2, 1)) %>%
    uncount(n) %>%
    mutate_at(vars(group), ~ifelse(is.na(line) & is.na(lag(line)), "Not Applicable", .))
#   line          group
# 1    1    1.0 Group A
# 2    2    2.0 Group B
# 3   NA    3.0 Group C
# 4   NA Not Applicable
# 5    4    4.0 Group D
# 6    5   5.0  Group E
# 7   NA    6.0 Group F
# 8   NA Not Applicable
# 9    7    7.0 Group G
jackbdolg
  • 105
  • 6