1

My data frame looks like this:

c1 c2 c3 c4

T1 NA NA NA
NA a  NA NA
NA NA B  NA
NA NA NA b
T2 NA NA NA
NA NA C  NA
NA NA NA c

I want to have it like

c1 c2 c3 c4

T1 a  B b
T2 NA C c

I tried with something like the following, which I saw from another post, but I don't think it applies to my problem, any help please?

stri_list2matrix(lapply(., function(x) x[x!='NA']), fill='', byrow=FALSE)
camille
  • 16,432
  • 18
  • 38
  • 60
alexj
  • 45
  • 7

5 Answers5

2

Here is one option with lapply from base R. After looping over the columns of the dataset, remove the NA elements with is.na (returns a logical vector for subsetting). Then, pad the list output with NA at the end based on the maximum length of list elements and cbind

lst1 <- lapply(df1, function(x) x[!is.na(x)])
do.call(cbind, lapply(lst1, `length<-`, max(lengths(lst1))))
#    c1   c2  c3  c4 
#[1,] "T1" "a" "B" "b"
#[2,] "T2" NA  "C" "c"

It can also done with cbind.fill from rowr and map

library(purrr)
library(rowr)
map(df1, ~ .x[!is.na(.x)]) %>%
    reduce(cbind.fill, fill = NA) %>%
    set_names(names(df1))
#  c1   c2 c3 c4
#1 T1    a  B  b
#2 T2 <NA>  C  c

Or by reshaping into 'long' format while dropping the rows and then reshape it back to 'wide' format

library(tidyr)
df1 %>% 
     pivot_longer(everything(), values_drop_na = TRUE) %>% 
     group_by(name) %>% 
     mutate(rn = row_number()) %>%
     pivot_wider(names_from = name, values_from = value) %>%
     select(-rn)
# A tibble: 2 x 4
#  c1    c2    c3    c4   
#  <chr> <chr> <chr> <chr>
#1 T1    a     B     b    
#2 T2    <NA>  C     c    

Or with melt/dcast

library(data.table)
dcast(melt(setDT(df1)[, rn := seq_len(.N)], id.var = 'rn',
        na.rm = TRUE), rowid(variable) ~ variable, value.var = 'value')

data

df1 <- structure(list(c1 = c("T1", NA, NA, NA, "T2", NA, NA), c2 = c(NA, 
"a", NA, NA, NA, NA, NA), c3 = c(NA, NA, "B", NA, NA, "C", NA
), c4 = c(NA, NA, NA, "b", NA, NA, "c")), class = "data.frame",
row.names = c(NA, 
-7L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    I like seeing all the approaches you generate. A question on that first way (which uses do.call & cbind). Does it work if T1-c1 is `NA`, while T2-c1 is "a". Does the "a" get mis-assigned to "T1"? – wibeasley Jan 31 '20 at 21:37
  • 1
    @wibeasley thank you. If the column is all `NA` as well, it should work because we are assigning to length `df1$c1 <- NA;lst1 <- lapply(df1, function(x) x[!is.na(x)]);do.call(cbind, lapply(lst1, `length<-`, max(lengths(lst1))))` I guess that is what you meant right? – akrun Jan 31 '20 at 21:39
  • 1
    @wibeasley Or use the same example you showed `df1$c1[c(1, 5)] <- c(NA, 'a')` then I get the first row as output `[1,] "a" "a" "B" "b"` and second row `[2,] NA NA "C" "c"` Not clear how the positions should be matched based on the OP's example – akrun Jan 31 '20 at 21:42
  • 1
    Yea, there is some ambiguity. I interpreted it like @Hugh, where the values of `c1` need to be carried forward. For the scenario I'm describing, slide the value of "a" down to the bottom cell of `c2`. "a" is still attributed to "T1", instead of "T2". `df1 <- structure(list(c1 = c("T1", NA, NA, NA, "T2", NA, NA), c2 = c(NA, NA, NA, NA, NA, NA, "a")), class = "data.frame", row.names = c(NA, -7L))` – wibeasley Jan 31 '20 at 21:45
  • 1
    @wibeasley if that is the case, a group by as in your solution makes perfect sense. I was also thinking in that direction, but the `lapply` in the OP's code was a distraction – akrun Jan 31 '20 at 21:48
2

I would use zoo::na.locf to fill the columns with the first non-NA value then just take the last row in each group determined by the first column.

library(zoo)
library(dplyr)

df1 <- 
  data.frame(c1 = c("T1", NA, NA, NA, "T2", NA, NA),
             c2 = c(NA, "a", NA, NA, NA, NA, NA), 
             c3 = c(NA, NA, "B", NA, NA, "C", NA), 
             c4 = c(NA, NA, NA, "b", NA, NA, "c"),
             stringsAsFactors = FALSE)

df1 %>%
  mutate(c1 = na.locf(c1)) %>% 
  group_by(c1) %>%
  summarise_all(function(x) last(na.locf(x, na.rm = TRUE)))
#> # A tibble: 2 x 4
#>   c1    c2    c3    c4   
#>   <chr> <chr> <chr> <chr>
#> 1 T1    a     B     b    
#> 2 T2    <NA>  C     c

Created on 2020-02-01 by the reprex package (v0.3.0)

Hugh
  • 15,521
  • 12
  • 57
  • 100
1

Another base R solution (similar to @akrun)

dfout <- data.frame(lapply(l<-Map(na.omit,df),`length<-`,max(lengths(l))))

such that

> dfout
  c1   c2 c3 c4
1 T1    a  B  b
2 T2 <NA>  C  c
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • depending on the OP's parameters, this might undesirably slide values into the wrong T1/T2 classification. (@akrun explained it better in the comments following his post.) – wibeasley Jan 31 '20 at 21:53
1

One approach involving dplyr and purrr could be:

map(.x = split.default(df, names(df)), ~ .x %>% filter(!is.na(.)) %>%
     rowid_to_column()) %>%
 reduce(left_join, by = "rowid") %>%
 select(-rowid)

  c1   c2 c3 c4
1 T1    a  B  b
2 T2 <NA>  C  c
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
  • See the comment's after @akrun's post. This might classify like akrun's first approach, which may or may not be what the OP intended (it's tough to tell). – wibeasley Jan 31 '20 at 22:08
1

Another approach is to

  1. fill in the implied c1 values with tidyr::fill(),
  2. group by c1, and
  3. find the first nonmissing value within that group.
# Option 1
df1 %>% 
  tidyr::fill(c1) %>% 
  dplyr::group_by(c1) %>% 
  dplyr::summarize(
    c2  = first_nonmissing(c2),
    c3  = first_nonmissing(c3),
    c4  = first_nonmissing(c4),
  ) %>% 
  dplyr::ungroup()

# Option 2
df1 %>% 
  tidyr::fill(c1) %>% 
  dplyr::group_by(c1) %>% 
  dplyr::summarize_at(
    .vars = c("c2", "c3", "c4"),
    .funs = first_nonmissing
  ) %>% 
  dplyr::ungroup()

# Option 3 -suggested by @tmfmnk.  Doesn't need `first_nonmissing()`
df1 %>% 
  tidyr::fill(c1) %>% 
  dplyr::group_by(c1) %>% 
  dplyr::summarize_at(
    .vars = c("c2", "c3", "c4"),
    .funs = ~ last(na.omit(.))
  ) %>% 
  dplyr::ungroup()

Result:

# A tibble: 2 x 4
  c1    c2    c3    c4   
  <chr> <chr> <chr> <chr>
1 T1    a     B     b    
2 T2    NA    C     c    

The helper function is adapted from Dplyr : how to find the first-non missing string by groups?. This approach has worked well for me for a while, and here's a revision that accommodate a few more scenarios, which you might encounter if your real data is messier than the simplified one used in the example.

first_nonmissing <- function(x) {
  x[which(!is.na(x))[1]]
}

The df1 definition is stolen from @akrun's post.

wibeasley
  • 5,000
  • 3
  • 34
  • 62
  • 1
    You can perhaps make it a little bit more compact by using `df1 %>% fill(c1) %>% group_by(c1) %>% summarise_all(~ last(na.omit(.)))`. – tmfmnk Jan 31 '20 at 21:57
  • @tmfmnk, I like how the helper function no longer needs to be defined. I added it above. I kept the `summarize_at()` (instead of `summarize_all()`) in case the OP's actual dataset has new columns that aren't collapsed in this way. But maybe those columns should be dropped before getting to the `fill()` step. – wibeasley Jan 31 '20 at 22:04