R unbalanced data frame data cleaning

Question

My data frame looks like this:

c1 c2 c3 c4

T1 NA NA NA
NA a  NA NA
NA NA B  NA
NA NA NA b
T2 NA NA NA
NA NA C  NA
NA NA NA c

I want to have it like

c1 c2 c3 c4

T1 a  B b
T2 NA C c

I tried with something like the following, which I saw from another post, but I don't think it applies to my problem, any help please?

stri_list2matrix(lapply(., function(x) x[x!='NA']), fill='', byrow=FALSE)

akrun · Answer 1 · 2020-01-31T21:18:04.247

2

Here is one option with lapply from base R. After looping over the columns of the dataset, remove the NA elements with is.na (returns a logical vector for subsetting). Then, pad the list output with NA at the end based on the maximum length of list elements and cbind

lst1 <- lapply(df1, function(x) x[!is.na(x)])
do.call(cbind, lapply(lst1, `length<-`, max(lengths(lst1))))
#    c1   c2  c3  c4 
#[1,] "T1" "a" "B" "b"
#[2,] "T2" NA  "C" "c"

It can also done with cbind.fill from rowr and map

library(purrr)
library(rowr)
map(df1, ~ .x[!is.na(.x)]) %>%
    reduce(cbind.fill, fill = NA) %>%
    set_names(names(df1))
#  c1   c2 c3 c4
#1 T1    a  B  b
#2 T2 <NA>  C  c

Or by reshaping into 'long' format while dropping the rows and then reshape it back to 'wide' format

library(tidyr)
df1 %>% 
     pivot_longer(everything(), values_drop_na = TRUE) %>% 
     group_by(name) %>% 
     mutate(rn = row_number()) %>%
     pivot_wider(names_from = name, values_from = value) %>%
     select(-rn)
# A tibble: 2 x 4
#  c1    c2    c3    c4   
#  <chr> <chr> <chr> <chr>
#1 T1    a     B     b    
#2 T2    <NA>  C     c

Or with melt/dcast

library(data.table)
dcast(melt(setDT(df1)[, rn := seq_len(.N)], id.var = 'rn',
        na.rm = TRUE), rowid(variable) ~ variable, value.var = 'value')

data

df1 <- structure(list(c1 = c("T1", NA, NA, NA, "T2", NA, NA), c2 = c(NA, 
"a", NA, NA, NA, NA, NA), c3 = c(NA, NA, "B", NA, NA, "C", NA
), c4 = c(NA, NA, NA, "b", NA, NA, "c")), class = "data.frame",
row.names = c(NA, 
-7L))

edited Jan 31 '20 at 21:18

answered Jan 31 '20 at 20:39

akrun

874,273
37
540
662

1

I like seeing all the approaches you generate. A question on that first way (which uses do.call & cbind). Does it work if T1-c1 is `NA`, while T2-c1 is "a". Does the "a" get mis-assigned to "T1"? – wibeasley Jan 31 '20 at 21:37
1

@wibeasley thank you. If the column is all `NA` as well, it should work because we are assigning to length `df1$c1 <- NA;lst1 <- lapply(df1, function(x) x[!is.na(x)]);do.call(cbind, lapply(lst1, `length<-`, max(lengths(lst1))))` I guess that is what you meant right? – akrun Jan 31 '20 at 21:39
1

@wibeasley Or use the same example you showed `df1$c1[c(1, 5)] <- c(NA, 'a')` then I get the first row as output `[1,] "a" "a" "B" "b"` and second row `[2,] NA NA "C" "c"` Not clear how the positions should be matched based on the OP's example – akrun Jan 31 '20 at 21:42
1

Yea, there is some ambiguity. I interpreted it like @Hugh, where the values of `c1` need to be carried forward. For the scenario I'm describing, slide the value of "a" down to the bottom cell of `c2`. "a" is still attributed to "T1", instead of "T2". `df1 <- structure(list(c1 = c("T1", NA, NA, NA, "T2", NA, NA), c2 = c(NA, NA, NA, NA, NA, NA, "a")), class = "data.frame", row.names = c(NA, -7L))` – wibeasley Jan 31 '20 at 21:45
1

@wibeasley if that is the case, a group by as in your solution makes perfect sense. I was also thinking in that direction, but the `lapply` in the OP's code was a distraction – akrun Jan 31 '20 at 21:48

score 2 · Accepted Answer · answered Jan 31 '20 at 21:14

2

I would use zoo::na.locf to fill the columns with the first non-NA value then just take the last row in each group determined by the first column.

library(zoo)
library(dplyr)

df1 <- 
  data.frame(c1 = c("T1", NA, NA, NA, "T2", NA, NA),
             c2 = c(NA, "a", NA, NA, NA, NA, NA), 
             c3 = c(NA, NA, "B", NA, NA, "C", NA), 
             c4 = c(NA, NA, NA, "b", NA, NA, "c"),
             stringsAsFactors = FALSE)

df1 %>%
  mutate(c1 = na.locf(c1)) %>% 
  group_by(c1) %>%
  summarise_all(function(x) last(na.locf(x, na.rm = TRUE)))
#> # A tibble: 2 x 4
#>   c1    c2    c3    c4   
#>   <chr> <chr> <chr> <chr>
#> 1 T1    a     B     b    
#> 2 T2    <NA>  C     c

^{Created on 2020-02-01 by the reprex package (v0.3.0)}

answered Jan 31 '20 at 21:14

Hugh

15,521
12
57
100

1

Try running in a fresh vanilla session. I can't reproduce your error (which looks like the result of a typo) – Hugh Jan 31 '20 at 21:31
1

I suspect some functions were masked. Great solution, +1. – tmfmnk Jan 31 '20 at 21:40
I guess the `df1 %>% group_by(c1 = na.locf0(c1)) %>% summarise_all(~ last(na.locf0(.x)))` may be more compact – akrun Jan 31 '20 at 21:53
That's cool. I didn't know you could assign variables in `group_by()`. – wibeasley Jan 31 '20 at 21:55

score 1 · Answer 3 · answered Jan 31 '20 at 20:55

1

Another base R solution (similar to @akrun)

dfout <- data.frame(lapply(l<-Map(na.omit,df),`length<-`,max(lengths(l))))

such that

> dfout
  c1   c2 c3 c4
1 T1    a  B  b
2 T2 <NA>  C  c

answered Jan 31 '20 at 20:55

ThomasIsCoding

96,636
9
24
81

depending on the OP's parameters, this might undesirably slide values into the wrong T1/T2 classification. (@akrun explained it better in the comments following his post.) – wibeasley Jan 31 '20 at 21:53

tmfmnk · Answer 4 · 2020-01-31T21:15:36.950

1

One approach involving dplyr and purrr could be:

map(.x = split.default(df, names(df)), ~ .x %>% filter(!is.na(.)) %>%
     rowid_to_column()) %>%
 reduce(left_join, by = "rowid") %>%
 select(-rowid)

  c1   c2 c3 c4
1 T1    a  B  b
2 T2 <NA>  C  c

edited Jan 31 '20 at 21:15

answered Jan 31 '20 at 21:00

tmfmnk

38,881
4
47
67

See the comment's after @akrun's post. This might classify like akrun's first approach, which may or may not be what the OP intended (it's tough to tell). – wibeasley Jan 31 '20 at 22:08

wibeasley · Answer 5 · 2020-01-31T22:01:49.567

Another approach is to

fill in the implied c1 values with tidyr::fill(),
group by c1, and
find the first nonmissing value within that group.

# Option 1
df1 %>% 
  tidyr::fill(c1) %>% 
  dplyr::group_by(c1) %>% 
  dplyr::summarize(
    c2  = first_nonmissing(c2),
    c3  = first_nonmissing(c3),
    c4  = first_nonmissing(c4),
  ) %>% 
  dplyr::ungroup()

# Option 2
df1 %>% 
  tidyr::fill(c1) %>% 
  dplyr::group_by(c1) %>% 
  dplyr::summarize_at(
    .vars = c("c2", "c3", "c4"),
    .funs = first_nonmissing
  ) %>% 
  dplyr::ungroup()

# Option 3 -suggested by @tmfmnk.  Doesn't need `first_nonmissing()`
df1 %>% 
  tidyr::fill(c1) %>% 
  dplyr::group_by(c1) %>% 
  dplyr::summarize_at(
    .vars = c("c2", "c3", "c4"),
    .funs = ~ last(na.omit(.))
  ) %>% 
  dplyr::ungroup()

Result:

# A tibble: 2 x 4
  c1    c2    c3    c4   
  <chr> <chr> <chr> <chr>
1 T1    a     B     b    
2 T2    NA    C     c

The helper function is adapted from Dplyr : how to find the first-non missing string by groups?. This approach has worked well for me for a while, and here's a revision that accommodate a few more scenarios, which you might encounter if your real data is messier than the simplified one used in the example.

first_nonmissing <- function(x) {
  x[which(!is.na(x))[1]]
}

The df1 definition is stolen from @akrun's post.

You can perhaps make it a little bit more compact by using `df1 %>% fill(c1) %>% group_by(c1) %>% summarise_all(~ last(na.omit(.)))`. — tmfmnk, Jan 31 '20 at 21:57
@tmfmnk, I like how the helper function no longer needs to be defined. I added it above. I kept the `summarize_at()` (instead of `summarize_all()`) in case the OP's actual dataset has new columns that aren't collapsed in this way. But maybe those columns should be dropped before getting to the `fill()` step. — wibeasley, Jan 31 '20 at 22:04

R unbalanced data frame data cleaning

5 Answers5

data