0

I have a dataframe with categories and numerous corresponding values. Values here are non-numeric. The categories are in triplicate and need to be condensed to one row per category, while also condensing corresponding values.

Current format looks like this:

Df <- data.frame(A = c("category1", "category1", "category1", 
                       "category2", "category2", "category2",
                       "category3", "category3", "category3"),
                 B = c("NA", "NA", "a",
                       "NA", "b", "NA",
                       "NA", "NA", "a"),
                 C = c("b", "NA", "NA", 
                       "a", "NA", "NA",
                       "b", "NA", "NA"),
                 D = c("NA", "ab", "NA", 
                       "NA", "NA", "ab",
                       "NA", "NA", "NA"),
                 E = c("NA", "NA", "NA", 
                        "NA", "NA", "NA",
                        "NA", "c", "NA"))

I want data to look like this:

Category 3 is a bit of a nuisance as it forces NA values to be introduced. Last column would be empty if not for one "c" in category 3. Category 3 is the only one out of hundreds of categories that has a value in the 5th column and none in the fourth. If it is easier I can perhaps just drop this category for now and add it in manually later.

Originally I had both numeric and non-numeric values but with the following code I was able to condense almost as intended:

data1 <- data %>%
  group_by(category) %>%
  summarize_all(list(~toString(unique(.))))

However this did not work for the non-numeric columns above, leaving only one column value per category. The answer to the question needs to retain the category column to allow me to merge back with numeric column table.

noob123
  • 3
  • 2
  • 2
    We cannot read data into R from images. Please [make this question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including a small representative dataset in a plain text format - for example the output from `dput(data)`, if that is not too large. – neilfws Nov 24 '22 at 23:16

1 Answers1

0

First we transform "NA" to a true NA, then we collapse each group by the first occurrence of a non-NA for each column.

library(tidyverse)

Df |>
  mutate(across(everything(), \(x) ifelse(x=="NA", NA, x))) |>
  group_by(A) |>
  summarise(across(everything(), \(x) first(x[!is.na(x)])))
#> # A tibble: 3 x 5
#>   A         B     C     D     E    
#>   <chr>     <chr> <chr> <chr> <chr>
#> 1 category1 a     b     ab    <NA> 
#> 2 category2 b     a     ab    <NA> 
#> 3 category3 a     b     <NA>  c
AndS.
  • 7,748
  • 2
  • 12
  • 17
  • Hey, thanks for your answer. Are you using any packages other than tidyverse here? Ran the above on example dataset and got a few errors: > Df |> Error: unexpected '>' in "Df |>" > mutate(across(everything(), (x) ifelse(x=="NA", NA, x))) |> Error: unexpected input in " mutate(across(everything(), \" > group_by(A) |> Error: unexpected '>' in " group_by(A) |>" > summarise(across(everything(), (x) first(x[!is.na(x)]))) Error: unexpected input in " summarise(across(everything(), \" – noob123 Nov 25 '22 at 11:06
  • I don't know the package that @AndS. used but you get a working tidyverse code (actually, only the dyplr package is needed) if you replace `|>` by `%>%`, replace the `\(x)` by `~` and put a dot before the remaining x, i.e. `.x` instead of `x`. – benimwolfspelz Nov 25 '22 at 14:02
  • I am only using the package shown. You just need to update your version of R. `|>` and `\(x)` are both from base R when you update. – AndS. Nov 25 '22 at 15:39