I am new to this forum, so apologies for any mistakes. I have a data frame (a classification of substances into classes) in the following format:

  A                 B                        C       D
1 Organic compounds Benzenoids               Benzene NA
2 Organic compounds Benzenoids               Benzene NA
3 Organic compounds Organic oxygen compounds NA      NA
4 NA                NA                       NA      NA
5 Organic compounds Benzenoids               NA      NA

In the end I need a data frame with two columns. The result should look something like this:

class                                       count
Organic compounds; Benzenoids; Benzene      2
Organic compounds; Organic oxygen compounds 1
Organic compounds; Benzenoids               1

What should my first step be? I tried creating a new column by pasting the contents of all the other columns together, like this:

df$class <- paste(df$A, df$B, df$C, df$D, sep = "; ")

But the result is:

class
Organic compounds; Benzenoids; Benzene; NA
Organic compounds; Benzenoids; Benzene; NA
Organic compounds; Organic oxygen compounds; NA; NA
NA; NA; NA; NA
Organic compounds; Benzenoids; NA; NA

What would be a reasonable approach to this problem to get the final result?

Thanks a lot!

Flow91

2 Answers

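    library(plyr)  # ddply() comes from plyr, not dplyr

    # Paste the columns together, then strip the "; NA" pieces left by missing values
    df$class <- gsub('; NA', '', paste(df$A, df$B, df$C, df$D, sep = "; "))
    # Rows that were entirely NA collapse to the string 'NA'; drop them
    df <- df[df$class != 'NA', ]

    # Count the rows per class
    df <- ddply(df, .(class), summarize, count = length(class))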
Ashish Baid

Will this work?

library(dplyr)
library(tidyr)    # replace_na()
library(stringr)  # str_remove(), str_detect()
df %>% mutate(across(everything(), ~ replace_na(., ''))) %>%
   mutate(class = paste(A, B, C, D, sep = ';'), class = str_remove(class, ';+$')) %>%
   count(class, name = 'count') %>% filter(!str_detect(class, '^$'))
# A tibble: 3 x 2
  class                                      count
  <chr>                                      <int>
1 Organic compounds;Benzenoids                   1
2 Organic compounds;Benzenoids;Benzene           2
3 Organic compounds;Organic oxygen compounds     1
Karthik S
    Thank you for your help! The solution from @Ashish Baid is a bit better for me to understand. – Flow91 Apr 09 '21 at 13:08
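
For reference, a minimal base-R sketch of the same idea that needs no extra packages. It is not taken from either answer: it assumes columns A to D are character vectors (not factors), and it drops the NA entries of each row before pasting, so no "; NA" text has to be removed afterwards. The example data frame is rebuilt from the question, and the names `cls` and `result` are only illustrative.

```r
# Example data mirroring the question (columns assumed to be character).
df <- data.frame(
  A = c("Organic compounds", "Organic compounds", "Organic compounds", NA, "Organic compounds"),
  B = c("Benzenoids", "Benzenoids", "Organic oxygen compounds", NA, "Benzenoids"),
  C = c("Benzene", "Benzene", NA, NA, NA),
  D = NA_character_,
  stringsAsFactors = FALSE
)

# Join the non-NA entries of each row; rows that are entirely NA become "".
cls <- apply(df[, c("A", "B", "C", "D")], 1,
             function(row) paste(row[!is.na(row)], collapse = "; "))

# Drop the empty labels and tabulate the remaining ones.
cls <- cls[cls != ""]
result <- as.data.frame(table(class = cls), responseName = "count",
                        stringsAsFactors = FALSE)
result
```

Because the missing values are filtered out before `paste()`, a class name that happens to contain the text "; NA" would not be altered, which the `gsub()` approach cannot guarantee.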