0

I'm working with data transformation in R and one that I haven't been able is to filter lines with similar values selecting the one with higher "Expression value" and later on split data in columns by Expression level and aggregate them. Since I know the explanation won't give the Nobel prize, below is the original data, the result and what I've achieved so far.

ORIGINAL DATA

df <- read.table(text = 
           "Tissue          Species   Expression  
1           dentritic       Human     moderate
2           liver           Human     high
3           liver           Human     moderate
4           liver           Human     moderate
5           liver           Human     high
6           liver           Monkey    high
7           liver           Monkey    moderate
8           liver           Dog       high
9           liver           Dog       high
10          liver           Minipig   moderate
11          liver           Rat       low
12          liver           Rat       cutoff
13          liver           Monkey    moderate
14          lung            Monkey    high
15          quadriceps     Monkey     cutoff"  , header = TRUE)

The result that I need to achieve is, in case of both values of Tissue and Species repeated, select only the highest value on Expression.

    Tissue           High_Expression        Moderate_Expression    Low_Expression    cutoff

1   dentritic                               Human
2   liver            Human, Monkey,Dog      Minipig                Rat
3   lung             Monkey
4   quadriceps                                                                       Monkey                

What I have so far:

df$Expression <- factor(df$Expression, levels = c("cutoff", "low", "moderate", "high"), ordered = TRUE)
df$Species <- as.character(df$Species)

df <- df %>% 
  mutate(High_expressed = ifelse(Expression == "high", Species, "")) %>% 
  mutate(moderate_expressed = ifelse(Expression == "moderate", Species, "")) %>% 
  mutate(low_expressed = ifelse(Expression == "low", Species, "")) %>% 
  mutate(below_cutoff_expressed = ifelse(Expression == "cutoff", Species, "")) %>% 
  select(-c("Expression", "Species"))

df <- aggregate(. ~ groupTissue, data = df, paste, collapse = ",")


That gives:

    Tissue           High_Expression        Moderate_Expression      Low_Expression    cutoff

1   dentritic                               Human
2   liver            Human,,,Human,         ,Human,Human,,,           ,,,,,,,,,Rat,,    ,,,,,,,,,Rat,
                     Monkey,,Dog,Dog,,,,    Monkey,,,Minipig,,,Monkey 
3   lung             Monkey
4   quadriceps                                                                          Monkey     

Thanks in advance

CarlosSR
  • 93
  • 5

1 Answers1

0

You can first arrange the data based on their Expression value, select only the higher value in each Tissue and Species and get data in wide format.

library(dplyr)

df %>%
  arrange(match(Expression, c('high', 'moderate', 'low', 'cutoff'))) %>%
  distinct(Tissue, Species, .keep_all = TRUE) %>%
  pivot_wider(names_from = Expression,values_from = Species,values_fn = toString) %>%
  arrange(Tissue)

#  Tissue     high               moderate low   cutoff
#  <chr>      <chr>              <chr>    <chr> <chr> 
#1 dentritic  NA                 Human    NA    NA    
#2 liver      Human, Monkey, Dog Minipig  Rat   NA    
#3 lung       Monkey             NA       NA    NA    
#4 quadriceps NA                 NA       NA    Monkey
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213