I have a dataframe that looks like this.

 > head(zeisel)
  gene_name ClusterName       p
1     GNAI3         ABC 0.29914
2     GNAI3        ACBG 0.33417
3     GNAI3        ACMB 0.21984
4     GNAI3       ACNT1 0.14727
5     GNAI3       ACNT2 0.22205
6     GNAI3        ACOB 0.16913

I would like to convert it into this:

[Image: desired output table]

Is there a way to do this? I tried setting the names first, but this would mean iteratively rbinding every row.

For example:

# get the column names for the new data frame
cells <- as.data.frame(table(zeisel$ClusterName))

#now create an empty dataframe. 
unmelted_df <- setNames(data.frame(matrix(ncol = length(cells$Var1), nrow = 0)), as.character(cells$Var1))

Is there a way to do this in one step for a massive dataframe?

Workhorse

1 Answer

One option is to create a sequence column and then spread the data into 'wide' format:

library(tidyverse)
zeisel %>%
    mutate(rn = 1) %>%
    spread(ClusterName, p)
#  gene_name rn     ABC    ACBG    ACMB   ACNT1   ACNT2    ACOB
#1     GNAI3  1 0.29914 0.33417 0.21984 0.14727 0.22205 0.16913
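
The rn = 1 column above is a constant, so it only helps when each gene_name/ClusterName pair occurs once. If a pair can repeat in the full data, spread() stops with an error about duplicate keys. A minimal sketch of a per-pair row counter follows; the data frame dup and its values are made up purely for illustration and are not from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the pair (GNAI3, ABC) appears twice
dup <- data.frame(
  gene_name   = c("GNAI3", "GNAI3", "GNAI3"),
  ClusterName = c("ABC", "ABC", "ACBG"),
  p           = c(0.1, 0.2, 0.3)
)

wide <- dup %>%
  group_by(gene_name, ClusterName) %>%
  mutate(rn = row_number()) %>%   # counts 1, 2, ... within each gene/cluster pair
  ungroup() %>%
  spread(ClusterName, p)          # one output row per (gene_name, rn)
```

Repeated measurements then land on separate rows, and clusters with fewer repeats are filled with NA.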

In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which can be used instead:

zeisel %>% 
    pivot_wider(names_from = 'ClusterName', values_from = 'p')
# A tibble: 1 x 7
#  gene_name   ABC  ACBG  ACMB ACNT1 ACNT2  ACOB
#  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 GNAI3     0.299 0.334 0.220 0.147 0.222 0.169

Or using dcast from data.table

library(data.table)
dcast(setDT(zeisel), gene_name ~ ClusterName, value.var = 'p')
#   gene_name     ABC    ACBG    ACMB   ACNT1   ACNT2    ACOB  
#1:     GNAI3 0.29914 0.33417 0.21984 0.14727 0.22205 0.16913
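
If adding a package is not an option, base R's reshape() may also work. This is only a sketch using the same six sample rows, and it has not been checked against the full dataset. Like the examples above, it assumes each gene_name/ClusterName pair occurs once.

```r
# Sample rows copied from the question
zeisel <- data.frame(
  gene_name   = rep("GNAI3", 6),
  ClusterName = c("ABC", "ACBG", "ACMB", "ACNT1", "ACNT2", "ACOB"),
  p           = c(0.29914, 0.33417, 0.21984, 0.14727, 0.22205, 0.16913)
)

# One row per gene_name, one column per ClusterName
wide <- reshape(zeisel, idvar = "gene_name", timevar = "ClusterName",
                direction = "wide")

# reshape() names the new columns "p.ABC", "p.ACBG", ...; drop the "p." prefix
names(wide) <- sub("^p\\.", "", names(wide))
```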

data

zeisel <- structure(list(gene_name = c("GNAI3", "GNAI3", "GNAI3", "GNAI3", 
"GNAI3", "GNAI3"), ClusterName = c("ABC", "ACBG", "ACMB", "ACNT1", 
"ACNT2", "ACOB"), p = c(0.29914, 0.33417, 0.21984, 0.14727, 0.22205, 
0.16913)), class = "data.frame", row.names = c(NA, -6L))
akrun
  • Why use `mutate(rn=1)`? – Workhorse Sep 22 '19 at 01:54
  • @Workhorse It is not needed, but I thought first that you may need a unique identifier to state that it is the first row (as you provided only the head of the dataset) – akrun Sep 22 '19 at 01:55