0

I have a large dataframe with 2 columns: The one has a discrete number of values that appear repetitively, while the other only has unique values. Essentially multiple values in column 2 will correspond to one value in column 1.

As the data has currently been acquired, it lists each unique variable in column 2 as a row, which means there are repeated values in column 1.

I want to transform (essentially flip) the data so that I can see which column 2 values fall under each unique value in column 1.

For example, the df is:

Contig Gene
C20 G1
C10 G2
C40 G3
C20 G4
C40 G5
C30 G6

And I want:

Contig Gene
C10 G2
C20 G1, G4
C30 G6
C40 G3, G5

If I only get the number of unique values that will also be okay:

Contig Gene(s)
C10 1
C20 2
C30 1
C40 2

I hope it makes sense. I've been struggling to find the right keywords to explain this issue and really don't know where to begin. Although I get the feeling I should maybe turn the data into a list.

sarasa
  • 9
  • 2

3 Answers3

0

To get a dataframe like your first desired table, you may use

library(tidyr)
chop(df, Gene)

while to get the last one you may use:

library(dplyr)
df %>% group_by(Config) %>% summarise(n = n())
iago
  • 2,990
  • 4
  • 21
  • 27
0

You can also use dplyr. First group by Contig and then summarise.

library(dplyr)

# Small test data
test_data <- data.frame(Contig = c("C10", "C10", "C20", "C20","C20"), 
                        Gene = c("G1", "G3", "G4", "G5", "G6"))


clean_data <- test_data %>% 
  group_by(Contig) %>% 
  summarise(number_of_genes = length(unique(Gene)), 
            specific_genes = paste0(unique(Gene), collapse = ", "))


Tob
  • 245
  • 1
  • 9
0
library(plyr)
M = data.frame(
  Contig = c("C20", "C10", "C40", "C20", "C40", "C30"),
  Gene = paste0("G",1:6)
)
f = function(m) {
  data.frame(
    Contig = unique(m$Contig),
    Gene = paste(m$Gene, collapse=",")
  )
}
ddply(M, .(Contig), f)
Niels Holst
  • 586
  • 4
  • 9