0

I have a genetic dataset that looks like this:

Pathway       Gene
Pathway1      Gene1
Pathway1      Gene2
Pathway2      Gene3
Pathway2      Gene1
Pathway3      Gene1
Pathway3      Gene4
Pathway3      Gene5

I am looking to transpose the Pathways rows to columns, whilst keeping track of which genes are present in which pathway with 1s and 0s. Creating an output like this:

Gene  Pathway1  Pathway2  Pathway3
Gene1    1           1         1
Gene2    1           0         0
Gene3    0           1         0
Gene4    0           0         1
Gene5    0           0         0

My real data is around 3000 rows long, and I'm not confident in R so I've been playing with using t() but I'm not sure where to start with coding to get the binary counts I'm looking for - any help or suggestions on functions to try would help.

Input example data:

structure(list(Pathway = c("Pathway1", "Pathway1", "Pathway2", 
"Pathway2", "Pathway3", "Pathway3", "Pathway3"), Gene = c("Gene1", 
"Gene2", "Gene3", "Gene1", "Gene1", "Gene4", "Gene5")), row.names = c(NA, 
-7L), class = c("data.table", "data.frame"))
LN3
  • 67
  • 1
  • 2
  • 10

3 Answers3

2

A quick and dirty tidyverse solution:

library(tidyr)

# edit thanks to @Ronak Shah
df %>%
pivot_wider(names_from = Pathway,
            values_from = Pathway,
            values_fn = length, values_fill = 0)

# A tibble: 5 x 4
  Gene  Pathway1 Pathway2 Pathway3
  <chr>    <dbl>    <dbl>    <dbl>
1 Gene1        1        1        1
2 Gene2        1        0        0
3 Gene3        0        1        0
4 Gene4        0        0        1
5 Gene5        0        0        1
s__
  • 9,270
  • 3
  • 27
  • 45
2

data.table approach

library(data.table)
dcast(setDT(mydata), Gene ~ Pathway, value.var = "Pathway", fun.aggregate = length)
#     Gene Pathway1 Pathway2 Pathway3
# 1: Gene1        1        1        1
# 2: Gene2        1        0        0
# 3: Gene3        0        1        0
# 4: Gene4        0        0        1
# 5: Gene5        0        0        1
Wimpel
  • 26,031
  • 1
  • 20
  • 37
2

You can use janitor::tabyl.

janitor::tabyl(df, Gene, Pathway)

#  Gene Pathway1 Pathway2 Pathway3
# Gene1        1        1        1
# Gene2        1        0        0
# Gene3        0        1        0
# Gene4        0        0        1
# Gene5        0        0        1
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213