Text file to list in R

Question

I have a large text file with a variable number of fields in each row. The first entry in each row corresponds to a biological pathway, and each subsequent entry corresponds to a gene in that pathway. The first few lines might look like this:

path1   gene1 gene2
path2   gene3 gene4 gene5 gene6
path3   gene7 gene8 gene9

I need to read this file into R as a list, with each element being a character vector, and the name of each element in the list being the first element on the line, for example:

> pathways <- list(
+     path1=c("gene1","gene2"), 
+     path2=c("gene3","gene4","gene5","gene6"),
+     path3=c("gene7","gene8","gene9")
+ )
> 
> str(pathways)
List of 3
 $ path1: chr [1:2] "gene1" "gene2"
 $ path2: chr [1:4] "gene3" "gene4" "gene5" "gene6"
 $ path3: chr [1:3] "gene7" "gene8" "gene9"
> 
> str(pathways$path1)
 chr [1:2] "gene1" "gene2"
> 
> print(pathways)
$path1
[1] "gene1" "gene2"

$path2
[1] "gene3" "gene4" "gene5" "gene6"

$path3
[1] "gene7" "gene8" "gene9"

...but I need to do this automatically for thousands of lines. I saw a similar question posted here previously, but I couldn't figure out how to do this from that thread.

Thanks in advance.

See this post for inspiration, *might* help http://stackoverflow.com/questions/6592850/what-software-package-can-you-suggest-for-a-programmer-who-rarely-works-with-stat/6593608#6593608 — Fredrik Pihl, Jul 06 '11 at 21:04
Thank you all for the varied and elegant solutions. 4 valid answers in less than an hour is why I use SO. Much obliged. — Stephen Turner, Jul 06 '11 at 22:03

Joshua Ulrich · Accepted Answer · 2011-07-06T21:25:25.197

score 48

Here's one way to do it:

# Read in the data
x <- scan("data.txt", what="", sep="\n")
# Separate elements by one or more whitespace characters
y <- strsplit(x, "[[:space:]]+")
# Extract the first vector element and set it as the list element name
names(y) <- sapply(y, `[[`, 1)
#names(y) <- sapply(y, function(x) x[[1]]) # same as above
# Remove the first vector element from each list element
y <- lapply(y, `[`, -1)
#y <- lapply(y, function(x) x[-1]) # same as above

edited Jul 06 '11 at 21:25

answered Jul 06 '11 at 21:18

Joshua Ulrich


Thanks! I don't completely understand what `[[` and `[` are doing, but the explicit function definitions make perfect sense. – Stephen Turner Jul 06 '11 at 22:01
It's just a way to explicitly call the subsetting functions. Like `+`, `%*%`, etc., they have to be quoted. They're .Primitive so they match arguments based on position only. – Joshua Ulrich Jul 06 '11 at 22:11
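
To make the comment above concrete, here is a small sketch showing that the backquoted `[[` and `[` are ordinary functions, so handing them to sapply() or lapply() does the same thing as the explicit anonymous functions. The list `y` is made-up example data shaped like the output of strsplit(); it is not read from the asker's file.

```r
# Made-up example data, shaped like the result of strsplit() on two lines
y <- list(c("path1", "gene1", "gene2"),
          c("path2", "gene3", "gene4", "gene5"))

# `[[`(v, 1) is the function-call form of v[[1]]
first_direct <- `[[`(y[[1]], 1)

# sapply() passes each list element as the first argument and 1 as the second
nms_short <- sapply(y, `[[`, 1)
nms_long  <- sapply(y, function(v) v[[1]])

# Likewise `[`(v, -1) is v[-1]: everything except the first element
rest_short <- lapply(y, `[`, -1)
rest_long  <- lapply(y, function(v) v[-1])

# Both spellings should give the same results
identical(nms_short, nms_long)
identical(rest_short, rest_long)
```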

score 8 · Answer 2 · answered Jul 06 '11 at 21:19

One solution is to read the data in via read.table(), but use the fill = TRUE argument to pad the rows with fewer "entries", convert the resulting data frame to a list and then clean up the "empty" elements.

First, read your snippet of data in:

con <- textConnection("path1   gene1 gene2
path2   gene3 gene4 gene5 gene6
path3   gene7 gene8 gene9
")
dat <- read.table(con, fill = TRUE, stringsAsFactors = FALSE)
close(con)

Next, save the first column to use as the list names later, then drop it from the data frame:

nams <- dat[, 1]
dat <- dat[, -1]

Convert the data frame to a list. Here I just split the data frame on the indices 1,2,...,n where n is the number of rows:

ldat <- split(dat, seq_len(nrow(dat)))

Clean up the empty cells:

ldat <- lapply(ldat, function(x) x[x != ""])

Finally, apply the names:

names(ldat) <- nams

Giving:

> ldat
$path1
[1] "gene1" "gene2"

$path2
[1] "gene3" "gene4" "gene5" "gene6"

$path3
[1] "gene7" "gene8" "gene9"

Ditto your solution. My regex-fu is weak so didn't see an easy way of working with `scan()`. — Gavin Simpson, Jul 06 '11 at 21:35
This solution is prone to a potentially difficult to find bug: https://stackoverflow.com/questions/32066049, you should first get the max number of the columns — alephreish, Feb 12 '19 at 14:57
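
One way to guard against the problem raised in the comment above is sketched here, assuming whitespace-separated input. read.table() with fill = TRUE sizes the table from the first five lines unless col.names is longer, so the sketch counts fields with count.fields() first and passes that many column names. The sample `lines` vector and the `V1`, `V2`, ... column names are illustrative assumptions, not part of the original answer.

```r
# Made-up sample in which the longest row is NOT among the first five lines
lines <- c("path1 gene1 gene2",
           "path2 gene3",
           "path3 gene4",
           "path4 gene5",
           "path5 gene6",
           "path6 gene7 gene8 gene9 gene10")

# count.fields() reports the number of whitespace-separated fields per line
con <- textConnection(lines)
n_cols <- max(count.fields(con))
close(con)

# Supplying col.names of that length tells read.table() how wide the table is
dat <- read.table(text = lines, fill = TRUE, stringsAsFactors = FALSE,
                  col.names = paste0("V", seq_len(n_cols)))

# Same conversion as in the answer: one list element per row, blanks removed
ldat <- lapply(split(dat[, -1], seq_len(nrow(dat))), function(x) x[x != ""])
names(ldat) <- dat[, 1]
```

For a real file, the same two calls would take the file name in place of the text connection and the `text` argument.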

score 3 · Answer 3 · answered Jul 06 '11 at 21:21

A quick solution based on the linked page...

inlist <- strsplit(readLines("file.txt"), "[[:space:]]+")
pathways <- lapply(inlist, tail, n = -1)
names(pathways) <- lapply(inlist, head, n = 1)

JAShapiro

I thought about using `readLines` but it's going to give missing values (`""`) for blank lines (perhaps at the end of the file?). – Joshua Ulrich Jul 06 '11 at 21:27
Yes, I noticed that. If you use the connection from my Answer and do `readLines(con)` you'll see this newline problem. – Gavin Simpson Jul 06 '11 at 21:32
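
The blank-line issue discussed in these comments can be sidestepped by dropping empty lines before splitting. The following is a minimal sketch under that assumption; the `raw` vector stands in for what readLines() might return on a file with blank lines and is made up for illustration.

```r
# Stand-in for readLines("file.txt") on a file that contains blank lines
raw <- c("path1   gene1 gene2",
         "",
         "path2   gene3 gene4 gene5 gene6",
         "")

# Strip leading/trailing whitespace, then keep only non-empty lines
raw <- trimws(raw)
raw <- raw[nzchar(raw)]

# Same approach as the answer: split on whitespace, name by the first field
inlist <- strsplit(raw, "[[:space:]]+")
pathways <- lapply(inlist, tail, n = -1)
names(pathways) <- sapply(inlist, head, n = 1)
```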

score 3 · Answer 4 · answered Jul 06 '11 at 21:33

One more solution:

sl <- c("path1 gene1 gene2", "path2 gene1 gene2 gene3") # created by readLines 
f <- function(l, s) {
  v <- strsplit(s, " ")[[1]]
  l[[v[1]]] <- v[-1]  # everything after the name; also safe if the line has no genes
  return(l)
}
res <- Reduce(f, sl, list())

Karsten W.

+1 Nice use of `Reduce`. The OP's file has multiple spaces though, so you need to handle that in your `strsplit` call. – Joshua Ulrich Jul 06 '11 at 21:37
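
Following the suggestion in the comment above, a version of the Reduce() approach that tolerates runs of spaces or tabs might look like this. It is a sketch, not the answerer's code: the split pattern is borrowed from the other answers, and `sl` is sample data mimicking the multi-space layout in the question.

```r
# Sample lines with multiple spaces after the pathway name, as in the question
sl <- c("path1   gene1 gene2",
        "path2   gene3 gene4 gene5 gene6")

# Add one named element to the accumulating list per input line
f <- function(l, s) {
  v <- strsplit(s, "[[:space:]]+")[[1]]  # split on one or more whitespace characters
  l[[v[1]]] <- v[-1]                     # name = first field, value = the rest
  l
}

res <- Reduce(f, sl, list())
str(res)
```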
