Does `tfread` exist?

Question

In R is there an efficient way to read a transposed .csv file?

For example consider the following text file:

Name,Peter,Paul,Marry
Age,40,9,38

This could be read into a data.table with useful column classes using:

library(data.table)    
file <-  tempfile("tmp.txt")
writeLines("Name,Peter,Paul,Mary\nAge,40,5,38\n", file)    

lines <- readLines(file)
lines <- lapply(lines, function(x) gsub(pattern=",", replacement="\n", x, fixed=TRUE))
lines <- lapply(lines[-3], fread)
do.call(cbind,lines)
#>     Name Age
#> 1: Peter  40
#> 2:  Paul   5
#> 3:  Mary  38

Is there a simpler way to achieve this? Is there a more efficient version (my file is 1 GB)?

Note, that such column-major storage should be easier to read for a column-wise storage as in a data.table.

@DirkEddelbuettel is it? I know fread (I am even using it here). The problem is that "rows are columns", i.e. fread would read everything as character. — jan-glx, Mar 02 '18 at 18:43
You may need "something" for the transpose, and mayne reading it all via `fread()`, transposing, then writing out again and re-reading is still best ... because otherwise you will _not_ get the per-final-column type info hidden now in each row. — Dirk Eddelbuettel, Mar 02 '18 at 18:44
@Gregor yes, `> t(fread(file)) [,1] Name "Age" Peter "40" Paul "5" Mary "38" `, i.e. the Age column is `character` instead of `integer`. — jan-glx, Mar 02 '18 at 18:48
Because type is guestimated _by column_ -- see my previous comment. — Dirk Eddelbuettel, Mar 02 '18 at 18:51
I'd suggest you look at `chunk.map` from the `iotools` package. You should be able to read in multiple rows from the file (a chunk), perform the splitting on this chunk, transpose, and write out each chunk to a file or file connection. It is designed to be super fast and the chunking should reduce overall memory use for this part of the process. In the second step, you'd have some number of files that you'd have to read in separately and `cbind` or similar. — lmo, Mar 02 '18 at 18:53
Perhaps use something that excels in text manipulation rather than using something in `R`. Searching "transpose a file" gives a number of useful answers, such as https://stackoverflow.com/questions/1729824/an-efficient-way-to-transpose-a-file-in-bash in which `awk` and `perl` solutions are given. — ngm, Mar 02 '18 at 19:02
@jan-glx If you care about the column class, use `type.convert`. This variant of Gregor's code works for the example: `d = fread(x, header=FALSE); d[, lapply(transpose(.SD[,-1]), type.convert)][, setnames(.SD, d[[1]])]`, where `x = "Name,Peter,Paul,Marry\nAge,40,9,38"`. — Frank, Mar 02 '18 at 19:07
A workhorse function of `iotools` is `mstrsplit` which splits a character vector or raw vector and returns a matrix. In your example, you'd do `t(mstrsplit(charToRaw("Name,Peter,Paul,Mary\nAge,40,5,38\n"), sep=","))` to return a character matrix which could be written to file and then re-read using `fread` or similar. — lmo, Mar 02 '18 at 19:21
Potentially relevant: https://stackoverflow.com/questions/17288197/reading-a-csv-file-organized-horizontally — Artem Sokolov, Mar 02 '18 at 21:16

Clayton Stanley · Answer 1 · 2018-03-02T20:18:48.057

This is an implementation of @Dirk Eddelbuettel's suggested approach in the comments.

> library(data.table)                                                                                                          
> aTbl = fread("file.csv", colClasses="character", header=F)
> aTbl

     V1    V2   V3   V4
1: Name Peter Paul Mary
2:  Age    40    5   38     

> aTbl[, .SD
       ][, transpose(.SD)
       ][, setnames(.SD, .SD[1, t(.SD)])                                                                                                                   
       ][2:.N                                                                                                                  
       ][, fread(paste0(capture.output(write.csv(.SD, stdout(), row.names=F, quote=F)), collapse='\n'))                        
       ][, {bTbl <<- copy(.SD); .SD}                                                                                           
       ]  

    Name Age                                                                                                                   
1: Peter  40                                                                                                                   
2:  Paul   5                                                                                                                   
3:  Mary  38  

> lapply(bTbl, class)     

$Name                                                                                                                          
[1] "character"                                                                                                                

$Age                                                                                                                           
[1] "integer"                                                                                                                  

>

Onyambu · Answer 2 · 2018-03-02T19:45:01.200

3

DT=setDT(read.table(text=do.call(paste,transpose(fread(file,h=F))),h=T,stringsAsFactors = F))
DT
    Name Age
1: Peter  40
2:  Paul   5
3:  Mary  38



sapply(DT,class)
       Name         Age 
"character"   "integer"

edited Mar 02 '18 at 19:45

answered Mar 02 '18 at 19:27

Onyambu

67,392
3
24
53

jan-glx · Accepted Answer · 2018-03-12T10:49:49.523

Unfortunately, tfread doesn't seem to exist.

Timing of the proposed solutions for 200 observations of 20000 character/integer variables / 20000 observations of 200 character/integer variables:

readLines-fread (@jan-glx): 7 s / 1.2 s
fread-transpose-paste-read.table (@Onymambu): 8 s / 36 s
all-within-j: fread-transpose-write.csv-paste-fread (@Clayton Stanley) : 5 min / 12 s
command line transpose-fread (@jan-glx): 2.4 s / 1.6 s
iotools-paste-paste-fread (@jan-glx): 1.4 s / 1.2 s
fread-transpose-type.convert (@Frank ): 4.2 s / 3.6 s

Code:

library(data.table)    
file <-  tempfile("tmp.txt")
p <- 100 # = 200 lines/columns
n <- 10000 # = 20000 values per line / rows
writeLines(rep(c(paste("Name",paste0(rep(c("Peter","Paul"), n), collapse = ","), sep=","),
                 paste("Age",paste0(rep(c("40","5"), n), collapse = ","), sep=",")
               ), p), file(file,"wb"))    

system.time({ # 1
lines <- readLines(file)
lines <- lapply(lines, function(x) gsub(pattern=",", replacement="\n", x, fixed=TRUE))
lines <- lapply(lines, fread)
dt <- do.call(cbind,lines)
dim(dt)
})

system.time({ # 2
DT=setDT(read.table(text=do.call(paste,transpose(fread(file,h=F))),h=T,stringsAsFactors = F))
dim(DT)
})

system.time({ # 3
aTbl = fread(file, colClasses="character", header=F)
invisible(
  aTbl[, .SD
       ][, transpose(.SD)
         ][, setnames(.SD, .SD[1, t(.SD)])                                                                                                                   
           ][2:.N                                                                                                                  
             ][, fread(paste0(capture.output(write.csv(.SD, stdout(), row.names=F, quote=F)), collapse='\n'))                        
               ][, {bTbl <<- copy(.SD); .SD}                                                                                           
                 ]  
)
dim(bTbl)
})

system.time({ # 4 wide
  dt <- fread(paste0("transpose -t -l 20005x205 --fsep , \"", file, "\""))
  dim(dt)
})

system.time({ # 4 long
dt <- fread(paste0("transpose -t -l 205x20005 --fsep , \"", file, "\""))
dim(dt)
})

system.time({ # 5
  infile <- file(file, "rb")
  df <- iotools::chunk.tapply(infile, function(x) {
    fread(paste0(apply(iotools::mstrsplit(x, sep=","), 2, paste0, collapse = ","), collapse = "\n"))
  }, CH.MERGE = cbind)
  dim(df)
})

system.time({ # 6
d <-  fread(file, header=FALSE); 
d <- d[, lapply(transpose(.SD[,-1]), type.convert)][, setnames(.SD, d[[1]])]
dim(d)
})

score 2 · Answer 4 · answered Mar 04 '18 at 22:30

To extend @lmo 's comment to use iotools:

  dt <- iotools::chunk.tapply(file(file, "rb"),  function(x) {
    fread(paste0(apply(iotools::mstrsplit(x, sep=","), 2, paste0, collapse = ","), collapse = "\n"))
  }, CH.MERGE = cbind)

This is a bit more efficient than the other current solutions.

score 1 · Answer 5 · answered Mar 04 '18 at 21:39

To extend @ngm 's comment to use command-line tools:

download: transpose.c
compile: gcc transpose.c -o transpose
use: fread(paste0("transpose -t -l 205x20005 --fsep , \"", file, "\""))

This requires knowledge of the rough size of the table and reqires single character line endings, i.e, \n. It is quite inconvenient, but faster than the other solutions.

Does `tfread` exist?

5 Answers5