7

In R is there an efficient way to read a transposed .csv file?

For example consider the following text file:

Name,Peter,Paul,Marry
Age,40,9,38

This could be read into a data.table with useful column classes using:

library(data.table)    
file <-  tempfile("tmp.txt")
writeLines("Name,Peter,Paul,Mary\nAge,40,5,38\n", file)    

lines <- readLines(file)
lines <- lapply(lines, function(x) gsub(pattern=",", replacement="\n", x, fixed=TRUE))
lines <- lapply(lines[-3], fread)
do.call(cbind,lines)
#>     Name Age
#> 1: Peter  40
#> 2:  Paul   5
#> 3:  Mary  38

Is there a simpler way to achieve this? Is there a more efficient version (my file is 1 GB)?

Note, that such column-major storage should be easier to read for a column-wise storage as in a data.table.

jan-glx
  • 7,611
  • 2
  • 43
  • 63
  • Have you tried `t(fread(your_file))`? – Gregor Thomas Mar 02 '18 at 18:42
  • @DirkEddelbuettel is it? I know fread (I am even using it here). The problem is that "rows are columns", i.e. fread would read everything as character. – jan-glx Mar 02 '18 at 18:43
  • 2
    Send the data back and ask them to create it correctly. :) – Rich Scriven Mar 02 '18 at 18:44
  • 2
    You may need "something" for the transpose, and mayne reading it all via `fread()`, transposing, then writing out again and re-reading is still best ... because otherwise you will _not_ get the per-final-column type info hidden now in each row. – Dirk Eddelbuettel Mar 02 '18 at 18:44
  • @Gregor yes, `> t(fread(file)) [,1] Name "Age" Peter "40" Paul "5" Mary "38" `, i.e. the Age column is `character` instead of `integer`. – jan-glx Mar 02 '18 at 18:48
  • Because type is guestimated _by column_ -- see my previous comment. – Dirk Eddelbuettel Mar 02 '18 at 18:51
  • I'd suggest you look at `chunk.map` from the `iotools` package. You should be able to read in multiple rows from the file (a chunk), perform the splitting on this chunk, transpose, and write out each chunk to a file or file connection. It is designed to be super fast and the chunking should reduce overall memory use for this part of the process. In the second step, you'd have some number of files that you'd have to read in separately and `cbind` or similar. – lmo Mar 02 '18 at 18:53
  • 2
    Perhaps use something that excels in text manipulation rather than using something in `R`. Searching "transpose a file" gives a number of useful answers, such as https://stackoverflow.com/questions/1729824/an-efficient-way-to-transpose-a-file-in-bash in which `awk` and `perl` solutions are given. – ngm Mar 02 '18 at 19:02
  • 1
    @jan-glx If you care about the column class, use `type.convert`. This variant of Gregor's code works for the example: `d = fread(x, header=FALSE); d[, lapply(transpose(.SD[,-1]), type.convert)][, setnames(.SD, d[[1]])]`, where `x = "Name,Peter,Paul,Marry\nAge,40,9,38"`. – Frank Mar 02 '18 at 19:07
  • 1
    A workhorse function of `iotools` is `mstrsplit` which splits a character vector or raw vector and returns a matrix. In your example, you'd do `t(mstrsplit(charToRaw("Name,Peter,Paul,Mary\nAge,40,5,38\n"), sep=","))` to return a character matrix which could be written to file and then re-read using `fread` or similar. – lmo Mar 02 '18 at 19:21
  • 2
    Potentially relevant: https://stackoverflow.com/questions/17288197/reading-a-csv-file-organized-horizontally – Artem Sokolov Mar 02 '18 at 21:16

5 Answers5

3

This is an implementation of @Dirk Eddelbuettel's suggested approach in the comments.

> library(data.table)                                                                                                          
> aTbl = fread("file.csv", colClasses="character", header=F)
> aTbl

     V1    V2   V3   V4
1: Name Peter Paul Mary
2:  Age    40    5   38     

> aTbl[, .SD
       ][, transpose(.SD)
       ][, setnames(.SD, .SD[1, t(.SD)])                                                                                                                   
       ][2:.N                                                                                                                  
       ][, fread(paste0(capture.output(write.csv(.SD, stdout(), row.names=F, quote=F)), collapse='\n'))                        
       ][, {bTbl <<- copy(.SD); .SD}                                                                                           
       ]  

    Name Age                                                                                                                   
1: Peter  40                                                                                                                   
2:  Paul   5                                                                                                                   
3:  Mary  38  

> lapply(bTbl, class)     

$Name                                                                                                                          
[1] "character"                                                                                                                

$Age                                                                                                                           
[1] "integer"                                                                                                                  

> 
Clayton Stanley
  • 7,513
  • 9
  • 32
  • 46
3
DT=setDT(read.table(text=do.call(paste,transpose(fread(file,h=F))),h=T,stringsAsFactors = F))
DT
    Name Age
1: Peter  40
2:  Paul   5
3:  Mary  38



sapply(DT,class)
       Name         Age 
"character"   "integer" 
Onyambu
  • 67,392
  • 3
  • 24
  • 53
2

Unfortunately, tfread doesn't seem to exist.

Timing of the proposed solutions for 200 observations of 20000 character/integer variables / 20000 observations of 200 character/integer variables:

  1. readLines-fread (@jan-glx): 7 s / 1.2 s
  2. fread-transpose-paste-read.table (@Onymambu): 8 s / 36 s
  3. all-within-j: fread-transpose-write.csv-paste-fread (@Clayton Stanley) : 5 min / 12 s
  4. command line transpose-fread (@jan-glx): 2.4 s / 1.6 s
  5. iotools-paste-paste-fread (@jan-glx): 1.4 s / 1.2 s
  6. fread-transpose-type.convert (@Frank ): 4.2 s / 3.6 s

Code:

library(data.table)    
file <-  tempfile("tmp.txt")
p <- 100 # = 200 lines/columns
n <- 10000 # = 20000 values per line / rows
writeLines(rep(c(paste("Name",paste0(rep(c("Peter","Paul"), n), collapse = ","), sep=","),
                 paste("Age",paste0(rep(c("40","5"), n), collapse = ","), sep=",")
               ), p), file(file,"wb"))    

system.time({ # 1
lines <- readLines(file)
lines <- lapply(lines, function(x) gsub(pattern=",", replacement="\n", x, fixed=TRUE))
lines <- lapply(lines, fread)
dt <- do.call(cbind,lines)
dim(dt)
})

system.time({ # 2
DT=setDT(read.table(text=do.call(paste,transpose(fread(file,h=F))),h=T,stringsAsFactors = F))
dim(DT)
})

system.time({ # 3
aTbl = fread(file, colClasses="character", header=F)
invisible(
  aTbl[, .SD
       ][, transpose(.SD)
         ][, setnames(.SD, .SD[1, t(.SD)])                                                                                                                   
           ][2:.N                                                                                                                  
             ][, fread(paste0(capture.output(write.csv(.SD, stdout(), row.names=F, quote=F)), collapse='\n'))                        
               ][, {bTbl <<- copy(.SD); .SD}                                                                                           
                 ]  
)
dim(bTbl)
})

system.time({ # 4 wide
  dt <- fread(paste0("transpose -t -l 20005x205 --fsep , \"", file, "\""))
  dim(dt)
})

system.time({ # 4 long
dt <- fread(paste0("transpose -t -l 205x20005 --fsep , \"", file, "\""))
dim(dt)
})

system.time({ # 5
  infile <- file(file, "rb")
  df <- iotools::chunk.tapply(infile, function(x) {
    fread(paste0(apply(iotools::mstrsplit(x, sep=","), 2, paste0, collapse = ","), collapse = "\n"))
  }, CH.MERGE = cbind)
  dim(df)
})

system.time({ # 6
d <-  fread(file, header=FALSE); 
d <- d[, lapply(transpose(.SD[,-1]), type.convert)][, setnames(.SD, d[[1]])]
dim(d)
})
jan-glx
  • 7,611
  • 2
  • 43
  • 63
2

To extend @lmo 's comment to use iotools:

  dt <- iotools::chunk.tapply(file(file, "rb"),  function(x) {
    fread(paste0(apply(iotools::mstrsplit(x, sep=","), 2, paste0, collapse = ","), collapse = "\n"))
  }, CH.MERGE = cbind)

This is a bit more efficient than the other current solutions.

jan-glx
  • 7,611
  • 2
  • 43
  • 63
1

To extend @ngm 's comment to use command-line tools:

  1. download: transpose.c
  2. compile: gcc transpose.c -o transpose
  3. use: fread(paste0("transpose -t -l 205x20005 --fsep , \"", file, "\""))

This requires knowledge of the rough size of the table and reqires single character line endings, i.e, \n. It is quite inconvenient, but faster than the other solutions.

jan-glx
  • 7,611
  • 2
  • 43
  • 63