
I'd like to read only the first character from each line of a text file, ignoring the rest.

Here's an example file:

x <- c(
  "Afklgjsdf;bosfu09[45y94hn9igf",
  "Basfgsdbsfgn",
  "Cajvw58723895yubjsdw409t809t80",
  "Djakfl09w50968509",
  "E3434t"
)
writeLines(x, "test.txt")

I can solve the problem by reading everything with readLines and using substring to get the first character:

lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"

This seems inefficient, though. Is there a way to persuade R to read only the first characters, rather than reading whole lines and throwing most of each one away?

I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low level file manipulation (maybe with seek).


Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:

set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)    
x2 <- vapply(
  nch, 
  function(nch)
  {
    paste0(
      sample(letters, nch, replace = TRUE), 
      collapse = ""
    )    
  },
  character(1)
)
writeLines(x2, "bigtest.txt")

Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to be using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or to treat the file as binary (Martin Morgan's readBin solution).
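
For quick reference, those approaches boil down to the following (condensed from the answers below; see each answer for the full context and timings):

## condensed from the answers below
library(stringi)
library(data.table)

## faster line readers
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)                          # Richard Scriven
substring(fread("bigtest.txt", sep = "\n", header = FALSE)[[1]], 1, 1)  # Josh O'Brien

## binary read (Martin Morgan)
x <- readBin("bigtest.txt", raw(), file.info("bigtest.txt")$size)
idx <- which(x == charToRaw("\n"))
rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple = TRUE)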

Richie Cotton
  • FYI, at the level of primitive I/O operations (on both Unix and Windows) there is no way to avoid scanning every character of the file. The operating system doesn't track where lines end. All performance variation in the various answers is probably down to how many copies of whole-line strings are being constructed and discarded. – zwol Jan 03 '15 at 18:13
  • Richie, please refer to my post on an observation using `readBin()`, especially if reading really large files. In the example shown, the step `idx = which(x == what)` would require ~7.5GB (allocating a logical vector of length(x) = 2 billion). – Arun Jan 03 '15 at 19:36

6 Answers


If you allow/have access to Unix command-line tools you can use

scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE) 

Obviously less portable but probably very fast.
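
If portability matters, one option (my sketch, not part of this answer; first_chars() is a made-up helper name) is to fall back to readLines() when a cut executable isn't on the PATH:

## hypothetical helper: use the fast `cut` pipe when available,
## otherwise fall back to readLines() + substring()
first_chars <- function(path) {
  if (nzchar(Sys.which("cut"))) {
    scan(pipe(paste("cut -c 1", shQuote(path))), what = "", quiet = TRUE)
  } else {
    substring(readLines(path), 1, 1)
  }
}

first_chars("test.txt")
## [1] "A" "B" "C" "D" "E"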

Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:

           expr         min          lq        mean      median          uq
     RC readLines   14.797830   17.083849   19.261917   18.103020   20.007341
      RS read.fwf  125.113935  133.259220  148.122596  138.024203  150.528754
 BB scan pipe cut    6.277267    7.027964    7.686314    7.337207    8.004137
      RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
          RS scan   13.927765   14.752597   16.634288   15.274470   16.992124
Ben Bolker
  • This works under Windows with Rtools installed. And it looks like there's a Windows alternative to `cut` for those that don't have Rtools (http://stackoverflow.com/q/25066360/134830), though under Windows the performance is weaker than I expected. – Richie Cotton Jan 02 '15 at 20:01
  • I'm picking Richard S's answer as the solution, since it seems that `cut` doesn't perform quite as well under Windows (perhaps inevitably), whereas `scan` works well on all platforms. This is a great answer too, though. Thanks. – Richie Cotton Jan 02 '15 at 20:43

data.table::fread() seems to beat all of the solutions so far proposed, and has the great virtue of running comparably fast on both Windows and *NIX machines:

library(data.table)
substring(fread("bigtest.txt", sep="\n", header=FALSE)[[1]], 1, 1)

Here are microbenchmark timings on a Linux box (actually a dual-boot laptop, booted up as Ubuntu):

Unit: milliseconds
             expr         min          lq        mean      median          uq        max neval
     RC readLines   15.830318   16.617075   18.294723   17.116666   18.959381   27.54451   100
        JOB fread    5.532777    6.013432    7.225067    6.292191    7.727054   12.79815   100
      RS read.fwf  111.099578  113.803053  118.844635  116.501270  123.987873  141.14975   100
 BB scan pipe cut    6.583634    8.290366    9.925221   10.115399   11.013237   15.63060   100
      RC readChar 1347.017408 1407.878731 1453.580001 1450.693865 1491.764668 1583.92091   100

And here are timings from the same laptop booted up as a Windows machine (with the command-line tool cut supplied by Rtools):

Unit: milliseconds
             expr         min          lq       mean      median          uq        max neval   cld
     RC readLines   26.653266   27.493167   33.13860   28.057552   33.208309   61.72567   100  b 
        JOB fread    4.964205    5.343063    6.71591    5.538246    6.027024   13.54647   100 a  
      RS read.fwf  213.951792  217.749833  229.31050  220.793649  237.400166  287.03953   100   c 
 BB scan pipe cut  180.963117  263.469528  278.04720  276.138088  280.227259  387.87889   100    d 
      RC readChar 1505.263964 1572.132785 1646.88564 1622.410703 1688.809031 2149.10773   100     e
Josh O'Brien
  • Just so I understand this: the `\b` is a regex word boundary? Is that to guarantee that the file will be read as a single-column data table? – Richie Cotton Jan 02 '15 at 20:49
  • @RichieCotton -- No, it's actually the backspace character, [as documented here](http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Literal-constants). I was just needing **some** character that's unlikely/guaranteed not to be found in any text file (so that each line ends up getting read into a single column). `"\a"`, the "bell" character might've actually been a better choice... – Josh O'Brien Jan 02 '15 at 20:59
  • I'm having trouble with `\a` and `\b` as separators. Should `\n` work? – Ben Bolker Jan 02 '15 at 21:04
  • On Linux, with `\n` I get a mean of 8.3 ms for scan/pipe/cut and 6.7 ms for fread. – Ben Bolker Jan 02 '15 at 21:08
  • @BenBolker -- Yes, thanks, `\n` is much better -- and in fact required (at least) by the current **data.table** package run under *NIX. Post is now fixed to include your suggested edit. – Josh O'Brien Jan 02 '15 at 21:18

Figure out the file size, read it in as a single binary blob, find the offsets of the characters of interest (don't count the last '\n' at the end of the file!), and coerce to the final form:

f0 <- function() {
    sz <- file.info("bigtest.txt")$size
    what <- charToRaw("\n")
    x <- readBin("bigtest.txt", raw(), sz)    # the whole file as a raw vector
    idx <- which(x == what)                   # positions of the newlines
    ## first byte of the file, plus the byte after each newline except the last
    rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple = TRUE)
}

The data.table solution (I think the fastest so far -- note that header = FALSE is needed so the first line is included as part of the data!)

library(data.table)
f1 <- function()
    substring(fread("bigtest.txt", header=FALSE)[[1]], 1, 1)

and in comparison

> identical(f0(), f1())
[1] TRUE
> library(microbenchmark)
> microbenchmark(f0(), f1())
Unit: milliseconds
 expr      min       lq     mean    median        uq       max neval
 f0() 5.144873 5.515219 5.571327  5.547899  5.623171  5.897335   100
 f1() 9.153364 9.470571 9.994560 10.162012 10.350990 11.047261   100

Still wasteful, since the entire file is read into memory before mostly being discarded.
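
If memory is the concern, a chunked variant (my sketch, not from this answer; f0_chunked() is a made-up name) reads the file through a connection in fixed-size raw pieces, so the whole blob never has to be held at once:

## sketch only: mirror f0()'s logic chunk by chunk -- keep the first byte of
## the file plus the byte that follows each newline
f0_chunked <- function(path = "bigtest.txt", chunk_size = 1e6L) {
    nl <- charToRaw("\n")
    con <- file(path, "rb")
    on.exit(close(con))
    pieces <- list()
    carry <- TRUE                              # the file's first byte starts a line
    repeat {
        x <- readBin(con, raw(), chunk_size)
        if (length(x) == 0L) break
        is_nl <- x == nl
        starts <- c(carry, head(is_nl, -1L))   # byte i starts a line if byte i-1 was '\n'
        pieces[[length(pieces) + 1L]] <- rawToChar(x[starts], multiple = TRUE)
        carry <- is_nl[length(is_nl)]
    }
    unlist(pieces)
}

For a file that ends with a newline (as writeLines() produces), this should return the same vector as f0(), at the cost of some extra R-level bookkeeping per chunk.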

Martin Morgan

01/04/2015 Edited to bring the better solution to the top.


Update 2: Changing the scan() method to run on an open connection, instead of opening and closing the file on every iteration, lets us read all the lines in one call and eliminates the looping. The timing improved quite a bit.

## scan() on open connection 
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)

I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.

## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)

Here are the timings for these two methods.

## timings
library(microbenchmark)

microbenchmark(
    scan = {
        conn <- file("bigtest.txt", "rt")
        substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
        close(conn)
    },
    stringi = {
        stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
    }
)
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval
#    scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646   100
# stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421   100

Original [slower] answer:

You could try read.fwf() (fixed-width file), setting the field width to 1 to capture the first character on each line.

read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"

Not fully tested of course, but it works for the test file, and it's a nice function for getting substrings without having to keep the rest of each line.


Update 1: read.fwf() is not very efficient; it calls scan() and read.table() internally. We can skip the middlemen and try scan() directly.

lines <- count.fields("test.txt")   ## length is num of lines in file
skip <- seq_along(lines) - 1        ## set up the 'skip' arg for scan()
read <- function(n) {
    ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet=TRUE)
    substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"

version$platform
# [1] "x86_64-pc-linux-gnu"
Rich Scriven
  • This is nice, but oddly slow. I think, from a quick glance at the `read.fwf` source, that the function is doing something insane like calling `readLines`, then splitting the content, then writing to a temp file, then reading it in a second time with `read.table`. – Richie Cotton Jan 02 '15 at 19:45
  • Unfortunately, your second solution turns out to be **really** slow when applied to the 10000 line file `"bigtest.txt"`, clocking in at 142187.5 milliseconds in my single microbenchmark test of it. (Also, unlike the other solutions benchmarked in my answer, it doesn't handle lines with spaces in them.) – Josh O'Brien Jan 02 '15 at 22:11
  • @JoshO'Brien - I've improved the `scan` method and added a `stringi` option that's pretty fast – Rich Scriven Jan 02 '15 at 23:53
  • @RichardScriven Does that compare favorably with the OP's `substring(readLines())` solution, or is it about equally fast? – Josh O'Brien Jan 03 '15 at 00:21
  • @JoshO'Brien - With bigtest.txt, `readLines` clocks at 44.4ms on my machine. So stringi is a bit better here. `scan` is about a wash – Rich Scriven Jan 03 '15 at 00:31
  • @RichardScriven -- Thanks. Martin Morgan just added an answer that's fastest-so-far. Interestingly, his uses `readBin()`, which is what `str_read_lines()` is using "under the hood". – Josh O'Brien Jan 03 '15 at 00:34

Benchmarks for each answer, under Windows.

library(microbenchmark)
microbenchmark(
  "RC readLines" = {
    lines <- readLines("test.txt")
    substring(lines, 1, 1)
  },
  "RS read.fwf" = read.fwf("test.txt", 1, stringsAsFactors = FALSE)$V1,
  "BB scan pipe cut" = scan(pipe("cut -c 1 test.txt"),what=character()),
  "RC readChar" = {  
    con <- file("test.txt", "r")
    x <- readChar(con, 1)
    while(length(ch <- readChar(con, 1)) > 0)
    {
      if(ch == "\n")
      {
        x <- c(x, readChar(con, 1))
      }
    }
    close(con)
  } 
)

## Unit: microseconds
##              expr        min         lq        mean     median          uq        max neval
##      RC readLines    561.598    712.876    830.6969    753.929    884.8865   2156.896   100
##       RS read.fwf   5079.010   6429.225   6772.2883   6837.697   7153.3905   8421.090   100
##  BB scan pipe cut 308195.548 309941.510 313476.6015 310304.412 310772.0005 510185.114   100
##       RC readChar   1238.963   1549.320   1929.4165   1612.952   1740.8300  26437.370   100

And on the bigger dataset:

## Unit: milliseconds
##              expr         min          lq       mean      median          uq         max neval
##      RC readLines   52.212563   84.496008   96.48517  103.319789  104.124623  158.086020    20
##       RS read.fwf  391.371514  660.029853  703.51134  766.867222  777.795180  799.670185    20
##  BB scan pipe cut  283.442150  482.062337  516.70913  562.416766  564.680194  567.089973    20
##       RC readChar 2819.343753 4338.041708 4500.98579 4743.174825 4921.148501 5089.594928    20
##           RS scan    2.088749    3.643816    4.16159    4.651449    4.731706    5.375819    20
Richie Cotton
  • `cut` is slower on Linux too; means in the order given above are (RC=140, RS=1399, BB=5225, RC=615). I suspect the answers would change a whole lot with a larger input file ... – Ben Bolker Jan 02 '15 at 20:09

I don't find it very informative to benchmark operations on the order of micro- or milliseconds, but I understand that in some cases it can't be avoided. Even then, I find it essential to test data of different (increasing) sizes to get a rough measure of how well each method scales.

Here are the results of running @MartinMorgan's f0() and f1() on files of 1e4, 1e5 and 1e6 rows (a sketch of a possible benchmarking harness follows the timings):

1e4

# Unit: milliseconds
#  expr      min       lq     mean   median        uq      max neval
#  f0() 4.226333 7.738857 15.47984 8.398608  8.972871 89.87805   100
#  f1() 8.854873 9.204724 10.48078 9.471424 10.143601 84.33003   100

1e5

# Unit: milliseconds
#  expr      min        lq     mean   median       uq      max neval
#  f0() 71.66205 176.57649 174.9545 184.0191 187.7107 307.0470   100
#  f1() 95.60237  98.82307 104.3605 100.8267 107.9830 205.8728   100

1e6

# Unit: seconds
#  expr      min       lq     mean   median       uq      max neval
#  f0() 1.443471 1.537343 1.561025 1.553624 1.558947 1.729900    10
#  f1() 1.089555 1.092633 1.101437 1.095997 1.102649 1.140505    10

identical(f0(), f1()) returned TRUE on all the tests.
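
For completeness, a harness along these lines could be used (my sketch, not Arun's actual script -- his comments below note that he built the larger files by replicating a smaller one; this version simply reuses the OP's generator and Martin Morgan's f0()/f1()):

## hypothetical scaling harness: regenerate bigtest.txt at each size,
## check that the two functions agree, then time them
library(data.table)        # needed by f1()
library(microbenchmark)

make_file <- function(n, path = "bigtest.txt") {
    nch <- sample(1:100, n, replace = TRUE)
    lines <- vapply(
        nch,
        function(k) paste0(sample(letters, k, replace = TRUE), collapse = ""),
        character(1)
    )
    writeLines(lines, path)
}

set.seed(2015)
for (n in c(1e4, 1e5, 1e6)) {
    make_file(n)
    stopifnot(identical(f0(), f1()))
    print(microbenchmark(f0(), f1(), times = if (n < 1e6) 100L else 10L))
}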

Update:

1e7

I also ran on 1e7 rows.

On the first run, f1() (data.table) took 9.7 seconds whereas f0() took 7.8 seconds; on the second run they took 9.4s and 6.6s respectively.

However, f1() resulted in no noticeable change in memory usage while reading the entire 0.479GB file, whereas f0() resulted in a spike of 2.4GB.

Another observation:

set.seed(2015)
x2 <- vapply(
  1:1e5, 
  function(i)
  {
    paste0(
      sample(letters, 100L, replace = TRUE), 
      collapse = "_"
    )    
  },
  character(1)
)
# 10 million rows, with 200 characters each
writeLines(unlist(lapply(1:100, function(x) x2)), "bigtest.txt")

## readBin() returns a vector of ~2 billion elements (one per character)
system.time(f0()) ## explodes on memory

The readBin() step results in a vector of length ~2 billion (~1.9GB just to read the file), and the which(x == what) step then takes a further ~4.5+GB (~6.5GB in total), at which point I stopped the process.
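
As rough arithmetic (mine, not Arun's): R stores each logical in 4 bytes, so the intermediate logical vector from x == what alone accounts for roughly the ~7.5GB figure quoted in the question comments:

## memory for the logical vector x == what, with ~2e9 characters in the file
2e9 * 4 / 1024^3    # ~7.5 GiB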

fread() takes ~23 seconds in this case.

HTH

Arun
  • is there a reason the 1e5 results are an order of magnitude slower than the 1e6 results? (something else running on your machine at the same time?) (I don't think this would invalidate the relative timings, just the comparisons among sizes.) I have to say I prefer the format of `rbenchmark::benchmark` to `microbenchmark` ... – Ben Bolker Jan 03 '15 at 18:16
  • Reran again and got almost the same timings. Not sure why. I ran each of them in a separate session if that matters. – Arun Jan 03 '15 at 18:38
  • I didn't notice before that 1e6 case has 10 evals rather than 100! (although even multiplying by 10 wouldn't obviously make the numbers come out sensibly ...) – Ben Bolker Jan 03 '15 at 18:48
  • Agree that benchmarks are often irrelevant on this time scale. FWIW on my system f0() / f1() is consistently ~ 5/11, linear increase across the smaller three scales, then 10/11 at the largest. R / data.table were compiled at -O0, which could make a difference. SSD drive, x86_64-unknown-linux-gnu. The time and space bottleneck in the data.table solution is likely copying the data from the table to a character vector. Presumably in both cases the garbage collector is run at least some times, and this will have quadratic effects on time and will be difficult to control for in small #s of runs. – Martin Morgan Jan 03 '15 at 19:16
  • @MartinMorgan, any particular reason for running with no compiler optimisations? I'd think it makes more sense to run benchmarks in realistic situations. Also, I find that `readBin()` can be quite memory-consuming depending on the number of characters in the file. I've edited my post with the observation. Would be nice to know if there's a way around it. – Arun Jan 03 '15 at 19:24
  • @BenBolker, I think I know the answer to your question now. I created the 1e6L data by replicating the 1e5L data 10 times (as shown in the edit, though I create 1e7L rows there). Therefore the number of characters is 10 times that of the 1e5-row data. Since `readBin()`'s speed/memory requirements depend on the number of characters in the file (irrespective of the number of rows), I find it really hard (and tricky) to assess its performance... – Arun Jan 03 '15 at 19:40
  • Doesn't `fread("bigtest.txt", header=FALSE)[[1]]` consume at least as much memory (the file size plus 1 + nrow SEXPs?) as reading the entire file via `readBin()` (file size + 1 SEXP? yes, `which(...)` creates an additional SEXP + nrow vector)? I use -O0 for debugging, and yes, it's not the way most users would experience things (there could be significant platform-specific differences, too; as you said, it's hard to arrive at meaningful generalizations from benchmarks). – Martin Morgan Jan 03 '15 at 22:51
  • @MartinMorgan, file size for the example was ~480MB (IIRC). That's not the issue. The problem is that `readBin()` returns a vector of length = the number of characters in the file. With 10 million rows and 200 characters, that amounts to 2 billion characters. And doing `which(x == what)` on that results in allocating a 2 billion logical vector (= ~7.5GB). On my device it took a total of ~6.5GB before I stopped. Just copy/paste the code above and try both `f1()` and `f0()`. – Arun Jan 03 '15 at 23:11