Read a text file in R line by line

Question

I would like to read a text file in R, line by line, using a for loop and with the length of the file. The problem is that it only prints character(0). This is the code:

fileName="up_down.txt"
con=file(fileName,open="r")
line=readLines(con) 
long=length(line)
for (i in 1:long){
    linn=readLines(con,1)
    print(linn)
}
close(con)

The problem is that you read the entire file in (`line=readLines(con)`) and then you continue reading the file inside the loop; at that point, there is nothing left to read. — Brian Diggs, Sep 27 '12 at 18:31
If you are looking for a way to load only one line at a time from a (maybe large) file, then the [currently accepted answer](http://stackoverflow.com/a/12627356/1067114) does not solve your problem. If, instead, you just want to process the content of a file line by line, regardless of how you load it, the question should be worded more clearly. — Francesco Napolitano, Mar 01 '17 at 13:19
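
The behaviour described in the first comment can be reproduced in a few lines. This is only a sketch: it uses a temporary file with made-up contents in place of up_down.txt. Once the first readLines() call has consumed the connection, a later call on the same connection has nothing left to return.

```r
# Temporary stand-in for up_down.txt (contents are made up for illustration)
tmp <- tempfile(fileext = ".txt")
writeLines(c("up", "down", "up"), tmp)

con <- file(tmp, open = "r")
all_lines <- readLines(con)         # consumes the whole connection
leftover  <- readLines(con, n = 1)  # nothing left to read
close(con)
unlink(tmp)

print(length(all_lines))
print(leftover)
```

So the fix is either to read everything once and index into the result, or to skip the up-front readLines(con) and read one line per iteration.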

score 160 · Answer 1 · answered Mar 03 '16 at 01:02


Be careful with readLines(...) and big files: reading every line into memory at once can be risky. Below is an example of how to read a file and process just one line at a time:

processFile = function(filepath) {
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    print(line)
  }

  close(con)
}

Be aware that even reading a single line into memory carries a risk: a big file without line breaks can still fill your memory.
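
If one line per call turns out to be slow, a possible middle ground is to read a fixed number of lines per call. The sketch below is an illustration rather than part of the original answer: process_in_chunks, handler and chunk_size are hypothetical names, and the temporary file only stands in for a real input.

```r
# Hypothetical helper: read `filepath` in blocks of `chunk_size` lines and
# pass each block to `handler`, so only one block is in memory at a time.
process_in_chunks <- function(filepath, handler, chunk_size = 1000L) {
  con <- file(filepath, open = "r")
  on.exit(close(con))               # close the connection even on error
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break  # end of file
    handler(chunk)
    total <- total + length(chunk)
  }
  total                             # number of lines processed
}

# Usage with a made-up 25-line file and an arbitrary chunk size of 10
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:25), tmp)
n_seen <- process_in_chunks(tmp, function(chunk) print(length(chunk)),
                            chunk_size = 10L)
unlink(tmp)
```

The same caveat applies as above: a chunk is only as small as its lines, so very long lines can still use a lot of memory.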

answered Mar 03 '16 at 01:02

dvd


13

This should really be the accepted answer, as the others will run into issues with large files. – theduke Sep 07 '16 at 08:45
4

This seems to be the right way to parse a large file line by line. The other answers read all lines into memory and then loop over that in-memory object, which is a completely different approach. – Nan Zhou Mar 14 '17 at 09:09
6

readLines documentation: **"If the connection is open it is read from its current position."** It's what makes the loop work. – San Nov 24 '18 at 14:45
1

If the file contains empty lines, does it *break* this script? Otherwise excellent solution, kudos! (The others below are mostly worthless, as they read the entire file into memory.) – jena Apr 22 '21 at 14:23
@jena No, empty lines do not break the script. When readLines() reads an empty line, it still returns a character vector of length 1 holding the empty string "". Hence length(line) == 1 and the loop continues. – jorvaor Jun 02 '21 at 20:00
@theduke With memory growing every year, there are more and more files for which [the originally-accepted answer](https://stackoverflow.com/a/12627356/1048186) makes sense. – Josiah Yoder Jun 10 '21 at 18:19
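
The empty-line question raised in the comments can be checked directly. A small sketch, using a temporary file with made-up contents: an empty line and the end of the file give results of different lengths, which is why the length(line) == 0 test only triggers at the end.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("first", "", "third"), tmp)

con <- file(tmp, open = "r")
first  <- readLines(con, n = 1)
blank  <- readLines(con, n = 1)  # the empty line
third  <- readLines(con, n = 1)
at_eof <- readLines(con, n = 1)  # past the last line
close(con)
unlink(tmp)

print(length(blank))   # length of the result for the empty line
print(length(at_eof))  # length of the result at end of file
```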

Dirk Eddelbuettel · Answer 2 · 2021-06-10T18:54:31.183

Just use readLines on your file:

R> res <- readLines(system.file("DESCRIPTION", package="MASS"))
R> length(res)
[1] 27
R> res
 [1] "Package: MASS"                                                                  
 [2] "Priority: recommended"                                                          
 [3] "Version: 7.3-18"                                                                
 [4] "Date: 2012-05-28"                                                               
 [5] "Revision: $Rev: 3167 $"                                                         
 [6] "Depends: R (>= 2.14.0), grDevices, graphics, stats, utils"                      
 [7] "Suggests: lattice, nlme, nnet, survival"                                        
 [8] "Authors@R: c(person(\"Brian\", \"Ripley\", role = c(\"aut\", \"cre\", \"cph\"),"
 [9] "        email = \"ripley@stats.ox.ac.uk\"), person(\"Kurt\", \"Hornik\", role"  
[10] "        = \"trl\", comment = \"partial port ca 1998\"), person(\"Albrecht\","   
[11] "        \"Gebhardt\", role = \"trl\", comment = \"partial port ca 1998\"),"     
[12] "        person(\"David\", \"Firth\", role = \"ctb\"))"                          
[13] "Description: Functions and datasets to support Venables and Ripley,"            
[14] "        'Modern Applied Statistics with S' (4th edition, 2002)."                
[15] "Title: Support Functions and Datasets for Venables and Ripley's MASS"           
[16] "License: GPL-2 | GPL-3"                                                         
[17] "URL: http://www.stats.ox.ac.uk/pub/MASS4/"                                      
[18] "LazyData: yes"                                                                  
[19] "Packaged: 2012-05-28 08:47:38 UTC; ripley"                                      
[20] "Author: Brian Ripley [aut, cre, cph], Kurt Hornik [trl] (partial port"          
[21] "        ca 1998), Albrecht Gebhardt [trl] (partial port ca 1998), David"        
[22] "        Firth [ctb]"                                                            
[23] "Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>"                               
[24] "Repository: CRAN"                                                               
[25] "Date/Publication: 2012-05-28 08:53:03"                                          
[26] "Built: R 2.15.1; x86_64-pc-mingw32; 2012-06-22 14:16:09 UTC; windows"           
[27] "Archs: i386, x64"                                                               
R>

There is an entire manual devoted to this.

I am using readLines, but I just don't get why I get that error — Layla, Sep 27 '12 at 17:20
When you say there is a whole manual devoted to it, you should also tell us which manual it is. — U. Windl, Jan 23 '18 at 15:26
@U.Windl I think he means the manual entry you get by typing `?readLines` at the prompt. That is, [this manual page](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readLines) — Josiah Yoder, Jun 10 '21 at 18:16
@U.Windl Looking closer, there was a broken link hiding within Dirk's answer. I've tried to restore it, but the link is also dead. — Josiah Yoder, Jun 10 '21 at 18:23
@JosiahYoder Your link was also wrong. Capitalisation matters. — Dirk Eddelbuettel, Jun 10 '21 at 18:55

score 45 · Accepted Answer · edited Oct 19 '15 at 21:17


Here is the solution with a for loop. Importantly, it takes the one call to readLines out of the for loop so that it is not improperly called again and again. Here it is:

fileName <- "up_down.txt"
conn <- file(fileName, open = "r")
linn <- readLines(conn)       # read every line once, outside the loop
for (i in seq_along(linn)) {  # seq_along() also copes with an empty file
   print(linn[i])
}
close(conn)

edited Oct 19 '15 at 21:17

Shawn Mehan


answered Sep 27 '12 at 17:56

Layla


3

You don't need the for loop at all since you're printing the entire vector. Just `print(linn)` suffices. – Assad Ebrahim Apr 07 '14 at 04:41
2

Very good answer. By convention, R code normally uses "<-" rather than "=" for assignment. – Ryan Aug 01 '14 at 19:06
10

well what happens if you have a 30 gig file? – Chris Oct 19 '15 at 23:13
1

@Chris you use the only correct answer by dvd ;) – jena Jun 03 '21 at 13:23
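
One more detail about looping over the result of readLines(), sketched here with an empty character vector standing in for the contents of an empty file: 1:length(x) counts down from 1 to 0 when x is empty, so the loop body would run twice on missing elements, whereas seq_along(x) gives an empty sequence.

```r
linn <- character(0)           # what readLines() returns for an empty file

bad_index  <- 1:length(linn)   # counts down instead of being empty
good_index <- seq_along(linn)  # empty sequence, loop body never runs

visited <- 0L
for (i in good_index) visited <- visited + 1L

# Iterating over the values directly avoids the index altogether
for (line in linn) print(line)

print(bad_index)
print(visited)
```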

score 4 · Answer 4 · answered Feb 09 '17 at 06:12

I wrote some code that reads a file line by line for a case where different lines hold different data types. It follows the articles read-line-by-line-of-a-file-in-r and determining-number-of-linesrecords, and I think it should also be a better solution for big files. My R version is 3.3.2.

con <- file("pathtotargetfile", "r")
readsizeof <- 2   # lines to read per step while counting
nooflines <- 0    # running line count
# Count the lines a few at a time so the whole file is never held in memory
while ((linesread <- length(readLines(con, readsizeof))) > 0)
  nooflines <- nooflines + linesread
close(con)        # the counting pass left the cursor at the end of the file

con <- file("pathtotargetfile", "r")   # reopen to start again from the top
# Expected type of each line, one entry per line of the file:
# 0 means the line is numeric, 'c' means it is character. This suits my data.
typelist <- list(0, 'c', 0, 'c', 0, 0, 'c', 0)
for (i in seq_len(nooflines)) {
  tmp <- scan(file = con, nlines = 1, what = typelist[[i]], quiet = TRUE)
  print(is.vector(tmp))
  print(tmp)
}
close(con)
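
To try the scan() approach without a real data file, here is a self-contained sketch. The three-line temporary file and the templates list are made up for illustration; as in the answer, 0 is used as the template for a numeric line and a character value for a text line.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 2 3", "a b", "4.5"), tmp)

# One template per line of the made-up file: numeric, character, numeric
templates <- list(0, "", 0)

con <- file(tmp, open = "r")
parsed <- lapply(templates, function(tpl) {
  # Each call reads one line from the connection's current position
  scan(file = con, what = tpl, nlines = 1, quiet = TRUE)
})
close(con)
unlink(tmp)

str(parsed)
```

Because the connection stays open between calls, each scan() picks up where the previous one stopped, so no line count is needed when the number of templates is known.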

score 1 · Answer 5 · answered Nov 01 '18 at 22:37


I suggest you check out chunked and disk.frame. They both have functions for reading in CSVs chunk-by-chunk.

In particular, disk.frame::csv_to_disk.frame may be the function you are after?

answered Nov 01 '18 at 22:37

xiaodai


Also check out the [LaF](https://cran.r-project.org/web/packages/LaF/index.html) package. [chunked](https://cran.rstudio.com/web/packages/chunked/index.html) is actually a wrapper for LaF which makes things easier sometimes. – San Nov 24 '18 at 14:55
`disk.frame` looks great and it includes support for two of my favorite packages - `data.table` and `fst` which are among the most efficient of their kind. Can you kindly point out further documentation/examples of `disk.frame` other than that available in the github page. – San Nov 24 '18 at 16:24
@san i am writing them at the moment. You can check out the vignette folder or go into inst/fannie_mae for more examples – xiaodai Nov 24 '18 at 20:46
For storing larger-than-RAM data sets, `disk.frame` can be an alternative to `MonetDbLite`. I hope it makes it to CRAN soon. – San Nov 25 '18 at 07:33
there's more doc https://github.com/xiaodaigh/disk.frame and a vignette now @San https://rpubs.com/xiaodai/intro-disk-frame – xiaodai Feb 01 '19 at 23:32
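
For readers who want to see the chunk-by-chunk idea without installing anything, here is a rough base R sketch. It does not use the chunked or disk.frame APIs. It reads a few raw lines at a time with readLines() and parses each block with read.csv(text = ...). The file contents, the column names and the chunk size of 3 are all made up, and the approach assumes no quoted field contains a line break.

```r
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:7, value = (1:7) * 1.5), tmp, row.names = FALSE)

con <- file(tmp, open = "r")
header <- readLines(con, n = 1)   # keep the header to reuse for every chunk
row_total <- 0L
value_sum <- 0
repeat {
  rows <- readLines(con, n = 3)   # arbitrary chunk size
  if (length(rows) == 0L) break   # end of file
  chunk <- read.csv(text = c(header, rows))
  row_total <- row_total + nrow(chunk)
  value_sum <- value_sum + sum(chunk$value)
}
close(con)
unlink(tmp)

print(row_total)
print(value_sum)
```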

Phillip Long · Answer 6 · 2021-06-24T21:11:41.197

fileName = "up_down.txt"

### code to get the line count of the file
length_connection = pipe(paste("cat ", fileName, " | wc -l", sep = "")) # "cat fileName | wc -l" because that returns just the line count, and NOT the name of the file with it
long = as.numeric(trimws(readLines(con = length_connection, n = 1)))
close(length_connection) # make sure to close the connection
###

for (i in 1:long){

    ### code to extract a single line at row i from the file
    linn_connection_cmd = paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ") # extracts one line from fileName at the desired line number (i)
    linn_connection = pipe(linn_connection_cmd)
    linn = readLines(con = linn_connection, n = 1)
    close(linn_connection) # make sure to close the connection
    ###
    
    # the line is now loaded into R and anything can be done with it
    print(linn)
}

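By using R's pipe() command, and using shell commands to extract what we want, the full file is never loaded into R; it is read in line by line. Note that every iteration starts a new shell pipeline that scans the file again from the top, so this gets slow for files with many lines.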

paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ")

It is this command that does all the work; it extracts one line from the desired file.

Edit: by default, R converts a number to text in ordinary notation while it is below 100,000, but can switch to scientific notation once it reaches 100,000 (1e+05). format(x = i, scientific = FALSE, big.mark = "") is therefore used when building the command, so that the shell always receives a plain integer, which is the only form head understands. If the command contains a number like 1e+05, head fails with the following error:

head: 1e+05: invalid number of lines
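
The command string can also be built and inspected separately before it is handed to pipe(). A small sketch: line_command is a hypothetical helper and "my file.txt" is a made-up file name. It adds shQuote() so that paths containing spaces survive the shell, which the code above does not do.

```r
# Hypothetical helper: shell command that prints line `i` of `path`
line_command <- function(path, i) {
  sprintf("head -n %s %s | tail -n 1",
          format(i, scientific = FALSE, big.mark = ""),  # plain digits only
          shQuote(path, type = "sh"))                    # protect spaces etc.
}

cmd <- line_command("my file.txt", 1e5)
print(cmd)
print(as.character(1e5))  # what paste() would embed without format()
```

The resulting string can be passed to pipe() exactly as linn_connection_cmd is above.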
