78

I would like to read a text file in R line by line, using a for loop that runs over the number of lines in the file. The problem is that it only prints character(0). This is the code:

fileName="up_down.txt"
con=file(fileName,open="r")
line=readLines(con) 
long=length(line)
for (i in 1:long){
    linn=readLines(con,1)
    print(linn)
}
close(con)
zx8754
Layla
  • 12
    The problem is that you read the entire file in (`line=readLines(con)`) and then you continue reading the file inside the loop; at that point, there is nothing left to read. – Brian Diggs Sep 27 '12 at 18:31
  • 1
    If you are looking for a way to load only one line at a time from a (maybe large) file, then the [currently accepted answer](http://stackoverflow.com/a/12627356/1067114) does not solve your problem. If, instead, you just want to process the content of a file line by line, regardless of how you load it, the question should perhaps be formulated more clearly. – Francesco Napolitano Mar 01 '17 at 13:19
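
The behavior described in the first comment can be reproduced with a small sketch. The temporary file and its three lines are assumptions made up for the demonstration; only `readLines()` and the connection handling come from the question.

```r
# Write a small throwaway file (hypothetical content) to demonstrate with.
tmp <- tempfile(fileext = ".txt")
writeLines(c("up", "down", "up"), tmp)

con <- file(tmp, open = "r")
all_lines <- readLines(con)     # consumes the whole file; the position is now at the end
leftover  <- readLines(con, 1)  # nothing is left to read from this connection
close(con)
unlink(tmp)

print(length(all_lines))  # number of lines read by the first call
print(leftover)           # character(0), as in the question
```

So either index into the vector returned by the first `readLines()` call, or drop that call and read one line at a time from the connection.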

6 Answers

160

Take care with readLines(...) and big files. Reading all lines into memory can be risky. Below is an example of how to read a file and process just one line at a time:

processFile = function(filepath) {
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    print(line)
  }

  close(con)
}

Be aware of the risk of reading even a single line into memory, too: a big file without line breaks can still fill your memory.
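
A possible variation on the same idea, as a rough sketch rather than a definitive implementation: the helper name `processLines` and its `handler` argument are hypothetical, and `on.exit()` is used so the connection is closed even if the handler fails.

```r
# Hypothetical helper: applies `handler` to each line, holding one line at a time.
processLines <- function(filepath, handler = print) {
  con <- file(filepath, open = "r")
  on.exit(close(con), add = TRUE)  # close the connection even if handler() errors
  n <- 0L
  while (length(line <- readLines(con, n = 1L)) > 0L) {
    handler(line)
    n <- n + 1L
  }
  invisible(n)  # number of lines processed
}

# Usage on a throwaway file (made-up content).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first", "second", "third"), tmp)
n_seen <- processLines(tmp, handler = function(x) cat("line:", x, "\n"))
unlink(tmp)
```

Passing the per-line work in as a function keeps the reading loop separate from whatever is done with each line.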

dvd
  • 13
    This should really be the accepted answer, as the others will run into issues with large files. – theduke Sep 07 '16 at 08:45
  • 4
    This is suggested to be a right way to parse large file line by line. Other answers read in all lines into the memory, and then loop that object in the memory, which is absolutely different from this. – Nan Zhou Mar 14 '17 at 09:09
  • 6
    readLines documentation: **"If the connection is open it is read from its current position."** It's what makes the loop work. – San Nov 24 '18 at 14:45
  • 1
    If the file contains empty lines, does it *break* this script? Otherwise excellent solution, kudos! (The others below are mostly worthless, as they read the entire file into memory.) – jena Apr 22 '21 at 14:23
  • @jena No, empty lines do not break the script. I think that is because when readLines() reads an empty line, it still returns a character vector of length 1, containing an empty string. Hence, the variable line will have length == 1. – jorvaor Jun 02 '21 at 20:00
  • @theduke With memory growing every year, there are more and more files for which [the originally-accepted answer](https://stackoverflow.com/a/12627356/1048186) makes sense. – Josiah Yoder Jun 10 '21 at 18:19
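
The point about empty lines in the comments above can be checked with a short sketch. The file content is made up for the example; the distinction shown is between an empty string (length 1) and `character(0)` (length 0).

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("a", "", "b"), tmp)  # the second line is empty

con <- file(tmp, open = "r")
first  <- readLines(con, n = 1)  # "a"
blank  <- readLines(con, n = 1)  # "": one element with zero characters
third  <- readLines(con, n = 1)  # "b"
at_end <- readLines(con, n = 1)  # character(0): no elements at all
close(con)
unlink(tmp)

length(blank)   # 1, so a length(line) == 0 check does not fire on a blank line
length(at_end)  # 0, so the check fires only at the end of the file
```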
51

Just use readLines on your file:

R> res <- readLines(system.file("DESCRIPTION", package="MASS"))
R> length(res)
[1] 27
R> res
 [1] "Package: MASS"                                                                  
 [2] "Priority: recommended"                                                          
 [3] "Version: 7.3-18"                                                                
 [4] "Date: 2012-05-28"                                                               
 [5] "Revision: $Rev: 3167 $"                                                         
 [6] "Depends: R (>= 2.14.0), grDevices, graphics, stats, utils"                      
 [7] "Suggests: lattice, nlme, nnet, survival"                                        
 [8] "Authors@R: c(person(\"Brian\", \"Ripley\", role = c(\"aut\", \"cre\", \"cph\"),"
 [9] "        email = \"ripley@stats.ox.ac.uk\"), person(\"Kurt\", \"Hornik\", role"  
[10] "        = \"trl\", comment = \"partial port ca 1998\"), person(\"Albrecht\","   
[11] "        \"Gebhardt\", role = \"trl\", comment = \"partial port ca 1998\"),"     
[12] "        person(\"David\", \"Firth\", role = \"ctb\"))"                          
[13] "Description: Functions and datasets to support Venables and Ripley,"            
[14] "        'Modern Applied Statistics with S' (4th edition, 2002)."                
[15] "Title: Support Functions and Datasets for Venables and Ripley's MASS"           
[16] "License: GPL-2 | GPL-3"                                                         
[17] "URL: http://www.stats.ox.ac.uk/pub/MASS4/"                                      
[18] "LazyData: yes"                                                                  
[19] "Packaged: 2012-05-28 08:47:38 UTC; ripley"                                      
[20] "Author: Brian Ripley [aut, cre, cph], Kurt Hornik [trl] (partial port"          
[21] "        ca 1998), Albrecht Gebhardt [trl] (partial port ca 1998), David"        
[22] "        Firth [ctb]"                                                            
[23] "Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>"                               
[24] "Repository: CRAN"                                                               
[25] "Date/Publication: 2012-05-28 08:53:03"                                          
[26] "Built: R 2.15.1; x86_64-pc-mingw32; 2012-06-22 14:16:09 UTC; windows"           
[27] "Archs: i386, x64"                                                               
R> 

There is an entire manual devoted to this.

Dirk Eddelbuettel
45

Here is the solution with a for loop. Importantly, it moves the single call to readLines out of the for loop, so that it is not improperly called again and again:

fileName <- "up_down.txt"
conn <- file(fileName,open="r")
linn <- readLines(conn)          # read the whole file once, outside the loop
for (i in seq_along(linn)){      # seq_along() also copes with an empty file
   print(linn[i])
}
close(conn)
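
One edge case worth knowing when looping over indices like this: for an empty file, `1:length(x)` counts down from 1 to 0 instead of producing an empty sequence, which is why `seq_along()` is often preferred. A minimal sketch, using an empty vector to stand in for the result of reading an empty file:

```r
linn <- character(0)  # what readLines() returns for an empty file

bad_index  <- 1:length(linn)   # the sequence 1, 0: a loop over it would run twice
good_index <- seq_along(linn)  # an empty sequence: a loop over it is skipped

for (i in good_index) {
  print(linn[i])  # never reached for an empty file
}
```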
Shawn Mehan
Layla
4

I wrote code that reads a file line by line for a case where different lines hold different data types, following the articles read-line-by-line-of-a-file-in-r and determining-number-of-linesrecords. I think it should also be a better solution for big files. My R version is 3.3.2.

con = file("pathtotargetfile", "r")
readsizeof <- 2    # number of lines to read per step while counting the lines in the file
nooflines <- 0     # running total of lines
while ((linesread <- length(readLines(con, readsizeof))) > 0)    # count the lines a few at a time, so a big file is never held in memory
  nooflines <- nooflines + linesread
close(con)    # close the first connection before reopening the file

con = file("pathtotargetfile", "r")    # reopen the file, since the cursor reached the end of the file while counting the lines
typelist = list(0,'c',0,'c',0,0,'c',0)    # the data type of each line: line 1 has the same type as 0 (numeric), line 2 the same type as 'c' (character), and so on. This meets my needs.
for (i in seq_len(nooflines)) {
  tmp <- scan(file=con, nlines=1, what=typelist[[i]], quiet=TRUE)
  print(is.vector(tmp))
  print(tmp)
}
close(con)
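
To make the per-line typing concrete, here is a small self-contained sketch of the same `scan()` technique. The two-line file and the shortened `typelist` are assumptions for illustration, not the data from this answer.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 2 3", "a b c"), tmp)  # line 1 numeric, line 2 character

typelist <- list(0, "c")  # template value per line: 0 means numeric, "c" means character
con <- file(tmp, open = "r")
parsed <- vector("list", length(typelist))
for (i in seq_along(typelist)) {
  # Each call reads one line from the connection's current position,
  # converting the fields to the type of the template value.
  parsed[[i]] <- scan(file = con, nlines = 1, what = typelist[[i]], quiet = TRUE)
}
close(con)
unlink(tmp)

str(parsed)
```

Because the connection stays open between calls, each `scan()` picks up where the previous one stopped.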
Nick Dong
1

I suggest you check out chunked and disk.frame. They both have functions for reading in CSVs chunk-by-chunk.

In particular, disk.frame::csv_to_disk.frame may be the function you are after?

xiaodai
  • Also check out the [LaF](https://cran.r-project.org/web/packages/LaF/index.html) package. [chunked](https://cran.rstudio.com/web/packages/chunked/index.html) is actually a wrapper for LaF, which sometimes makes things easier. – San Nov 24 '18 at 14:55
  • `disk.frame` looks great and it includes support for two of my favorite packages - `data.table` and `fst` which are among the most efficient of their kind. Can you kindly point out further documentation/examples of `disk.frame` other than that available in the github page. – San Nov 24 '18 at 16:24
  • @san i am writing them at the moment. You can check out the vignette folder or go into inst/fannie_mae for more examples – xiaodai Nov 24 '18 at 20:46
  • For storing larger-than-RAM data sets, `disk.frame` can be an alternative to `MonetDbLite`. I hope it makes it to CRAN soon. – San Nov 25 '18 at 07:33
  • there's more doc https://github.com/xiaodaigh/disk.frame and a vignette now @San https://rpubs.com/xiaodai/intro-disk-frame – xiaodai Feb 01 '19 at 23:32
0
fileName = "up_down.txt"

### code to get the line count of the file
length_connection = pipe(paste("cat ", fileName, " | wc -l", sep = "")) # "cat fileName | wc -l" because that returns just the line count, and NOT the name of the file with it
long = as.numeric(trimws(readLines(con = length_connection, n = 1)))
close(length_connection) # make sure to close the connection
###

for (i in seq_len(long)){

    ### code to extract a single line at row i from the file
    linn_connection_cmd = paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ") # extracts one line from fileName at the desired line number (i)
    linn_connection = pipe(linn_connection_cmd)
    linn = readLines(con = linn_connection, n = 1)
    close(linn_connection) # make sure to close the conection
    ###
    
    # the line is now loaded into R and anything can be done with it
    print(linn)
}

By using R's pipe() command, and using shell commands to extract what we want, the full file is never loaded into R, and is read in line by line.

paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ")

It is this command that does all the work; it extracts one line from the desired file.

Edit: By default, R renders i as a plain number when it is less than 100,000, but switches to scientific notation once it reaches 100,000 (1e+05). format(x = i, scientific = FALSE, big.mark = "") is therefore used in the pipe command so that the shell always receives the number in plain form, which is the only form head can understand. If the command is given a number such as 1e+05, it fails with the following error:

head: 1e+05: invalid number of lines
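
The formatting issue can be seen without running any shell command. This is a minimal sketch: `i` is set by hand here instead of coming from the loop, and the exact text produced by the plain `paste()` call depends on R's `scipen` option.

```r
i <- 100000

# Plain conversion: with default options this is where scientific notation
# can appear in the command string.
risky_cmd <- paste("head -n", i)

# Forcing plain notation, as done in the answer.
safe_number <- format(x = i, scientific = FALSE, big.mark = "")
safe_cmd <- paste("head -n", safe_number)

print(risky_cmd)
print(safe_cmd)  # "head -n 100000"
```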