
I have a large data file that has 40,000+ lines. It's a list of log inputs, and looks a bit like this:

    D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0      
    D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 
    M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
    M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647 
    F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt" 
    M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s

Since it's so big, I don't want to read the entire thing into memory. I only need lines that begin with the line identifier "F" and have a (0, 0) error, like this:

    F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"

Everything else I can just ignore. My issue is this: I want a way to read this file line by line and evaluate whether or not it needs to keep the line for importation. Currently, I am using a for loop to run through each line and am using the readLines() function. It looks something like this:

library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines(con))
for (i in 1:lineLength){
  line <- readLines("dataSet.txt", 1)
  if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
    print(line)
    Fdata[j,] <- rbind(line)
    i <- i + 1
    j <- j + 1
  }
  i <- i + 1
}
print(Fdata)

It runs fine, but the output it gives me is not what I want. It just keeps on printing the first line of the file over and over.

    [1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
    [1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
    [1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"
    [1] "C 20160525 05:27:47.915 Rotated log file: /var/log/servedat-201605250527.log"

How can I get it to evaluate whether or not I need the line, and how can I store it correctly (as a vector, data frame, matrix, it doesn't really matter) so that I can print it outside the for loop?

UPDATE

I have changed my code to this:

    library(stringr)
    con <- file("dataSet.txt", open = "r")
    Fdata <- data.frame
    i <- 1
    j <- 1
    lineLength <- length(readLines(con))
    for (i in 1:lineLength){
      line <- readLines(con, 1)
      print(line)
      if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)[i]){
        print(line)
        Fdata[j,] <- rbind(line)
        i <- i + 1
        j <- j + 1
      }
      i <- i + 1
    }
    print(Fdata)

However, when I check the value stored in line it says that it is empty. I don't understand why it changed. Additionally, it told me that the if statement did not have a proper TRUE/FALSE condition, which confuses me as well because grepl() should return a TRUE/FALSE value.

UPDATE

I managed to get rid of the error, but I'm still not getting anything when I call Fdata. I checked my variables, and R said that line was empty, that it had no characters. Did I assign it incorrectly? I want line to be the line I am parsing in the data file and evaluating whether I need to store it or not. Here is my updated code:

library(stringr)
con <- file("dataSet.txt", open = "r")
Fdata <- data.frame
i <- 1
j <- 1
lineLength <- length(readLines("dataSet.txt"))
for (i in 1:lineLength){
  line <- readLines(con, 1)
  print(line)
  if (str_sub(line, 1, 1) == 'F' && grepl("\\(0\\,0\\)", line)){
    print(line)
    Fdata[j,] <- rbind(line)
    i <- i + 1
    j <- j + 1
  }
  i <- i + 1
}
print(Fdata) 
stargirl
  • You need to pass the file connection to `readLines` instead of specifying the file name as a string. Changing the first line in the for loop to `line <- readLines(con, 1)` should solve the issue here. – Psidom Jun 20 '16 at 13:17
  • Also, I don't think you need to escape the comma. Use `grepl("\\(0,0\\)", line)`. – Psidom Jun 20 '16 at 13:19
  • The problem is `grepl("\\(0\\,0\\)", line)[i]`: `i` can be a large number, while `grepl` returns a vector of length 1 here. Removing the `[i]` should make it work. – Psidom Jun 20 '16 at 13:43
  • It got rid of the error, but I'm still not getting anything printed when I call Fdata. When I check the line variable, R keeps saying that it is empty. Did I assign it incorrectly? – stargirl Jun 20 '16 at 13:52
  • Another problem is `lineLength <- length(readLines(con))`. A file connection can only be read through once; since you have already read it to the end before the for loop, `con` points to the end of the file and there is nothing left to read inside the loop. Your program also reads the file twice, which defeats the original purpose of reading line by line. Check the answer I am giving below, and replace the file name with your own. – Psidom Jun 20 '16 at 13:57
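
To make the two points in these comments concrete, here is a minimal sketch. The temporary file and its three lines are made up for illustration. It shows that a connection is exhausted once `readLines(con)` has read it to the end, and that indexing a length-1 `grepl()` result with a larger index gives `NA`, which is a likely source of the TRUE/FALSE complaint from `if`.

```r
# Write a tiny three-line file to a temporary location (hypothetical data).
path <- tempfile(fileext = ".txt")
writeLines(c("C first", "F second (0,0)", "M third"), path)

con <- file(path, open = "r")
n_lines <- length(readLines(con))   # reads everything; con is now at end of file
after_count <- readLines(con, 1)    # nothing left to read
close(con)

n_lines              # 3
length(after_count)  # 0, i.e. character(0)

# Indexing a length-1 logical beyond its first element yields NA,
# which `if` cannot interpret as TRUE or FALSE.
hit <- grepl("(0,0)", "F second (0,0)", fixed = TRUE)
hit[1]  # TRUE
hit[2]  # NA
```

So counting the lines first and then looping over the same connection cannot work; looping until `readLines` returns nothing avoids the need for a count at all.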

3 Answers


Check this out:

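con <- file("test1.txt", "r")
lines <- c()
while (TRUE) {
  line <- readLines(con, 1)           # read the next line from the open connection
  if (length(line) == 0) break        # character(0) signals end of file
  # keep lines that start with "F" and contain the literal text "(0,0)"
  if (grepl("^\\s*F{1}", line) && grepl("(0,0)", line, fixed = TRUE)) lines <- c(lines, line)
}
close(con)                            # release the file handle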

lines
# [1] "F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz\" \"\" 3322771022 (0,0) \"1499.61 seconds (17.7 megabits/sec)\""

Pass the file connection to readLines so that it reads one line per call. The regular expression ^\\s*F{1} matches a line that starts with the letter F, optionally preceded by white space; ^ denotes the beginning of the string. fixed = TRUE makes grepl match (0,0) literally rather than as a regular expression. If both checks are TRUE, the line is appended to lines.
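
As a small illustration of the two checks, here is a sketch on a made-up character vector (the four strings are not from the log). It compares matching `(0,0)` with and without `fixed = TRUE`.

```r
x <- c("F ok (0,0)", "  F indented (0,0)", "M has F inside (0,0)", "F 10,0 no parens")

# ^ anchors the match at the start; \\s* allows leading white space.
starts_with_f <- grepl("^\\s*F", x)

# fixed = TRUE treats "(0,0)" as literal text, parentheses included.
literal <- grepl("(0,0)", x, fixed = TRUE)

# As a regular expression, the parentheses form a group, so the pattern
# is effectively "0,0" and also matches the last string.
as_regex <- grepl("(0,0)", x)

x[starts_with_f & literal]
# [1] "F ok (0,0)"         "  F indented (0,0)"
```

Without `fixed = TRUE`, the last string would be kept as well, because it contains `0,0` even though it has no parentheses.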

Data:

D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0      
D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 
M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction
M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647 
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt" 
M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s
F 20160602 14:25:11.321 F7982D50 GET 156.145.15.85:37525 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0077248_HS_EX__1-06808__v3_FCC49HJACXX_L7_p1of1_P1.fastq.gz" "" 3322771022 (0,0) "1499.61 seconds (17.7 megabits/sec)"
Psidom
  • Thank you so so much! This worked great. Just one quick question: since the while loop runs only if the conditions are true, wouldn't that mean it'll stop as soon as it hits a line that makes the condition false? I'm a bit confused as to why it works. Everything else makes sense though. – stargirl Jun 20 '16 at 15:36
  • Yes, that is true, and it is the logic we are using here. We set the condition to always be TRUE, and when we reach the end of the file (which means `readLines` returns a vector of length zero), the keyword `break` takes us out of the loop and stops the while. – Psidom Jun 20 '16 at 15:42
  • FWIW, it is much more efficient to grow a list by modifying in place than to concatenate a vector to itself, which will duplicate the vector in memory for every single assignment. So the vector `lines <- c(lines, line)` would be better off as the list `lines[[length(lines)+1]]<-line`. – Brooks Ambrose Jun 26 '20 at 07:25
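
The list-based accumulation suggested in the last comment could look roughly like this. It is a sketch on a made-up four-line temporary file, not a benchmark; `startsWith()` is assumed to be available (base R 3.3.0 or later).

```r
# Hypothetical sample data written to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("F a (0,0)", "M b", "F c (4,32)", "F d (0,0)"), path)

con <- file(path, open = "r")
kept <- list()
# readLines() returns character(0) at end of file, which ends the loop.
while (length(line <- readLines(con, n = 1)) > 0) {
  if (startsWith(line, "F") && grepl("(0,0)", line, fixed = TRUE)) {
    kept[[length(kept) + 1]] <- line   # append to the list in place
  }
}
close(con)

kept <- unlist(kept)                   # flatten to a character vector
kept
# [1] "F a (0,0)" "F d (0,0)"
```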

If you have enough memory, 40,000 lines shouldn't be too much for R to handle. For performance reasons, it is better to read in all of the lines at once and use R's vectorized operations to analyze the results.

Your code can be simplified to this:

library(stringr)

line <- readLines("dataSet.txt")

foundset<-line[which(str_sub(line, 1, 1) == 'F' & grepl("(0,0)", line, fixed = TRUE))]
#rm("line")  #include this line to free up memory if there is a concern

This reads in all of the lines and keeps the ones that begin with the letter "F" and contain (0,0). All of those lines are in the vector foundset.

Dave2e
  • Part of my assignment is that I try and not read the entire data set into R. I've done it that way, but the goal was to learn how to read line by line. Part of the reason is efficiency and performance issues. Also, they want me to write out a loop for this because I will be dealing with millions of lines of code soon, and I will have to parse through and only store the lines I need. Thank you for your suggestion though. – stargirl Jun 20 '16 at 13:42
  • The `disk.frame` package was designed to do this chunking automatically. https://diskframe.com – Brooks Ambrose Jun 26 '20 at 07:20
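
A middle ground between reading everything at once and reading one line per call is to read the file in chunks with base R. The sketch below assumes a hypothetical helper name, `filter_f_lines`, and an arbitrary default chunk size; each chunk is filtered with vectorized operations, and only the matching lines are kept in memory.

```r
# Hypothetical helper: keep "F" lines that contain "(0,0)", reading the file
# in chunks of `chunk_size` lines so only one chunk is held at a time.
filter_f_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                  # close the connection even on error
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break      # end of file
    keep <- startsWith(chunk, "F") & grepl("(0,0)", chunk, fixed = TRUE)
    kept[[length(kept) + 1]] <- chunk[keep]
  }
  as.character(unlist(kept))           # character(0) if nothing matched
}

# Made-up sample data; a tiny chunk size forces several reads.
path <- tempfile(fileext = ".txt")
writeLines(c("F a (0,0)", "M b", "F c (4,32)", "D d", "F e (0,0)"), path)
result <- filter_f_lines(path, chunk_size = 2L)
result
# [1] "F a (0,0)" "F e (0,0)"
```

The chunk size trades memory for the number of `readLines` calls; the value to use for a file with millions of lines is something to measure rather than assume.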

Something like this answer (What is a good way to read line-by-line in R?) would also work:

cat('  D 20160602 14:15:43.559 F7982D62 Req Agr:131 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0',      
    'D 20160602 14:15:43.559 F7982D62 Set Agr:130 Mra:0 Exp:0 Mxr:0 Mnr:0 Mxd:0 Mnd:0 Nro:0 I 20160602 14:15:43.559 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" ""',
    'M 20160602 14:15:43.595 DOC1: F7982D62 Request for unencrypted meta data on encrypted transaction',
    'M 20160602 14:15:48.353 DOC1: F7982D62 Transaction has been acknowledged at 722875647',
    'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"',
    'M 20160602 14:15:48.780 DOC1: F7982D63 New download request D 20160602 14:15:48.780 F7982D63 META: 134 Path: /pcgc/public/CTD/exome/fastq/PCGC0033175_HS_EX__1-00304-01__v1_FCBC0RE4ACXX_L3_p32of96_P2.fastq.gz user: xqixh8sl pack: arg: feat: cE,s',
    'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt" (0,0)',
    file="test",
    sep="\n")


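library(stringr)
con <- file("test", open = "r")
res <- c()

# readLines() returns character(0) at end of file, which ends the loop
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # fixed = TRUE matches "(0,0)" literally; as a regex it would match any "0,0"
  if (substr(str_trim(oneLine), 1, 1) == "F" && grepl("(0,0)", oneLine, fixed = TRUE)) {
    res <- c(res, oneLine)
  }
}

close(con)
res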
[1] "F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES \"/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz\" \"\" 50725464 (4,32) \"Remote Application: Session Aborted: Aborted by user interrupt\" (0,0)"

Note that I added that last line in there on purpose to show how the while loop works.

Mike H.
  • Could you explain the cat() function you used? Also, Since it's such a large file, could I just use the file name and separator and not copy in all the lines as a string? – stargirl Jun 20 '16 at 15:02
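
On the `cat()` question, here is a minimal sketch of what that call does, using a temporary file and two made-up lines: each string argument is written to the file given by `file=`, separated by `sep`. It only exists to create a small test file; with a real log you would pass its name straight to `file()`.

```r
path <- tempfile()

# Each argument becomes one line because sep = "\n" is placed between them.
cat("first line", "second line", file = path, sep = "\n")

# cat() adds no newline after the last argument, so readLines() would warn
# about an incomplete final line; warn = FALSE suppresses that.
written <- readLines(path, warn = FALSE)
written
# [1] "first line"  "second line"

# For an existing file there is nothing to write first, e.g.:
# con <- file("dataSet.txt", open = "r")
```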