
I work in data management, meaning people give me raw data and I have to format and parse it to get the pieces I need and organize it in a way that makes sense. Currently the data I'm working with is a log file, but I have opened and saved it as a text file. It looks a bit like this:

M 20160525 09:51:11.822 DOC1: Clearing stale DENIED send to 1864130A.62274 in 13 after 39411ms

D 20160525 09:51:11.824 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" Done

M 20160525 09:51:11.825 DOC1: F798257E Transaction has been acknowledged at 15804727

F 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" 441 (0,0) "0.10 seconds (36.8 kilobits/sec)" D 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" - "Freeing Package Unit"

It's quite a large file, and I don't wish to import the entire thing into R, mainly because of the space it takes up. Each line has "fields" (what I want to organize and separate), designated as follows:

  1. F -- identifier of the line
  2. 20160525 -- date (yyyymmdd)
  3. 17:52:38.791 -- timestamp (HH:MM:SS.sss)
  4. F798259D -- transfer identifier
  5. 156.145.15.85:46634 -- IP address and related port
  6. xqixh8sl -- username
  7. AES -- encryption level (could be - (dash))
  8. "/pcgc...fastq.gz" -- transferred file (in ")
  9. "" -- additional string (should be empty "")
  10. 2951144113 -- transferred bytes
  11. (0,0) -- error
  12. "2289.47 seconds (10.3 megabits/sec)" -- data about the transfer

The only lines I need are the ones that start with F and have a (0,0) error. Here is an example line:

F 20160525 17:52:38.791 F798259D GET 156.145.15.85:46634 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0053681_HS_EX__1-02598__v1_FCAD18P7ACXX_L8_p92of93_P1.fastq.gz" "" 2951144113 (0,0) "2289.47 seconds (10.3 megabits/sec)"

And I would NOT consider a line like this:

F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"

The line above has a (4,32) error rather than (0,0), so it would not be considered.

My question is this: since the file is so large, I want to be able to parse through it and pick out only the lines I need beforehand. Then, once I import it, I want the best way to organize it neatly. I know there are a variety of ways to read the file (I have been trying readLines() and scan()), but I don't know how to write the conditional statement (the line must start with F and must have a (0,0) error).

I have tried a variety of things:

  1. Used scan() to import the entire file into R as a list.

    x <- scan("dataSet.txt",
              what = list(lineID = "", date = "", timestamp = "", transferID = "",
                          IP = "", username = "", encryption = "",
                          transferredFile = "", error = "", data = ""),
              sep = " ", fill = TRUE, strip.white = TRUE)

    logs <- list(x)

    logs

While I liked the numbering and rows, it left out a lot of fields that I needed. This is the output it gave me:

[9062] ""
[9063] ""
[9064] ""
[9065] ""
[9066] ""
[9067] ""
[9068] ""
[9069] ""
[9070] ""
[9071] ""
[9072] ""
[9073] "Mnr:0"
[9074] ""
[9075] "Mnr:0"
[9076] ""
[9077] ""
[9078] "data"
[9079] ""
[9080] "2,"
[9081] "12,"
[9082] ""
[9083] ""
[9084] "550F919C.60099"

  2. I found an example of this online, so I copied it and tried to use it similarly. However, it did not give me what I wanted. If someone could explain how this works, that would also be greatly appreciated. Note that the way I used it also imported the entire file.

> setwd("/Users/kimm5w/Intern Work")

> dataset <- list()

> con <- file("dataSet.txt")

> open(con)

> dataset <- grep("F", scan("dataSet.txt", what = list(lineID = "", date = "", timestamp = "", transferID = "", IP = "", username = "", encryption = "", transferredFile = "", error = "", data = ""), sep = " ", fill = TRUE, strip.white = TRUE), perl = TRUE, value = TRUE)

> dataset

This is the output it gave me, which was not the format I wanted:

\"[0]\", \"\", \"xqixh8sl:\", \"\", \"\", \"\", \"\", \"Mnr:0\", \"\", \"Mnr:0\", \"\", \"\", \"data\", \"\", \"at\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"Mnr:0\", \"\", \"Mnr:0\", \"\", \"\", \"data\", \"\", \"at\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"Mnr:0\", \"\", \"Mnr:0\", \"\", \"\", \"data\", \"\", \"550F919C.36474\", \"\", \"550F919C.42385\", \"\", \"550F919C.49879\", \"\", \"550F919C.53923\", \"\", \"6,\", \"18,\", \"\", \"550F919C.36773\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"at\", \"\", \"\", \"\", \"\", \"\", \n\"\", \"\", \"550F919C.37525\", \"\", \"6,\", \"18,\", \"\")"

I'm fairly new at R; I learned Java and though the concepts are similar, the syntax is unfamiliar. If anyone can help me with this, please do! I've been working on this for about a week and can't figure it out. Thank you for your help!

UPDATE

Here's what I've tried so far after going through your suggestions:

    setwd("/Users/kimm5w/Intern Work")
    df<-data.frame(readLines("dataSet.txt"))
    F_dataSet <- grep("^F.*(0,0)", "dataSet.txt")
    F_dataSet

    library(stringr)
    x = 0
    while(x < length(readLines("dataSet.txt"))){
      line <- readLines("dataSet.txt")
      if (str_sub(line, 1, 1) == 'F' & grepl('\\(0\\,0\\)', line)[1]){
        F_data <- c(F_data, line)
        }
    }
    display(F_data)

For some reason when I try and run it in R, it doesn't display the result. However, it does run without error. My question is if one of these will work. I can't use a for loop because the exact number of lines isn't known. So instead, I tried using a while loop in the second version. The link was helpful, but a bit confusing because I wasn't familiar with the syntax. If someone could explain each section I think it would be easier to understand. On the first attempt, I just tried using grep() to sort out the lines I needed, but I'm not sure if it worked. If anyone can help out from here, that would be very much appreciated. And to those that sent me answers, thank you too. This has helped me a lot, and is the most progress I've made in a while.

Here's another update. It runs fine, but for some reason the while loop does not print anything. F_data does not show up when I try to display it. Could someone point out where the error is?

    setwd("/Users/kimm5w/Intern Work")
    F_data <- data.frame
    print(F_data)
    library(stringr)
    x <- length(readLines("dataSet.txt"))
    print(x)
    while(x != 0)
      {
      line <- readline("dataSet.txt")
      print(line)
      if (str_sub(line, 1, 1) == 'F' & grepl('\\(0\\,0\\)', line)[1]){
        F_data <- c(F_data, line)
        print(F_data)
      }
      x <- x + 1
    }
    close(con)
    F_data
stargirl
  • Hi, what about using readLines to read lines one by one within a for loop, and only save to a variable or file the ones that qualify under your criteria? I think this would help at least with the task of filtering the relevant lines. – Matias Thayer Jun 14 '16 at 13:08
  • But what I'm confused on is if the file isn't even read into R, how am I supposed to tell the computer that I want it to look at the first character of each line? My first impulse is to use the substring function, but I don't think it works unless the file is completely imported. I've also looked at using regular expressions to match the format of the lines I want, but I'm not familiar with those either. I'm having the most trouble with actually writing the code and syntax; I know how I want to go about this, but am not entirely sure of the way I need to write. – stargirl Jun 14 '16 at 13:31
  • Hi, I put the reply as an answer, formating code in here was hard :) – Matias Thayer Jun 14 '16 at 14:21

2 Answers


Perhaps this is a cop out, but if you are concerned about conserving memory during your R session, don't do the filtering in R at all: preprocess the file with the command-line tool grep before reading it in.

grep "^F.*(0,0)" dataSet.txt > processed_dataSet.txt
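Once grep has produced the much smaller filtered file, it can be read into R with read.table, which splits on whitespace but keeps quoted fields (the path and the transfer stats) together. A sketch, using one qualifying line from the question in place of the real file; the 13 column names are guesses based on the field list in the question, not anything the log format prescribes:

```r
# One qualifying line from the question, standing in for the filtered file;
# for the real thing, replace `text = line` with "processed_dataSet.txt".
line <- 'F 20160525 17:52:38.791 F798259D GET 156.145.15.85:46634 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0053681_HS_EX__1-02598__v1_FCAD18P7ACXX_L8_p92of93_P1.fastq.gz" "" 2951144113 (0,0) "2289.47 seconds (10.3 megabits/sec)"'

# read.table splits on whitespace and strips the surrounding quotes;
# the column names below are illustrative guesses.
logs <- read.table(text = line, stringsAsFactors = FALSE,
                   col.names = c("lineID", "date", "timestamp", "transferID",
                                 "request", "IP", "username", "encryption",
                                 "file", "extra", "bytes", "error", "transfer"))
logs$error  # "(0,0)"
```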
andrew
  • I've been using TextWrangler or TextEdit on my mac to transfer the data. Should I switch to something else, or is one of these able to process the data beforehand? – stargirl Jun 14 '16 at 14:25
  • use `grep`. Way faster. – andrew Jun 14 '16 at 14:28
  • In the processor, then. Do you know any good ones that are compatible with RStudio? I just want to try it out with a chunk of the data, but none of the consoles I used to open the log file seem to be able to parse through the data using grep(). – stargirl Jun 14 '16 at 14:41
  • I am talking about the unix command-line tool called `grep`, not the R function called `grep()`. Do you have a unix-like terminal? – andrew Jun 14 '16 at 15:13

Let's say you read the file one line at a time, using the readLines function and a for loop or something else. Then you can use a simple search to see whether the line starts with "F" and contains "(0,0)". For instance:

library(stringr)
relevant_guys <- character(0)  # initialise the result vector before appending
line <- 'F 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" 441 (0,0) "0.10 seconds (36.8 kilobits/sec)" D 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" - "Freeing Package Unit"'

if(str_sub(line,1,1)=='F' & grepl('\\(0\\,0\\)', line)[1]){
    relevant_guys<-c(relevant_guys, line)
}

This way you don't have to hold the whole file in memory; you evaluate it line by line.
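As a concrete sketch of the whole loop (base R's substr in place of stringr, and a tiny demo file standing in for the real dataSet.txt; the demo lines are shortened versions of lines from the question):

```r
# Demo input: one good F transfer, one aborted F transfer, one M status line
# (paths shortened here purely for the demo).
writeLines(c(
  'F 20160525 17:52:38.791 F798259D GET 156.145.15.85:46634 xqixh8sl AES "/pcgc/demo1.fastq.gz" "" 2951144113 (0,0) "2289.47 seconds (10.3 megabits/sec)"',
  'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/demo2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted"',
  'M 20160525 09:51:11.825 DOC1: F798257E Transaction has been acknowledged at 15804727'
), "dataSet.txt")

# Open a connection and read one line at a time, so the whole file
# never has to sit in memory at once. readLines(con, n = 1) returns
# a zero-length vector at end of file, which ends the loop.
con <- file("dataSet.txt", open = "r")
F_data <- character(0)
while (length(line <- readLines(con, n = 1)) > 0) {
  if (substr(line, 1, 1) == "F" && grepl("(0,0)", line, fixed = TRUE)) {
    F_data <- c(F_data, line)
  }
}
close(con)
length(F_data)  # 1: only the first demo line qualifies
```

Note the `fixed = TRUE` in grepl, which treats "(0,0)" as a literal string so the parentheses don't need escaping.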

Matias Thayer
  • Could I do it using a while loop instead? The number of data lines fluctuates depending on which console I use to open it, so I'm not entirely sure exactly how many times I'll need to use readLines(). – stargirl Jun 14 '16 at 14:39
  • Here is a nice example using a while: http://stackoverflow.com/questions/4106764/what-is-a-good-way-to-read-line-by-line-in-r – Matias Thayer Jun 14 '16 at 14:56