I work in data management, meaning people give me raw data and I have to format and parse it to get the pieces I need and organize it in a way that makes sense. Currently the data I'm working with is a log file, but I have opened and saved it as a text file. It looks a bit like this:
M 20160525 09:51:11.822 DOC1: Clearing stale DENIED send to 1864130A.62274 in 13 after 39411ms
D 20160525 09:51:11.824 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" Done
M 20160525 09:51:11.825 DOC1: F798257E Transaction has been acknowledged at 15804727
F 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" 441 (0,0) "0.10 seconds (36.8 kilobits/sec)"
D 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" - "Freeing Package Unit"
It's quite a large file, and I don't wish to import the entire thing into R mainly because of the amount of space it takes up. Each line has "fields" (what I want to organize and separate) that are designated as the following:
- F -- identifier of the line
- 20160525 -- date (yyyymmdd)
- 17:52:38.791 -- timestamp (HH:MM:SS.sss)
- F798259D -- transfer identifier
- 156.145.15.85:46634 -- IP address and related port
- xqixh8sl -- username
- AES -- encryption level (could be - (dash))
- "/pcgc...fastq.gz" -- transferred file (in ")
- "" -- additional string (should be empty "")
- 2951144113 -- transferred bytes
- (0,0) -- error
- "2289.47 seconds (10.3 megabits/sec)" -- data about the transfer
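To make the layout concrete, here is a minimal sketch of how one line in this format might be split into the fields listed above. It assumes that only the quoted fields can contain spaces. The function name `split_fields` and the column names are hypothetical, and "operation" is my own label for the GET token, which is not in the list above.

```r
# Example line copied from this post.
line <- 'F 20160525 17:52:38.791 F798259D GET 156.145.15.85:46634 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0053681_HS_EX__1-02598__v1_FCAD18P7ACXX_L8_p92of93_P1.fastq.gz" "" 2951144113 (0,0) "2289.47 seconds (10.3 megabits/sec)"'

# A token is either a double-quoted string (which may contain spaces)
# or a run of non-space characters.
split_fields <- function(x) {
  tokens <- regmatches(x, gregexpr('"[^"]*"|\\S+', x, perl = TRUE))[[1]]
  gsub('^"|"$', "", tokens)  # drop the surrounding quotes
}

# Hypothetical column names (assumption, not part of the log format).
field_names <- c("lineID", "date", "timestamp", "transferID", "operation",
                 "address", "username", "encryption", "file", "extra",
                 "bytes", "error", "info")

fields <- setNames(split_fields(line), field_names)
print(fields[c("date", "username", "bytes", "error", "info")])
```

Applied to every kept line, the resulting vectors could be stacked with `do.call(rbind, ...)` to get one row per transfer.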
The only lines I need are the ones that start with F and have a (0,0) error. Here is an example line:
F 20160525 17:52:38.791 F798259D GET 156.145.15.85:46634 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0053681_HS_EX__1-02598__v1_FCAD18P7ACXX_L8_p92of93_P1.fastq.gz" "" 2951144113 (0,0) "2289.47 seconds (10.3 megabits/sec)"
And I would NOT consider a line like this:
F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"
The line above has a (4,32) error rather than (0,0), so it would not be considered.
My question is this: since the file is so large, I want to be able to parse through it and pick out only the lines I need before importing it. Then, once it is imported, I want the best way to organize it neatly. I know there are a variety of ways to read the file (I have been trying readLines() and scan()), but I don't know how to write the conditional statement (the line must start with F and must have a (0,0) error).
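For reference, the kind of thing I am after might look roughly like the sketch below: read the file through a connection in chunks and keep only the matching lines, so the whole file is never in memory at once. The names `keep_line` and `read_matching` and the chunk size are hypothetical. Note that the check looks for "(0,0)" anywhere on the line, which assumes that text never appears inside a file name.

```r
# TRUE for lines that start with "F " and contain the literal text "(0,0)".
keep_line <- function(x) {
  startsWith(x, "F ") & grepl("(0,0)", x, fixed = TRUE)
}

# Read `path` in chunks of `chunk_size` lines; only matching lines are kept.
read_matching <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break  # end of file reached
    kept <- c(kept, chunk[keep_line(chunk)])
  }
  kept
}

# Small demonstration on a temporary file built from lines in this post.
demo <- tempfile(fileext = ".txt")
writeLines(c(
  'M 20160525 09:51:11.825 DOC1: F798257E Transaction has been acknowledged at 15804727',
  'F 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" 441 (0,0) "0.10 seconds (36.8 kilobits/sec)"',
  'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"'
), demo)

matches <- read_matching(demo, chunk_size = 2L)
print(length(matches))
```

Because the connection stays open between calls, each `readLines()` call continues where the previous one stopped, so the number of lines does not need to be known in advance.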
I have tried a variety of things:
- Used scan() to import the entire file into R as a list.
x <- scan("dataSet.txt", what = list(lineID = "", date = "", timestamp = "", transferID = "", IP = "", username = "", encryption = "", transferredFile = "", error = "", data = ""), sep = " ", fill = TRUE, strip.white = TRUE)
logs <- list(x)
logs
While I liked the numbering and rows, it left out a lot of fields that I needed. This is the output it gave me:
[9062] ""
[9063] ""
[9064] ""
[9065] ""
[9066] ""
[9067] ""
[9068] ""
[9069] ""
[9070] ""
[9071] ""
[9072] ""
[9073] "Mnr:0"
[9074] ""
[9075] "Mnr:0"
[9076] ""
[9077] ""
[9078] "data"
[9079] ""
[9080] "2,"
[9081] "12,"
[9082] ""
[9083] ""
[9084] "550F919C.60099"
- I found an example of this online, so I copied it and tried to use it in a similar way. However, it did not give me what I wanted, and the way I used it still imported the entire file. If someone could explain how this works, that would also be greatly appreciated.
> setwd("/Users/kimm5w/Intern Work")
> dataset <- list()
> con <- file("dataSet.txt")
> open(con)
> dataset <- grep("F", scan("dataSet.txt", what = list(lineID = "", date = "", timestamp = "", transferID = "", IP = "", username = "", encryption = "", transferredFile = "", error = "", data = ""), sep = " ", fill = TRUE, strip.white = TRUE), perl = TRUE, value = TRUE)
> dataset
This is the output it gave me, which was not the format I wanted:
\"[0]\", \"\", \"xqixh8sl:\", \"\", \"\", \"\", \"\", \"Mnr:0\", \"\", \"Mnr:0\", \"\", \"\", \"data\", \"\", \"at\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"Mnr:0\", \"\", \"Mnr:0\", \"\", \"\", \"data\", \"\", \"at\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"Mnr:0\", \"\", \"Mnr:0\", \"\", \"\", \"data\", \"\", \"550F919C.36474\", \"\", \"550F919C.42385\", \"\", \"550F919C.49879\", \"\", \"550F919C.53923\", \"\", \"6,\", \"18,\", \"\", \"550F919C.36773\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"at\", \"\", \"\", \"\", \"\", \"\", \n\"\", \"\", \"550F919C.37525\", \"\", \"6,\", \"18,\", \"\")"
I'm fairly new at R; I learned Java and though the concepts are similar, the syntax is unfamiliar. If anyone can help me with this, please do! I've been working on this for about a week and can't figure it out. Thank you for your help!
UPDATE
Here's what I've tried so far after going through your suggestions:
setwd("/Users/kimm5w/Intern Work")
df<-data.frame(readLines("dataSet.txt"))
F_dataSet <- grep("^F.*(0,0)", "dataSet.txt")
F_dataSet
library(stringr)
x = 0
while(x < length(readLines("dataSet.txt"))){
line <- readLines("dataSet.txt")
if (str_sub(line, 1, 1) == 'F' & grepl('\\(0\\,0\\)', line)[1]){
F_data <- c(F_data, line)
}
}
display(F_data)
For some reason, when I try to run it in R, it doesn't display the result, although it does run without error. My question is whether either of these will work. I can't use a for loop because the exact number of lines isn't known, so I tried a while loop in the second version. The link was helpful, but a bit confusing because I wasn't familiar with the syntax. If someone could explain each section, I think it would be easier to understand. In the first attempt, I just tried using grep() to pick out the lines I needed, but I'm not sure if it worked. If anyone can help out from here, that would be very much appreciated. And to those who sent me answers, thank you too. This has helped me a lot, and is the most progress I've made in a while.
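A small sketch of what may be going wrong with the grep() attempt, under the assumption that the goal is to search the file's contents. Two lines copied from this post stand in for the file, and the variable names are hypothetical.

```r
# Two lines from this post stand in for the contents of the file.
lines <- c(
  'D 20160525 09:51:11.824 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" Done',
  'F 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" 441 (0,0) "0.10 seconds (36.8 kilobits/sec)"'
)

# This searches the file NAME, i.e. the string "dataSet.txt", not the file.
by_name <- grep("^F.*(0,0)", "dataSet.txt")

# Unescaped, "(0,0)" is a regex group that matches the text "0,0",
# so the parentheses themselves are not required to be present.
loose <- grepl("^F.*(0,0)", "F 0,0")

# Escaped parentheses match literally; value = TRUE returns the lines
# themselves instead of their positions.
by_content <- grep("^F .*\\(0,0\\)", lines, value = TRUE)

print(by_name)
print(loose)
print(by_content)
```

With the real file, `lines` would come from `readLines()` rather than being typed in.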
Here's another update. It runs fine, but for some reason the while loop does not print anything. F_data does not show up when I try to display it. Could someone point out where the error is?
setwd("/Users/kimm5w/Intern Work")
F_data <- data.frame
print(F_data)
library(stringr)
x <- length(readLines("dataSet.txt"))
print(x)
while(x != 0)
{
line <- readline("dataSet.txt")
print(line)
if (str_sub(line, 1, 1) == 'F' & grepl('\\(0\\,0\\)', line)[1]){
F_data <- c(F_data, line)
print(F_data)
}
x <- x + 1
}
close(con)
F_data