I have read a file and kept only the lines I need. I saved these lines as a data frame and am now looking for a way to split them into one column per field. The code that builds the data frame is below:

con <- file("dataSet.txt", "r")
lines <- c()
while (TRUE) {
  line <- readLines(con, 1)
  if (length(line) == 0) break
  # keep lines that start with "F" and contain the literal "(0,0)"
  if (grepl("^\\s*F", line) && grepl("(0,0)", line, fixed = TRUE))
    lines <- c(lines, line)
}
close(con)
lines <- data.frame(lines)
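For reference, the same filter can probably be written without the loop, since `readLines` and `grepl` are both vectorised. This is only a sketch: the temporary file stands in for dataSet.txt, and the second and third sample lines are made up to show lines that get dropped.

```r
# Assumption: a small temporary file stands in for dataSet.txt.
path <- tempfile(fileext = ".txt")
writeLines(c(
  # copied from the printed output in this question
  'F 20160525 08:22:06.838 F798256B GET 10.199.194.38:57708 wei2dt - "" "*li" 264 (0,0) "1.62 seconds (1.30 kilobits/sec)"',
  # made-up lines: one that does not start with F, one with a non-zero error
  'S 20160525 08:22:07.000 some other kind of line',
  'F 20160525 08:23:00.000 F798256D GET 10.19.105.15:57708 wei2dt - "x" "" 10 (1,5) "failed"'
), path)

# Read everything at once, then filter with vectorised grepl() calls.
all_lines <- readLines(path)
keep <- grepl("^\\s*F", all_lines) & grepl("(0,0)", all_lines, fixed = TRUE)

# stringsAsFactors = FALSE keeps the lines as plain character strings.
lines <- data.frame(lines = all_lines[keep], stringsAsFactors = FALSE)
```

Reading the whole file at once should be fine for a file of roughly 40,000 lines, and it avoids growing `lines` one element at a time.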

When lines is printed, it is displayed like this:

[1] F 20160525 08:22:06.838 F798256B GET 10.199.194.38:57708 wei2dt - "" "*li" 264 (0,0) "1.62 seconds (1.30 kilobits/sec)"                       
[2] F 20160525 08:28:26.920 F798256C GET 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp-sv.tmp" "*dl" 69 (0,0) "0.29 seconds (1.93 kilobits/sec)" 
[3] F 20160525 08:28:26.933 F798256E GET 10.19.105.15:57708 wei2dt - "CG0009-1364_GT_report.txt" "*dl" 34 (0,0) "0.01 seconds (34.0 kilobits/sec)"
[4] F 20160525 08:28:26.941 F798256F GET 10.19.105.15:57708 wei2dt - "./" "*li" 89 (0,0) "0.01 seconds (102 kilobits/sec)"                        
[5] F 20160525 08:29:12.717 7798256B SEND 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp" "" 1019692009 (0,0) "38.05 seconds (214 megabits/sec)"  

1741 Levels: F 20160525 08:22:06.838 F798256B GET 10.199.194.38:57708 wei2dt - "" "*li" 264 (0,0) "1.62 seconds (1.30 kilobits/sec)"

However, I would like to split lines into multiple columns so that each space-separated field is in its own column. Specifically, I want to split it into 13 columns labelled:

"Line ID"
"Date"
"Timestamp"
"Transfer ID"
""
"IP Address"
"Username"
"Encryption Level"
"Transferred File"
""
"Transferred Bytes"
"Error"
"Transfer Time Data"

The ones with blank strings indicate columns that I do not want to name. I want to split the rest into the columns above like so:

  1. F -- identifier of the line

  2. 20160525 -- date (yyyymmdd)

  3. 17:52:38.791 -- timestamp (HH:MM:SS.sss)

  4. F798259D -- transfer identifier

  5. 156.145.15.85:46634 -- IP address and related port

  6. xqixh8sl -- username

  7. AES -- encryption level (could be - (dash))

  8. "/pcgc...fastq.gz" -- transferred file (in ")

  9. "" -- additional string (should be empty "")

  10. 2951144113 -- transferred bytes

  11. (0,0) -- error (only consider lines with 0,0 for now)

  12. "2289.47 seconds (10.3 megabits/sec)" -- data about the transfer

Thank you for your help in advance.

UPDATE

As requested, I will put the result of dput(head(lines, 10)) below.

"F 20160531 14:19:11.085 F7982871 GET 146.203.126.246:31947 xricf4xj AES \"/pcgc/public/Other/transcriptome/fastq/PCGC0069603_HS_TX__1-05846__v1_FC882_L2_p9of16_P2.fastq.gz\" \"\" 551700712 (0,0) \"12.42 seconds (355 megabits/sec)\"" 
"F 20160531 14:19:24.085 F7982872 GET 146.203.126.246:20198 xricf4xj AES \"/pcgc/public/Other/transcriptome/fastq/PCGC0069749_HS_TX__1-04056__v1_FC01060_L1_p3of12_P2.fastq.gz\" \"\" 592956993 (0,0) \"12.98 seconds (365 megabits/sec)\"" 
"F 20160531 14:20:04.881 F7982873 GET 146.203.126.246:37792 xricf4xj AES \"/pcgc/public/Other/transcriptome/fastq/PCGC0065337_HS_TX__1-02281__v1_FC504_L5_p4of6_P2.fastq.gz\" \"\" 1787507416 (0,0) \"40.76 seconds (351 megabits/sec)\""
"F 20160531 14:20:10.763 F7982874 GET 146.203.126.246:5683 xricf4xj AES \"/pcgc/public/Other/transcriptome/fastq/PCGC0065271_HS_TX__1-02626__v1_FC412_L1_p6of6_P2.fastq.gz\" \"\" 235573426 (0,0) \"5.86 seconds (321 megabits/sec)\"" 
"F 20160531 14:20:24.142 F7982875 GET 146.203.126.246:52946 xricf4xj AES \"/pcgc/public/CTD/transcriptome/fastq/PCGC0069557_HS_TX__1-00738__v1_FC864_L1_p3of7_P2.fastq.gz\" \"\" 619011108 (0,0) \"13.34 seconds (371 megabits/sec)\"" 
"F 20160531 14:20:36.823 F7982876 GET 146.203.126.246:12531 xricf4xj AES \"/pcgc/public/CTD/transcriptome/fastq/PCGC0065398_HS_TX__1-01907__v1_FC718_L1_p2of10_P1.fastq.gz\" \"\" 539231282 (0,0) \"12.63 seconds (341 megabits/sec)\"" 
"F 20160531 14:21:10.955 F7982877 GET 146.203.126.246:2531 xricf4xj AES \"/pcgc/public/LVOTO/transcriptome/fastq/PCGC0065300_HS_TX__1-00652__v1_FC437_L3_p1of6_P2.fastq.gz\" \"\" 1545568612 (0,0) \"34.10 seconds (363 megabits/sec)\"" 
"F 20160531 14:21:20.721 F7982878 GET 146.203.126.246:16699 xricf4xj AES \"/pcgc/public/Other/transcriptome/fastq/PCGC0065413_HS_TX__1-01894__v1_FC718_L1_p6of10_P1.fastq.gz\" \"\" 452830134 (0,0) \"9.73 seconds (372 megabits/sec)\""
"F 20160531 14:21:26.191 F7982879 GET 146.203.126.246:54154 xricf4xj AES \"/pcgc/public/Other/transcriptome/fastq/PCGC0065397_HS_TX__1-01894__v1_FC711_L2_p6of10_P2.fastq.gz\" \"\" 267729030 (0,0) \"5.45 seconds (393 megabits/sec)\""
"F 20160531 14:21:41.752 F798287A GET 146.203.126.246:55620 xricf4xj AES \"/pcgc/public/Other/transcriptome/fastq/PCGC0069744_HS_TX__1-05476__v1_FC971_L2_p1of12_P2.fastq.gz\" \"\" 670588883 (0,0) \"15.54 seconds (345 megabits/sec)\""
stargirl
  • please share a sample of your data.frame using `dput`. – lmo Jun 22 '16 at 12:25
  • I've already provided a sample above. That is how it shows up on my screen when I ask for the first five lines. The file is big (40,000+) so I didn't give the entire thing. – stargirl Jun 22 '16 at 12:34
  • Use `dput` to extract a subsample, as I said, so that we can read it in. If it were a simpler data.frame this would not be a problem, but that is not the case here. Try copying and pasting the result of `dput(head(df, 10))` into your question. Take a look at these tips on how to produce a [minimal, complete and verifiable example](http://stackoverflow.com/help/mcve), as well as this post on [creating a great example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – lmo Jun 22 '16 at 12:37
  • I hope this will get you started. https://cran.r-project.org/web/packages/rex/vignettes/log_parsing.html – user5249203 Jun 23 '16 at 13:30
  • You have more similar posts here http://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns. – user5249203 Jun 23 '16 at 20:39

2 Answers

Looks like a server log; you may try readr::read_log:

library(readr)
txt <- readLines(n=5)
F 20160525 08:22:06.838 F798256B GET 10.199.194.38:57708 wei2dt - "" "*li" 264 (0,0) "1.62 seconds (1.30 kilobits/sec)"                       
F 20160525 08:28:26.920 F798256C GET 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp-sv.tmp" "*dl" 69 (0,0) "0.29 seconds (1.93 kilobits/sec)" 
F 20160525 08:28:26.933 F798256E GET 10.19.105.15:57708 wei2dt - "CG0009-1364_GT_report.txt" "*dl" 34 (0,0) "0.01 seconds (34.0 kilobits/sec)"
F 20160525 08:28:26.941 F798256F GET 10.19.105.15:57708 wei2dt - "./" "*li" 89 (0,0) "0.01 seconds (102 kilobits/sec)"                        
F 20160525 08:29:12.717 7798256B SEND 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp" "" 1019692009 (0,0) "38.05 seconds (214 megabits/sec)" 
read_log(paste(txt, collapse="\n"))
#     X1       X2           X3       X4   X5                  X6     X7   X8                        X9
# 1 FALSE 20160525 08:22:06.838 F798256B  GET 10.199.194.38:57708 wei2dt <NA>                          
# 2 FALSE 20160525 08:28:26.920 F798256C  GET  10.19.105.15:57708 wei2dt <NA>  isi_audit_log.dmp-sv.tmp
# 3 FALSE 20160525 08:28:26.933 F798256E  GET  10.19.105.15:57708 wei2dt <NA> CG0009-1364_GT_report.txt
# 4 FALSE 20160525 08:28:26.941 F798256F  GET  10.19.105.15:57708 wei2dt <NA>                        ./
# 5 FALSE 20160525 08:29:12.717 7798256B SEND  10.19.105.15:57708 wei2dt <NA>         isi_audit_log.dmp
#   X10        X11   X12                              X13  X14  X15  X16  X17  X18  X19  X20  X21  X22
# 1 *li        264 (0,0) 1.62 seconds (1.30 kilobits/sec)                                             
# 2 *dl         69 (0,0) 0.29 seconds (1.93 kilobits/sec)      <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 3 *dl         34 (0,0) 0.01 seconds (34.0 kilobits/sec) <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 4 *li         89 (0,0)  0.01 seconds (102 kilobits/sec)                                             
# 5     1019692009 (0,0) 38.05 seconds (214 megabits/sec) <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#    X23  X24  X25  X26  X27  X28  X29  X30  X31  X32  X33  X34  X35  X36
# 1                                                                      
# 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 4                                                                      
# 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
lukeA
  • It is a log file, but I've already saved it as a text file. The dataSet.txt file is what I'm reading. – stargirl Jun 22 '16 at 14:30
  • When I run it there's an error because it says that the rows do not have the same amount of columns. It says they are expected to have seven, but they vary in number. However, I don't even want seven columns, I need thirteen. Could you explain how this function works? – stargirl Jun 22 '16 at 14:38
  • Do you have an error or just warnings? If the latter, you can ignore them as long as the function succeeded in reading the logfile. Read the documentation to see how this function works (I don't have any other info than that either). – lukeA Jun 22 '16 at 14:48

Instead of data.frame(lines) use

# strsplit() splits each line on single spaces; rbind() stacks the pieces into rows
# (my_data is the character vector of lines, before the data.frame() call)
my_df   <- data.frame( do.call( rbind, strsplit(my_data, ' ' ) ) )
my_cols <- c("Line ID","Date", "Timestamp","Transfer ID","", "IP Address",
   "Username","Encryption Level", "Transferred File", "", "Transferred Bytes",
   "Error", "Transfer Time Data")

Later on, you can clean up the data frame by dropping the columns you do not need or by combining several columns into one:

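# combine the pieces of the quoted transfer-time string into a new column
# (in the sample lines the space split breaks it into four pieces, X13 to X16)
my_df$`Transfer Time Data` <- paste(my_df$X13, my_df$X14, my_df$X15, my_df$X16)

# remove the now-redundant columns; assign the result to keep the change
my_df <- within(my_df, rm(X13, X14, X15, X16))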

This is a bit roundabout, but it should get you what you need.
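A plain space split also cuts through the quoted fields, which is why columns have to be pasted back together afterwards. One possible alternative, sketched here in base R, is to tokenise each line so that a double-quoted string counts as a single field. The sketch assumes every line has exactly 13 fields and that quoted strings never contain an embedded double quote. The names "Method" and "Extra" are placeholders I made up for the two columns the question leaves unnamed.

```r
# Two sample lines copied from the question.
x <- c(
  'F 20160525 08:22:06.838 F798256B GET 10.199.194.38:57708 wei2dt - "" "*li" 264 (0,0) "1.62 seconds (1.30 kilobits/sec)"',
  'F 20160525 08:29:12.717 7798256B SEND 10.19.105.15:57708 wei2dt - "isi_audit_log.dmp" "" 1019692009 (0,0) "38.05 seconds (214 megabits/sec)"'
)

# A token is either a double-quoted string (may contain spaces, may be empty)
# or a run of non-whitespace characters.
tokens <- regmatches(x, gregexpr('"[^"]*"|\\S+', x, perl = TRUE))

# Assumption of this sketch: every line yields exactly 13 tokens.
stopifnot(all(lengths(tokens) == 13L))

# Stack the token vectors into rows, then strip the surrounding quotes.
my_df <- as.data.frame(do.call(rbind, tokens), stringsAsFactors = FALSE)
my_df[] <- lapply(my_df, function(col) gsub('^"|"$', "", col))

# "Method" and "Extra" are hypothetical names for the unnamed columns.
names(my_df) <- c("Line ID", "Date", "Timestamp", "Transfer ID", "Method",
                  "IP Address", "Username", "Encryption Level",
                  "Transferred File", "Extra", "Transferred Bytes",
                  "Error", "Transfer Time Data")
```

All columns come out as character, so something like `as.numeric(my_df[["Transferred Bytes"]])` would still be needed before doing arithmetic on the byte counts.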

user5249203