
I need to parse a bunch of huge text files, each 100MB+. They are poorly formatted log files in CSV format, but each record spans multiple lines, so I can't just read each line and split it on the delimiter. It's also not a set number of lines per record, since if there are blank values then sometimes the line is skipped, and some lines overflow into the next line. Also, the record delimiter can change within the same file, from "" to " ***** ", and there is sometimes a line that says "end of log #"

Sample log:

"Date:","6/23/2015","","Location:","Kol","","Target Name:","ILO.sed.908"
"ID:","ke.lo.213"
"User:","EDU\namo"
"Done:","Edit File"
"Comment","File saved successfully"
""
"Date:","6/27/2015","","Location:","Los Angeles","","Target Name:","MAL.21.ol.lil"
"ID:","uf.903.124.56"
"Done:","dirt emptied and driven to locations without issue, yet to do anyt"
"hing with the steel pipes, no planks "
"Comment"," l"
""
"end of log 1"
"Date:","5/16/2015","","Location:","Springfield","","Target Name:","ile.s.ol.le"
"ID:","84l.df.345"
"User:","EDU\bob2"
"Done:","emptied successfully"
"Comment","File saved successfully"
" ******* "

How should I approach this? It needs to be efficient so that I can process the files fast, so fewer file I/O operations would be nice. I currently just read everything into memory at once:

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

I am also somewhat new to Python. I know how to handle reading multiple files and running the code on each, and I have a toString method to output the result into a new CSV file.

The other problem is that a few of the log files are several GB in size, and it wouldn't do to read all of that into memory at once, but I don't know how to separate it into chunks. I can't just read X number of lines, since the record line counts are not fixed.

The comments need to be saved and concatenated together in a single string.

So please help!

  • Example of how to read huge file in chunks: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – Jay-C Jul 12 '15 at 03:17

2 Answers


I noticed that each log entry begins with a "Date" line and ends with a "Done" line followed by a "Comment" line. So instead of worrying about the delimiters, you could read everything from the "Date" line through the "Comment" line and treat that as one record.

The "end of log" message doesn't seem really important but if you really want to grab that as well, you could grab everything between two consecutive "Date" lines and that would be one block of log.

I posted a link above on how to load a file in chunks. The bigger the chunk, the fewer I/O operations you have to do, but the trade-off is that bigger chunks use more memory.
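For reference, the pattern from that link boils down to a small generator like this (the 1 MB chunk size is just an arbitrary starting point to tune):

def read_in_chunks(file_obj, chunk_size=1024 * 1024):
    """Lazily yield the file in chunk_size pieces instead of reading it all at once."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        yield data

Keep in mind that a line or a whole record can straddle a chunk boundary, so you would still need to carry the tail of each chunk over into the next one before splitting into records, which is one reason simply iterating the file line by line often ends up simpler.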

Jay-C

To handle the large file you should use the fact that file objects are iterators in Python, yielding one line at a time:

with open('Path/to/file', 'r') as content_file:
    for line in content_file:
        # your code

The Python csv library uses this feature as well. That library might be useful here.
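For example, here is a minimal sketch that combines line-by-line iteration with csv.reader and turns each record into a dict (the field names come from the sample log, and gluing overflow lines onto the previous field is an assumption about the desired output):

import csv

def parse_log(path):
    """Parse one log file into a list of dicts, one per record (a sketch)."""
    records = []
    current = None
    last_key = None
    with open(path, 'r', newline='') as f:
        for row in csv.reader(f):                # file objects iterate line by line
            first = row[0] if row else ''
            if not first.strip(' *'):
                continue                         # skip "" and " ***** " separator lines
            if first.startswith('end of log'):
                continue                         # skip "end of log #" markers
            if first == 'Date:':                 # a new record starts here
                current = {}
                records.append(current)
            if current is None:
                continue                         # ignore anything before the first record
            if first.endswith(':') or first == 'Comment':
                # walk the row, pairing "Key:" cells with the values that follow them
                for cell in row:
                    if cell.endswith(':') or cell == 'Comment':
                        last_key = cell.rstrip(':')
                        current.setdefault(last_key, '')
                    elif cell and last_key:
                        current[last_key] += cell
            elif last_key:
                # an overflow line: glue it onto the previous field's value
                current[last_key] += first
    return records

With the sample above, the second record's "Done" value comes out as its two overflow pieces joined together, so multi-line values such as long comments end up as a single string.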

Klaus D.