I need to parse a bunch of huge text files, each 100 MB+. They are poorly formatted log files in CSV format, but each record spans multiple lines, so I can't just read each line and split it on the delimiter. It's also not a set number of lines per record: if there are blank values then sometimes a line is skipped, and some lines overflow into the next line. Also, the record separator can change within the same file, from "" to a line of asterisks like " ***** ", and there is sometimes a line that says "end of log #".
Sample log:
"Date:","6/23/2015","","Location:","Kol","","Target Name:","ILO.sed.908"
"ID:","ke.lo.213"
"User:","EDU\namo"
"Done:","Edit File"
"Comment","File saved successfully"
""
"Date:","6/27/2015","","Location:","Los Angeles","","Target Name:","MAL.21.ol.lil"
"ID:","uf.903.124.56"
"Done:","dirt emptied and driven to locations without issue, yet to do anyt"
"hing with the steel pipes, no planks "
"Comment"," l"
""
"end of log 1"
"Date:","5/16/2015","","Location:","Springfield","","Target Name:","ile.s.ol.le"
"ID:","84l.df.345"
"User:","EDU\bob2"
"Done:","emptied successfully"
"Comment","File saved successfully"
" ******* "
How should I approach this? It needs to be efficient so that I can process the files quickly, so fewer file I/O operations would be nice. Currently I just read everything into memory at once:
with open('Path/to/file', 'r') as content_file:
    content = content_file.read()
I am also somewhat new to Python. I know how to handle reading multiple files and running the code on each, and I have a toString to output the result into a new CSV file.
The other problem is that a few of the log files are several GB in size, and it won't do to read all of that into memory at once, but I don't know how to split the input into chunks. I can't just read a fixed number of lines, since the record line counts are not set.
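What I was imagining is a generator that walks the file line by line (Python buffers reads under the hood, so this shouldn't mean more file I/O) and yields one record's worth of lines each time it hits a separator line. Here's a rough, untested sketch, with the separator patterns guessed from my sample above:

import re

# Separator lines seen so far: "", a quoted run of asterisks like " ******* ",
# and "end of log <number>" -- all wrapped in quotes in my files
SEPARATOR = re.compile(r'^"(\s*\*+\s*|end of log \d+)?"\s*$')

def records(path):
    """Yield one record at a time as a list of raw lines."""
    record = []
    with open(path, 'r') as f:
        for line in f:
            if SEPARATOR.match(line):
                if record:           # ignore back-to-back separators
                    yield record
                    record = []
            else:
                record.append(line.rstrip('\n'))
    if record:                       # last record may have no trailing separator
        yield record

Is something like that a sensible direction, or is there a standard way to handle this?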
The comments for each record need to be saved and concatenated into a single string.
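For the overflow lines (like the "Done:" value in the second record above, which spills onto a second line), my rough idea was: within one record, any line that doesn't start with a known field label gets glued onto the previous line before the CSV parse. Something like this, with the labels guessed from my sample (untested):

# Field labels I know about; anything else must be an overflow line
KNOWN_LABELS = ('"Date:"', '"ID:"', '"User:"', '"Done:"', '"Comment"')

def stitch(lines):
    """Merge continuation lines back into the previous line of a record."""
    merged = []
    for line in lines:
        if merged and not line.startswith(KNOWN_LABELS):
            # drop the stray closing/opening quotes the overflow created,
            # then join the raw text and let csv.reader parse it afterwards
            merged[-1] = merged[-1].rstrip().rstrip('"') + line.lstrip().lstrip('"')
        else:
            merged.append(line)
    return merged

After that, each record's lines should parse cleanly with csv.reader, and the multi-line Comment values would come out as one string.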
So please help!