I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?
As an example the data file looks like something like this
Session:1
Data collection date: 09-09-2016
Related questions:
Question 1: parta, partb, partc,
Question 2: parta, partb, partc
"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"
I need to extract the csv part.
Caveats,
the text in the beginning is NOT limited to 4 - 5 lines.
the additional text is NOT just in the beginning of the file
I saw this post that suggests using re.split and/or csv.Sniffer, however my attempt was not fruitful.
with open("untitled.csv") as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
print(dialect.__dict__)
csvstarts = False
csvdump = []
for ln in csvfile.readlines():
toks = re.split(r'[,]', ln)
print(toks)
if toks[0] == '"field1"' and not csvstarts: # identify by the header line
csvstarts = True
continue
if csvstarts:
if toks[0] == '"field1"': # identify the start of subsequent csv data
csvstarts = False
continue
csvdump.append(ln) # record the current line
print(csvdump)
For now I am able to identify the csv lines accurately ONLY if there is one bunch of data.
Is there anything better I can do?