0

I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?

As an example the data file looks like something like this

Session:1
Data collection date: 09-09-2016
Related questions:
    Question 1: parta, partb, partc,
    Question 2: parta, partb, partc

"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"

I need to extract the csv part.

Caveats,
the text in the beginning is NOT limited to 4 - 5 lines.
the additional text is NOT just in the beginning of the file

I saw this post that suggests using re.split and/or csv.Sniffer, however my attempt was not fruitful.

with open("untitled.csv") as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    print(dialect.__dict__)
    csvstarts = False
    csvdump = []
    for ln in csvfile.readlines():
        toks = re.split(r'[,]', ln)
        print(toks)
        if toks[0] == '"field1"' and not csvstarts: # identify by the header line
            csvstarts = True
            continue
        if csvstarts:
            if toks[0] == '"field1"': # identify the start of subsequent csv data
                csvstarts = False
                continue
            csvdump.append(ln)  # record the current line

    print(csvdump)

For now I am able to identify the csv lines accurately ONLY if there is one bunch of data.

Is there anything better I can do?

Community
  • 1
  • 1
okkhoy
  • 1,298
  • 3
  • 16
  • 29
  • What separates the lines with data and lines with text then? Can a text line start with "? – Aidenhjj Sep 22 '16 at 02:48
  • so, that is where the problem lies, there is nothing that separates csv data from text, unless you consider blank lines as separator. – okkhoy Sep 22 '16 at 03:13

3 Answers3

1

How about this:

import re

my_pattern = re.compile("(\"[\w]+\",)+")

with open('<your_file>', 'rb') as fi:
    for f in fi:
        result = my_pattern.match(f)
        if result:
            print f

Assuming the csv data can be differentiated from the rest by having no special characters in them (we only accept each element to have letters or numbers surrounded by double quotes and a comma to separate from the next element)

picmate 涅
  • 3,951
  • 5
  • 43
  • 52
0

If your csv lines and only those lines start with \", then you can do this:

import csv

data = list(csv.reader(open("test.csv", 'rb'), quotechar='¬'))
# for quotechar - use something that won't turn up in data

def importCSV(data):
    # outputs list of list with required data
    # works on the assumption that all required data starts with \"
    # and that no text starts with \"

    out = []

    for line in data:
        if (line != []) and (line[0][0] == "\""):
            line = [el.replace("\"", "") for el in line]
            out.append(line)

    return out

useful = importCSV(data)
Aidenhjj
  • 1,249
  • 1
  • 14
  • 27
0

Can you not read each line and do a regex to see weather or not to pull the data? Maybe something like:

^(["][\w]["][,])+["][\w]["]$

My regex is not the best and there may likely be a better way but that seemed to work for me.

Mrd05d
  • 64
  • 7