identify csv in python

Question

I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?

As an example the data file looks like something like this

Session:1
Data collection date: 09-09-2016
Related questions:
    Question 1: parta, partb, partc,
    Question 2: parta, partb, partc

"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"

I need to extract the csv part.

Caveats,
the text in the beginning is NOT limited to 4 - 5 lines.
the additional text is NOT just in the beginning of the file

I saw this post that suggests using re.split and/or csv.Sniffer, however my attempt was not fruitful.

with open("untitled.csv") as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    print(dialect.__dict__)
    csvstarts = False
    csvdump = []
    for ln in csvfile.readlines():
        toks = re.split(r'[,]', ln)
        print(toks)
        if toks[0] == '"field1"' and not csvstarts: # identify by the header line
            csvstarts = True
            continue
        if csvstarts:
            if toks[0] == '"field1"': # identify the start of subsequent csv data
                csvstarts = False
                continue
            csvdump.append(ln)  # record the current line

    print(csvdump)

For now I am able to identify the csv lines accurately ONLY if there is one bunch of data.

Is there anything better I can do?

What separates the lines with data and lines with text then? Can a text line start with "? — Aidenhjj, Sep 22 '16 at 02:48
so, that is where the problem lies, there is nothing that separates csv data from text, unless you consider blank lines as separator. — okkhoy, Sep 22 '16 at 03:13

picmate 涅 · Accepted Answer · 2016-09-22T03:28:06.230

1

How about this:

import re

my_pattern = re.compile("(\"[\w]+\",)+")

with open('<your_file>', 'rb') as fi:
    for f in fi:
        result = my_pattern.match(f)
        if result:
            print f

Assuming the csv data can be differentiated from the rest by having no special characters in them (we only accept each element to have letters or numbers surrounded by double quotes and a comma to separate from the next element)

edited Sep 22 '16 at 03:28

answered Sep 22 '16 at 03:21

picmate 涅

3,951
5
43
52

Note to self: must learn how to use regular expressions...much better answer! – Aidenhjj Sep 22 '16 at 03:30
Happy to help. You could accept the answer if it solves your problem. – picmate 涅 Sep 22 '16 at 03:48
Sorry about that. – picmate 涅 Sep 22 '16 at 03:53
this is about the closest that does the job. a minor fix in the regex is to check if that is the start and end if line as suggested by @Mrd05d below. hence accepting. i will update if i manage to do better – okkhoy Sep 22 '16 at 10:07

score 0 · Answer 2 · answered Sep 22 '16 at 03:18

If your csv lines and only those lines start with \", then you can do this:

import csv

data = list(csv.reader(open("test.csv", 'rb'), quotechar='¬'))
# for quotechar - use something that won't turn up in data

def importCSV(data):
    # outputs list of list with required data
    # works on the assumption that all required data starts with \"
    # and that no text starts with \"

    out = []

    for line in data:
        if (line != []) and (line[0][0] == "\""):
            line = [el.replace("\"", "") for el in line]
            out.append(line)

    return out

useful = importCSV(data)

score 0 · Answer 3 · answered Sep 22 '16 at 03:54

0

Can you not read each line and do a regex to see weather or not to pull the data? Maybe something like:

^(["][\w]["][,])+["][\w]["]$

My regex is not the best and there may likely be a better way but that seemed to work for me.

answered Sep 22 '16 at 03:54

Mrd05d

64
7

identify csv in python

3 Answers3