Is there an alternative to the csv module for reading a CSV file in Python 3 in a streaming way? Currently my data looks something like this:

"field1"::"field2"::"field3"\x02\n
"1"::"hi\n"::"3"\x02\n
"8"::"ok"::"3"\x02\n

The field separator is two characters, :: (the csv module only accepts a single-character delimiter), and the line terminator is also two characters, \x02\n. Are there any CSV readers for Python that work in a streaming mode and support this?

Here is an example of what I'm trying to do:

>>> import csv
>>> s = '''"field1"::"field2"::"field3"\x02\n\n"1"::"hi\n"::"3"\x02\n\n"8"::"ok"::"3"\x02\n'''
>>> csvreader=csv.reader(s, delimiter='::', lineterminator='\x02\n')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: "delimiter" must be a 1-character string

Loading pandas just to read this csv seems like overkill x 100, so I'd like to see what other options there are.

  • If you're able to control how this CSV is formatted, I would switch to a single-character delimiter and a different line separator. Otherwise, just `open` and `re` should suffice here, I believe. – Jab Feb 14 '19 at 04:13
  • Are you saying you would like to have the data separated by the two delimiters within the same process? As well, are you using `csv.reader`? Could you maybe post the section of code you are currently attempting to use to clean this data? – Hayden Feb 14 '19 at 04:13
  • Here's a related Q/A, but requires pandas--seems like a giant dependency for such a small feature: https://stackoverflow.com/questions/31194669/use-multiple-character-delimiter-in-python-pandas-read-csv – Brian Peterson Feb 14 '19 at 04:15
  • @BrianPeterson agreed -- are there any other options? –  Feb 14 '19 at 04:46
  • @Jaba `re` gets really tricky -- with escape characters, quote characters, etc. I'd rather not try to do that. –  Feb 14 '19 at 04:48
  • Your csv format as written in your code is not formatted right at all. Do you mean: `'field1::field2::field3\x02\n1::2::3\x02\n8::2::3\x02'` – Jab Feb 14 '19 at 05:00
  • @Jaba -- yes, that's correct. I'll update it. –  Feb 14 '19 at 05:50
  • @Jaba -- updated. –  Feb 14 '19 at 19:09

2 Answers

As you have discovered, the csv module does not support that data format directly. You could, though, pre-process the data before passing it to `csv.reader()`. For example, the following approach should work:

from io import StringIO
import csv

s = '''"field1"::"field2"::"field3"\x02\n\n"1"::"hi\n"::"3"\x02\n\n"8"::"ok"::"3"\x02\n'''

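def csv_reader_alt(source):
    # Drop the \x02 marker and collapse '::' to ':' so csv.reader gets a single-character delimiter.
    # Note: this assumes ':' never appears on its own inside a field.
    return csv.reader((line.replace('\x02', '').replace('::', ':') for line in source), delimiter=':')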

for row in csv_reader_alt(StringIO(s)):
    if row:
        print(row)

Giving you the following output:

['field1', 'field2', 'field3']
['1', 'hi\n', '3']
['8', 'ok', '3']
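
One caveat with collapsing `::` to `:` is that a lone `:` inside a field (a time such as 12:30, say) would then be read as a delimiter. Below is a minimal sketch of a variant that maps `::` to a character assumed never to occur in the data. The names `csv_reader_sentinel` and `SENTINEL` and the sample string are hypothetical, chosen only for illustration:

```python
import csv
from io import StringIO

# ASCII "unit separator"; ASSUMPTION: this character never occurs in the data.
SENTINEL = "\x1f"


def csv_reader_sentinel(source, delimiter="::", terminator="\x02"):
    """Wrap csv.reader, rewriting the two-character delimiter on the fly.

    `source` is any iterable of text lines (for example an open file).
    Lines are rewritten lazily, one at a time, so the input is streamed.
    """
    cleaned = (
        line.replace(terminator, "").replace(delimiter, SENTINEL)
        for line in source
    )
    return csv.reader(cleaned, delimiter=SENTINEL)


# Hypothetical sample: one field contains a single ':' and one a quoted newline.
sample = '"time"::"note"\x02\n"12:30"::"hi\n"\x02\n'
for row in csv_reader_sentinel(StringIO(sample)):
    if row:  # skip blank lines between records
        print(row)
```

This still rewrites a `::` that happens to sit inside a quoted field, so it is only safe if the two-character delimiter never appears in the field values themselves.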
Martin Evans
  • thanks for this. Please see updated question, where reading rows line by line isn't as straightforward. –  Feb 14 '19 at 19:09
  • @DavidL it's a bit difficult to tell the exact format from your small example but I have now shown how you could possibly pre-parse your data before passing it to a normal `csv.reader()`. Maybe a link to the actual CSV file would help for testing. – Martin Evans Feb 15 '19 at 08:15

@MartinEvans shows a nice way of doing it in his answer.

Here is code that reads from a file (rather than from a string in memory) with proper file handling, using a generator to split the input on a custom line delimiter:

def get_line(file, delimiter='\n', bufsize=4096):
    # https://stackoverflow.com/a/19600562/9225671
    buf = ''
    while True:
        chunk = file.read(bufsize)
        if len(chunk) == 0:
            # end of file has been reached; serve the remaining data (if any) and exit
            if buf:
                yield buf
            return

        buf += chunk
        line_list = buf.split(delimiter)

        # don't serve the last part yet, first we need to read more chunks from the file
        buf = line_list.pop(-1)

        for line in line_list:
            yield line

if __name__ == '__main__':
    with open('my_file.csv') as f:
        for line in get_line(f, delimiter='\x02\n'):
            if len(line) > 0:
                parts = line.split('::')
                print(parts)
                print([
                    e.strip('"')
                    for e in parts])
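
Note that `line.split('::')` also splits on a `::` that occurs inside a quoted field, and `strip('"')` leaves doubled quotes inside a field untouched. Below is a rough sketch of a quote-aware splitter that could take its place. It assumes the fields follow the usual CSV convention of `"` quoting, with `""` as an escaped quote. `iter_records` is a compact restatement of `get_line` above; both it and `split_fields` are hypothetical names:

```python
import io


def iter_records(stream, terminator="\x02\n", bufsize=4096):
    """Yield terminator-separated records from a text stream, chunk by chunk."""
    buf = ""
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            if buf:  # trailing data without a final terminator
                yield buf
            return
        buf += chunk
        # The last piece may be incomplete; keep it for the next chunk.
        *complete, buf = buf.split(terminator)
        yield from complete


def split_fields(record, sep="::", quote='"'):
    """Split one record on `sep`, ignoring separators inside quoted fields.

    ASSUMPTION: a doubled quote inside a quoted field means a literal quote.
    """
    fields, current, in_quotes, i = [], [], False, 0
    while i < len(record):
        if record.startswith(quote, i):
            if in_quotes and record.startswith(quote, i + len(quote)):
                current.append(quote)  # escaped quote inside a quoted field
                i += 2 * len(quote)
                continue
            in_quotes = not in_quotes  # opening or closing quote
            i += len(quote)
        elif not in_quotes and record.startswith(sep, i):
            fields.append("".join(current))
            current = []
            i += len(sep)
        else:
            current.append(record[i])
            i += 1
    fields.append("".join(current))
    return fields


# Hypothetical sample data; a tiny bufsize exercises the chunk boundaries.
data = '"field1"::"field2"\x02\n"1"::"hi\n"\x02\n"8"::"a::b"\x02\n'
for record in iter_records(io.StringIO(data), bufsize=5):
    print(split_fields(record))
```

Only one buffer's worth of text plus the current record is held in memory at a time, so this stays streaming. It does assume that the record terminator `\x02\n` never appears inside a quoted field.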

Does that work for you?

Ralf