
Background: I'm streaming log files from Amazon S3. From zipped files, after a few steps, I get a file-like object. For gzipped files, I decompress each chunk of the stream into a string and then use str.splitlines() to get a list of rows.
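For illustration, here is a minimal sketch of that gzip branch (my own, not from the original post), assuming stream is any file-like object yielding raw gzip bytes from S3 (e.g. a boto key); the function name and chunk size are made up:

import zlib

def stream_gzip_rows(stream, chunk_size=64 * 1024):
    # 16 + MAX_WBITS tells zlib to expect a gzip header
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    leftover = b''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = leftover + decomp.decompress(chunk)
        rows = data.splitlines(True)   # keep line endings to spot partial rows
        # the last element may be a row cut off mid-chunk; carry it over
        if rows and not rows[-1].endswith(b'\n'):
            leftover = rows.pop()
        else:
            leftover = b''
        for row in rows:
            yield row.decode('utf-8').rstrip('\r\n')
    tail = leftover + decomp.flush()
    if tail:
        yield tail.decode('utf-8').rstrip('\r\n')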

csv.reader accepts anything that supports the iterator protocol, such as file objects and lists (though for file objects I'll need to call file.close() when everything is done). Once unzipped or decompressed, my files are csv and tsv files, i.e. comma- or tab-separated.

delims = [',','\t']
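For example (an illustration of my own, not from the post), csv.reader is happy with a plain list of rows:

import csv

rows = ['a,b,c', '1,2,3']
for record in csv.reader(rows, delimiter=','):
    print(record)   # ['a', 'b', 'c'] then ['1', '2', '3']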

For the zip files, an intermediate step produces a ZipExtFile, which DOES NOT have a seek() method, so I can't use csv.Sniffer. The gzip files were streamed and have become lists of rows.
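(For reference, csv.Sniffer.sniff() itself only needs a string sample; the part that breaks without seek() is getting back to the start of the file afterwards. A quick sketch of Sniffer on a sample string:)

import csv

sample = 'a\tb\tc\n1\t2\t3\n'
dialect = csv.Sniffer().sniff(sample, delimiters=',\t')
print(dialect.delimiter)   # '\t'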

How do I dynamically figure out which delimiter to use when calling csv.reader? I'm currently using the code below (based on this). Ideally, I'd send a teststr to it and then call csv.reader(csvfile, delimiter=k).

HOWEVER, how do I get a sample of the file/list to test, and then return to the start of the file, when neither type of input has a seek() method?

teststr = 'how,-do,-you,-dynamically,-identify,-unknown,-delimiters,-in,-a,-data'

def find_delimiter(teststr):
    # how-do-you-dynamically-identify-unknown-delimiters-in-a-data-file
    possible = [',', '\t', '-']
    count = {}

    for c in teststr:
        if c in possible:
            count[c] = count.get(c, 0) + 1

    if not count:
        return None

    # keep the candidate(s) that occur most often in the sample
    best = max(count.values())
    delim = [key for key, val in count.items() if val == best]

    if len(delim) == 1:
        delim = delim[0]
    else:
        # ambiguous: several candidates tie for the maximum count
        print(delim)
        delim = None
    return delim

k = find_delimiter(teststr)
print(k)
  • Can't you load the file in memory and work from that string buffer? – Sylvain Leroux Aug 27 '14 at 21:55
  • @SylvainLeroux (sorry for ignorance in advance) I don't know how to, nor if it's possible. I think I'm not loading files but fetching them (is there a difference?) from AWS S3, and I think that streaming is like putting it in memory and then reading out a bit? Moreover, I'm dealing with files of hundreds of MB, and at last count almost 24,000 files. Don't know if this is relevant, but I also only really want a couple of lines from each file. – ehacinom Aug 27 '14 at 21:59

1 Answer


Summary of my own solution.

I decided that the little method works, so I changed my approach: I open or stream the file and, temporarily ignoring csv.reader() (and hoping that most data is well-behaved about newlines, which it ought to be), use the .readline() method of the file-like object to grab a couple of lines.

These lines are then sent to the find_delimiter method above, and the sampled lines plus the returned delimiter are run through csv.reader().
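One way to wire that up (itertools.chain is my addition, not necessarily what I describe above doing verbatim; the function name and sample size are made up, and find_delimiter is the method from the question):

import csv
import itertools

def delimited_reader(f, n_sample=3):
    # grab a few lines for delimiter detection
    sample = [f.readline() for _ in range(n_sample)]
    # assumes find_delimiter found a unique delimiter (not None)
    delim = find_delimiter(''.join(sample))
    # no seek(): chain the consumed lines back in front of the rest
    return csv.reader(itertools.chain(sample, f), delimiter=delim)

itertools.chain simply puts the already-consumed sample lines back in front of the remaining file iterator, so no seek() is ever needed.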
