Background: I'm streaming log files from Amazon S3. From zipped files, after a few steps, I get a file-like object. For gzipped files, I decompress each chunk of the stream into a string and then use str.splitlines() to get a list of rows.
csv.reader accepts anything that supports the iterator protocol, such as files and lists. For files, however, I need to call file.close() once everything is done. Once unzipped or decompressed, the files I have are CSV and TSV files, i.e. comma- or tab-separated.
delims = [',','\t']
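To illustrate the point that csv.reader takes any iterable of lines, here is a minimal sketch. The `chunk` string is made-up sample data standing in for a decompressed piece of the stream, and the delimiter is hard-coded only for the demonstration.

```python
import csv

# Hypothetical decompressed chunk; in practice this comes from the stream.
chunk = "a,b,c\n1,2,3\n4,5,6\n"
rows_as_list = chunk.splitlines()

# A plain list of strings works as input; no file object is required.
reader = csv.reader(rows_as_list, delimiter=",")
parsed = list(reader)
```

Because the input is a list rather than a file, there is nothing to close afterwards.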
For the zip files, a middle step produces a ZipExtFile, which DOES NOT have a seek() function, so I can't use csv.Sniffer. The gzip files were streamed and have become a list of rows.
How do I dynamically figure out which delimiter to use when calling csv.reader? I'm currently using the code below (based off this). Ideally, I would send a teststr to this function and then call csv.reader(csvfile, delimiter=k).
HOWEVER, how do I get a sample of the file/list to test and then return to the start, given that neither type of input has a seek() function?
teststr = 'how,-do,-you,-dynamically,-identify,-unknown,-delimiters,-in,-a,-data'

def find_delimiter(teststr):
    # based on: how-do-you-dynamically-identify-unknown-delimiters-in-a-data-file
    possible = [',', '\t', '-']
    # count how often each candidate delimiter occurs in the sample
    count = {}
    for c in teststr:
        if c in possible:
            count[c] = count.get(c, 0) + 1
    # keep the candidate(s) with the highest count
    delim = [key for key, val in count.items() if val == max(count.values())]
    if len(delim) == 1:
        delim = delim[0]
    else:
        # tie (or no candidate found): report it and give up
        print(delim)
        delim = None
    return delim

k = find_delimiter(teststr)
print(k)
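For the sample-and-rewind part of the question, one possible approach is to consume a few lines, test them, and then chain them back in front of the remaining iterator, so that seek() is never needed. The sketch below is only an illustration under assumptions: `open_reader`, `sample_size`, and the default `delimiters` string are hypothetical names and values, the input is assumed to yield text lines (a ZipExtFile that yields bytes would need decoding first), and csv.Sniffer().sniff() is used on the sampled string in place of the hand-rolled counter.

```python
import csv
import itertools


def open_reader(lines, sample_size=5, delimiters=",\t"):
    """Return a csv.reader over `lines` using a sniffed delimiter.

    `lines` is any iterable of text lines (a list, a file-like object, ...).
    Nothing here calls seek(): the sampled lines are chained back in front
    of the rest of the iterator.
    """
    it = iter(lines)
    # Consume a small sample from the front of the iterator.
    head = list(itertools.islice(it, sample_size))
    # sniff() works on a string, so join the sampled lines into one.
    dialect = csv.Sniffer().sniff("\n".join(head), delimiters=delimiters)
    # Re-attach the sampled lines so the reader still sees every row.
    return csv.reader(itertools.chain(head, it), delimiter=dialect.delimiter)


# Usage with a list of rows, as produced from the gzip stream.
tsv_rows = list(open_reader(["a\tb\tc", "1\t2\t3", "4\t5\t6"]))
```

Note that sniff() raises csv.Error when it cannot decide on a delimiter, so a real caller would probably want to catch that and fall back to a default.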