I have a question. Is there a possible way to see if a column header exists in a file, or skip rows until? Say I have a group of files. One with a header on the first row, another with the header on the second row following some useless text on the first row, and another that has no header. I want to skip all rows before the column header or detect if one even exists without specifying "skiprows" in the code. There are a number of hard coded ways to do this. I have used regexes and replaces etc., but I am looking for a more universal idea that covers all bases. I have even made a raw input prompt that allows you to enter the amount of rows you want to skip. That method worked, but I want something that will not have to rely on user input and just detect column headers on its own. I am just looking for a few ideas if any. I am working mainly csv type files and would like to do this with Python.
1 Answers
csv.Sniffer has a has_header() function that should return True if the first row appears to be a header. A procedure for using it would be to first remove all empty rows from the top until the first non-empty row and then run csv.Sniffer.has_header(). My experience is that the header must be in the first line for has_header() to return True and it will return False if the number of header fields do not match the number of data fields for at least one row in its scan range which must be set by the user. 1024 or 2048 are typical scan ranges. I tried to set it much higher even so the entire file would be read, but it still failed to recognize the header if it was not in the first line. All my testing was done using Python 2.7.10.
Here is an example of using csv.Sniffer in a script that first determines if a file has a recognizable header and if not renames it, creates a new, empty file with the original name, then opens the renamed file for reading and the new file for writing and writes the renamed file contents to the new file excluding leading blank lines. Finally it retests the new file for a header to determine if removing the blank lines made a difference.
import csv
from datetime import datetime
import os
import re
import shutil
import sys
import time
common_delimeters = set(['\' \'', '\'\t\'', '\',\''])
def sniff(filepath):
with open(filepath, 'rb') as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read(2048))
delimiter = repr(dialect.delimiter)
if delimiter not in common_delimeters:
print filepath,'has uncommon delimiter',delimiter
else:
print filepath,'has common delimiter',delimiter
csvfile.seek(0)
if csv.Sniffer().has_header(csvfile.read(2048)):
print filepath, 'has a header'
return True
else:
print filepath, 'does not have a header'
return False
def remove_leading_blanks(filepath):
# test filepath for header and delimiter
print 'testing',filepath,'with sniffer'
has_header = sniff(filepath)
if has_header:
print 'no need to remove leading blank lines if any in',filepath
return True
# make copy of filepath appending current date-time to its name
if os.path.isfile(filepath):
now = datetime.now().strftime('%Y%d%m%H%M%S')
m = re.search(r'(\.[A-Za-z0-9_]+)\Z',filepath)
bakpath = ''
if m != None:
bakpath = filepath.replace(m.group(1),'') + '.' + now + m.group(1)
else:
bakpath = filepath + '.' + now
try:
print 'renaming', filepath,'to', bakpath
os.rename(filepath, bakpath)
except:
print 'renaming operation failed:', sys.exc_info()[0]
return False
print 'creating a new',filepath,'from',bakpath,'minus leading blank lines'
# now open renamed file and copy it to original filename
# except for leading blank lines
time.sleep(2)
try:
with open(bakpath) as o, open (filepath, 'w') as n:
p = False
for line in o:
if p == False:
if line.rstrip():
n.write(line)
p = True
else:
continue
else:
n.write(line)
except IOError as e:
print 'file copy operation failed: %s' % e.strerror
return False
print 'testing new',filepath,'with sniffer'
has_header = sniff(filepath)
if has_header:
print 'the header problem with',filepath,'has been fixed'
return True
else:
print 'the header problem with',filepath,'has not been fixed'
return False
Given this csv file where the header is actually on line 11:
header,better,leader,fodder,blather,super
1,2,3,,,
4,5,6,7,8,9
3,4,5,6,7,
2,,,,,
remove_leading_blanks() determined that it did not have headers, then removed the leading blank lines and determined that it did have headers. Here is the trace of its console output:
testing test1.csv with sniffer...
test1.csv has uncommon delimiter '\r'
test1.csv does not have a header
renaming test1.csv to test1.20153108142923.csv
creating a new test1.csv from test1.20153108142923.csv minus leading blank lines
testing new test1.csv with sniffer
test1.csv has common delimiter ','
test1.csv has a header
the header problem with test1.csv has been fixed
done ok
While this may work a lot of the time, generally it does not appear reliable due to too much variation in headers and their placement. However, maybe its better than nothing.
See csv.Sniffer, csv.py and _csv.c for more info. PyMOTW's csv – Comma-separated value files has a good tutorial review of the csv module with details on Dialects.