I'm iterating over multiple .txt files in a folder using glob.iglob and read_csv. The data are arranged in monthly files. The goal is to extract the data and combine them into aggregated monthly files. The formats of the files are related but somewhat inconsistent (hence the use of fUniv below).
for filename in glob.iglob('*file.txt'):
    fUniv = open(filename, 'U')
    df = pd.read_csv(fUniv,engine='c',low_memory=False)
    fUniv.close()
    mainBlock(df)
The iteration works well. Within the files, there are two different sets of column headers. I need to treat these two file types slightly differently, using an if-elif-else that distinguishes the files based on the presence of particular column headers.
def mainBlock(df):
    if 'x' in df.columns:
        #do stuff
    elif 'y' in df.columns:
        #do different stuff
    else:
        #something is wrong
        sys.exit('Script terminated.')
    #append df to monthly file
    with open(file, 'a') as s:
        frame.to_csv(s, header=True, index_col=1, encoding='utf-8')
This also works well for file sizes under a certain threshold. Once a file is larger than 400 MB, I run into the following error:
Error tokenizing data: C error out of memory
I've attempted to use the chunksize iteration described in a couple threads (1) (2). I've come up with this...
for filename in glob.iglob('*file.txt'):
    filesize = os.path.getsize(filename)
    chunklimit = 100000000
    fUniv = open(filename, 'U')
    if filesize > chunklimit:
        df = pd.read_csv(fUniv,engine='c',low_memory=False,iterator=True,
                         chunksize=chunklimit)
        for chunk in df:
            mainBlock(chunk)
    fUniv.close()
While it runs through the smaller files fine, it gives the following error when reaching a file above the chunklimit threshold.
'TextFileReader' object has no attribute 'columns'
I've figured out that it is likely reading the initial chunk correctly but cannot satisfy the next iteration of
for chunk in df:
    mainBlock(df)
because there are no column headers to evaluate in the chunks sent to mainBlock's if-elif-else.
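For reference, here is a minimal sketch of how chunked reading behaves, using a small in-memory CSV as a stand-in for one of the monthly files (the column names x and value are assumptions for illustration only). When chunksize is set, pd.read_csv returns a TextFileReader rather than a DataFrame, and each item it yields is an ordinary DataFrame that carries the header columns. Note also that chunksize is a number of rows, not a number of bytes.

```python
import io
import pandas as pd

# Hypothetical stand-in for one monthly file; 'x' and 'value' are assumed column names.
csv_text = "x,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# chunksize is a number of rows per chunk, not a byte count.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# The reader is an iterator over DataFrames, not a DataFrame itself.
reader_is_frame = isinstance(reader, pd.DataFrame)

# Each chunk is a regular DataFrame with the same header columns.
chunk_columns = []
chunk_lengths = []
for chunk in reader:
    chunk_columns.append(list(chunk.columns))
    chunk_lengths.append(len(chunk))
```

If this matches what is happening with the real files, the header check in mainBlock should work on each chunk, as long as the chunk and not the reader is what gets passed in.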
Is this interpretation of the error correct? How do I get around this? Also, once the chunks are passed, do I have to concatenate them before they are appended to file? Any help is greatly appreciated.
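On the last point, one possible approach (a sketch under assumptions, not necessarily the best fit for these files) is to append each processed chunk straight to the output file and write the header only once, so nothing has to be concatenated in memory first. The names process_chunk and append_chunks, the output path, and the sample columns are hypothetical and stand in for the real per-file handling.

```python
import io
import os
import tempfile
import pandas as pd

def process_chunk(chunk):
    # Hypothetical placeholder for the per-file-type handling in mainBlock.
    if 'x' in chunk.columns:
        return chunk
    elif 'y' in chunk.columns:
        return chunk.rename(columns={'y': 'x'})
    raise ValueError('Unexpected column headers: %s' % list(chunk.columns))

def append_chunks(source, out_path, rows_per_chunk):
    # Read the source in row-based chunks and append each one to out_path.
    for chunk in pd.read_csv(source, chunksize=rows_per_chunk):
        frame = process_chunk(chunk)
        # Write the header only if the output file is missing or still empty.
        write_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
        frame.to_csv(out_path, mode='a', header=write_header, index=False)

# Usage with a small in-memory stand-in for a monthly file.
sample = io.StringIO("y,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")
out_path = os.path.join(tempfile.mkdtemp(), 'monthly.csv')
append_chunks(sample, out_path, rows_per_chunk=2)
result = pd.read_csv(out_path)
```

Writing with mode='a' keeps only one chunk in memory at a time. The header check matters because appending with header=True on every chunk would repeat the header row throughout the output file.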