I'm iterating over multiple .txt files in a folder using glob.iglob and pd.read_csv. The data are arranged in monthly files, and the goal is to extract the data and combine it into aggregated monthly files. The file formats are related but somewhat inconsistent (hence the use of fUniv below).

for filename in glob.iglob('*file.txt'):
    fUniv = open(filename, 'U')
    df = pd.read_csv(fUniv,engine='c',low_memory=False)
    fUniv.close()
    mainBlock(df)

The iteration works well. Across the files, there are two different sets of column headers. I need to treat these two file types slightly differently, using an if-elif-else that distinguishes the files by the presence of particular column headers.

def mainBlock(df):
    if 'x' in df.columns:
        pass  # do stuff
    elif 'y' in df.columns:
        pass  # do different stuff
    else:
        # something is wrong
        sys.exit('Script terminated.')

    # append df to monthly file
    with open(file, 'a') as s:
        df.to_csv(s, header=True, encoding='utf-8')

This also works well for files under a certain size. Once a file is larger than 400 MB, I run into this error:

Error tokenizing data: C error out of memory

I've attempted to use the chunksize iteration described in a couple of threads (1) (2), and came up with this:

for filename in glob.iglob('*file.txt'):
    filesize = os.path.getsize(filename)
    chunklimit = 100000000
    fUniv = open(filename, 'U')
    if filesize > chunklimit:
        df = pd.read_csv(fUniv,engine='c',low_memory=False,iterator=True,
        chunksize=chunklimit)
        for chunk in df:
            mainBlock(chunk)
    fUniv.close()

While it runs through the smaller files fine, it gives the following error when it reaches a file above the chunklimit threshold:

'TextFileReader' object has no attribute 'columns'

My guess is that it reads the initial chunk correctly but cannot satisfy the next iteration of

for chunk in df:
    mainBlock(df)

because there are no column headers to evaluate in the chunks sent to mainBlock's if-elif-else.
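
One way to check this interpretation is to look at what read_csv returns when chunksize is set. The sketch below is only a minimal illustration, assuming a small in-memory CSV (csv_text is made up for the example) in place of the real files. It suggests that the returned object is a reader rather than a DataFrame, and that each chunk yielded from it is a DataFrame carrying the column headers.

```python
import io

import pandas as pd

# Made-up data standing in for one of the monthly files (assumption).
csv_text = "x,value\n1,10\n2,20\n3,30\n"

# With chunksize set, read_csv returns an iterator-like reader, not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
reader_is_frame = isinstance(reader, pd.DataFrame)

# Each item yielded by the reader is a DataFrame with the header row applied.
chunk_columns = [list(chunk.columns) for chunk in reader]
```

If the same holds for the real files, the error would more likely come from passing the reader itself (df) to mainBlock instead of the loop variable chunk, rather than from chunks lacking headers.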

Is this interpretation of the error correct? How do I get around it? Also, once the chunks are passed to mainBlock, do I have to concatenate them before they are appended to the file? Any help is greatly appreciated.
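
On the appending question, here is a rough sketch of processing and appending chunk by chunk without concatenating first. It is written under several assumptions: process_chunk is a hypothetical stand-in for mainBlock, and append_in_chunks, rows_per_chunk, and the demo file names are invented for illustration. Note that chunksize counts rows, not bytes, so a byte-based limit such as chunklimit would not be a suitable value for it.

```python
import os
import tempfile

import pandas as pd


def process_chunk(chunk):
    """Hypothetical stand-in for mainBlock: branch on which header is present."""
    if 'x' in chunk.columns:
        return chunk.assign(kind='x-type')
    elif 'y' in chunk.columns:
        return chunk.assign(kind='y-type')
    raise ValueError('Unrecognised column headers: %s' % list(chunk.columns))


def append_in_chunks(src_path, dst_path, rows_per_chunk=2):
    """Read src_path in row chunks and append each processed chunk to dst_path."""
    n_chunks = 0
    # chunksize is a number of rows; read_csv then returns an iterator of DataFrames
    reader = pd.read_csv(src_path, chunksize=rows_per_chunk)
    for chunk in reader:
        out = process_chunk(chunk)  # pass the chunk, not the reader
        # write the header only when the output file does not exist yet
        write_header = not os.path.exists(dst_path)
        out.to_csv(dst_path, mode='a', header=write_header,
                   index=False, encoding='utf-8')
        n_chunks += 1
    return n_chunks


# Small demo on a throwaway file (names are made up for the example).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'demo_file.txt')
dst = os.path.join(tmp_dir, 'monthly.csv')
pd.DataFrame({'x': range(5), 'value': range(10, 15)}).to_csv(src, index=False)

chunks_written = append_in_chunks(src, dst)
combined = pd.read_csv(dst)
```

Because each chunk is appended as soon as it is processed, only one chunk needs to be in memory at a time. Writing the header only for the first write keeps the aggregated file from repeating the header row partway through.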
