
We all know the question that comes up when you run into a memory error: Maximum size of pandas dataframe

I am also trying to read 4 large csv files with the following command:

import glob
import pandas as pd

files = glob.glob("C:/.../rawdata/*.csv")
dfs = [pd.read_csv(f, sep="\t", encoding='unicode_escape') for f in files]
df = pd.concat(dfs, ignore_index=True)

The only message I receive is:

C:..\conda\conda\envs\DataLab\lib\site-packages\IPython\core\interactiveshell.py:3214: DtypeWarning: Columns (22,25,56,60,71,74) have mixed types. Specify dtype option on import or set low_memory=False. if (yield from self.run_code(code, result)):

which should be no problem.
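For reference, the warning can be addressed the way it suggests: either read each file in a single pass, or pin the dtype of the offending columns. A minimal sketch, reusing files from above ('mixed_col' is a placeholder name, since the warning only reports column positions):

# option 1: read the file in one pass instead of in chunks
tmp = pd.read_csv(files[0], sep="\t", encoding='unicode_escape', low_memory=False)

# option 2: force a string dtype for a column with mixed types ('mixed_col' is hypothetical)
tmp = pd.read_csv(files[0], sep="\t", encoding='unicode_escape', dtype={'mixed_col': str})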

My total dataframe has a size of: (6639037, 84)

Could there be some data-size restriction that does not raise a memory error? That is, could Python be silently skipping some lines without telling me? I had this with another program in the past. I don't think Python is that lazy, but you never know.
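One way to rule that out is to compare the number of data lines in the raw files with the number of rows in the concatenated frame. A rough sketch, assuming one record per line and a single header line per file:

import glob

files = glob.glob("C:/.../rawdata/*.csv")
total = 0
for f in files:
    with open(f, encoding='unicode_escape') as fh:
        total += sum(1 for _ in fh) - 1  # subtract the header line
print(total, len(df))  # the two numbers should match if nothing was skipped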

For further context: later I save it as an SQLite file, but I don't think this should be a problem either:

import sqlite3

conn = sqlite3.connect('C:/.../In.db')
df.to_sql(name='rawdata', con=conn, if_exists='replace', index=False)
conn.commit()
conn.close()
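If the write itself ever becomes a memory problem, to_sql accepts a chunksize parameter, so the rows are inserted in batches instead of all at once. A sketch (100000 rows per batch is an arbitrary choice):

conn = sqlite3.connect('C:/.../In.db')
# insert in batches of 100000 rows to keep the intermediate buffers small
df.to_sql(name='rawdata', con=conn, if_exists='replace', index=False, chunksize=100000)
conn.close()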
PV8
  • you can try saving the dataframe as a csv and comparing the file size with the input csvs; I think you will easily see if something is missing (the only difference should be the headers). – Andrea Nov 28 '19 at 14:14

2 Answers

3

You can pass a generator expression to pd.concat:

dfs = (pd.read_csv(f, sep="\t", encoding='unicode_escape') for f in files)

so you avoid building that huge list in memory. This might alleviate the memory-limit problem.

Besides, you can write a special generator that downcasts some columns as it reads. Say, like this:

def downcaster(names):
    # read the files one by one and shrink the expensive columns before concatenating
    for name in names:
        x = pd.read_csv(name, sep="\t", encoding='unicode_escape')
        x['some_column'] = x['some_column'].astype('category')
        x['other_column'] = pd.to_numeric(x['other_column'], downcast='integer')
        yield x

dc = downcaster(files)
df = pd.concat(dc, ignore_index=True)
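To check whether the downcast actually pays off, you can compare the memory footprint of a single file before and after (a quick sketch reusing the hypothetical column names above; memory_usage(deep=True) also accounts for object/string columns):

sample = pd.read_csv(files[0], sep="\t", encoding='unicode_escape')
shrunk = next(downcaster([files[0]]))
print(sample.memory_usage(deep=True).sum())  # bytes before the downcast
print(shrunk.memory_usage(deep=True).sum())  # bytes after the downcast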
Oleg O
  • I don't have a memory limit, or at least I don't know that I have one – PV8 Nov 28 '19 at 14:11
  • I believe, there's always a limit, i.e. sooner or later if you keep your DataFrame growing, you stumble upon a MemoryError. What I showed before are the tricks that I usually do to avoid it (I work with tables of up to 30 GB, so this is my everyday trouble). – Oleg O Nov 28 '19 at 14:17
0

It turned out that there was an error in the file reading, so thanks @Oleg O for the help and the tricks to reduce memory usage.

For now I do not think there is an effect where Python automatically skips lines; it only happened because of a wrong encoding. You can find my example here: Pandas read csv skips some lines

PV8