
I'm writing a python script for the data acquisition phase of my project and so far I've been storing data in .csv files. While I was reading data from a particular .csv file, I got the error:

SyntaxError: EOL while scanning string literal

I took a look at the specific row in the file, and the data in the specific cell were indeed truncated. I am using pandas to store dicts to .csv, and it never threw an error. I guess .csv will save itself no matter what, even if that means truncating data without any warning.

I thought of switching to .xls. When the same row was being stored, an error came up saying something along the lines of:

Max character length reached. Max character length per cell was ~32k.

Then I thought that it might just be an Excel/LibreOffice Calc issue (I tried both): maybe they can't display the data in the cell, but the data are actually there. So I tried printing the specific cell; the data were indeed truncated. The specific cell contains a dict whose values are float, int, boolean or string. However, all of them have been converted to strings.

My question is, is there a way to fix it without changing the file format?

If I do have to change the file format, what would be an appropriate choice for storing very large files? I am thinking about HDF5.

In case you need more info, do let me know. Thank you!

The Hulk
  • If you're storing dicts, I'd definitely go with a file format that actually supports dicts and doesn't make you convert them to strings, such as JSON or msgpack (see the sketch after these comments). HDF5 is a reasonably good choice for large tables or arrays of numbers and strings, but it sounds like that's not actually what your data looks like... – tjollans Mar 03 '21 at 10:59
  • Another option is to store everything in a SQL database. Most databases can easily handle millions of entries. Some examples using pandas and SQL databases are [here](https://datacarpentry.org/python-ecology-lesson/09-working-with-sql/index.html), [here](https://towardsdatascience.com/python-pandas-and-sqlite-a0e2c052456f) and [here](https://pythonspeed.com/articles/indexing-pandas-sqlite/); a SQLite sketch also follows below. – Thymen Mar 03 '21 at 11:00
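
Following up on the JSON suggestion in the first comment, here is a minimal sketch of writing one dict per line ("JSON Lines") and reading the dicts back with their original types. The file name records.jsonl and the example records are made up for illustration and are not from the original question:

import json

# Hypothetical records; in the real script these would be the dicts
# that are currently being converted to strings inside a CSV cell.
records = [
    {"sensor": "temp_01", "value": 21.7, "ok": True, "count": 3},
    {"sensor": "temp_02", "value": 19.2, "ok": False, "count": 5},
]

# One JSON object per line, easy to append to during data acquisition.
with open("records.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading back returns dicts with float, int, bool and str values intact.
with open("records.jsonl") as f:
    loaded = [json.loads(line) for line in f]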
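
And a minimal sketch of the SQL route from the second comment, using pandas with SQLite. The database file acquisition.db, the table name measurements and the flat structure of the records are assumptions for the example, not details from the question:

import sqlite3
import pandas as pd

# Hypothetical flat records; nested dicts would need to be flattened first.
records = [
    {"sensor": "temp_01", "value": 21.7, "ok": True},
    {"sensor": "temp_02", "value": 19.2, "ok": False},
]

con = sqlite3.connect("acquisition.db")

# Append each batch of records as rows of a table.
pd.DataFrame(records).to_sql("measurements", con, if_exists="append", index=False)

# Query the data back later (or in chunks, for very large tables).
df = pd.read_sql("SELECT * FROM measurements", con)
con.close()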

1 Answer


There is a limit to the field size:

csv.field_size_limit([new_limit])
Returns the current maximum field size allowed by the parser. 
If new_limit is given, this becomes the new limit.

On my system (Python 3.8.0), I get:

>>> import csv
>>> csv.field_size_limit()
131072

which is exactly 128 × 1024 characters (128 KiB).

You could try to set the limit higher:

 csv.field_size_limit(your_new_limit)

But maybe a different file format would be better suited, depending on what kind of data you store.
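
As a sketch on top of this answer (not part of the original), a common pattern is to ask for the highest limit the platform accepts: sys.maxsize can overflow the C long used internally on some builds, so it is halved until the call succeeds. The file name data.csv is a placeholder.

import csv
import sys

# Raise the csv module's field size limit as high as the platform allows.
# On some builds csv.field_size_limit() raises OverflowError for values
# that do not fit in a C long, so keep halving until it is accepted.
max_field = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_field)
        break
    except OverflowError:
        max_field //= 2

# 'data.csv' is a placeholder for the file with the very long cells.
with open("data.csv", newline="") as f:
    rows = list(csv.reader(f))

As the comments below note, raising the limit only helps on the Python side; spreadsheet applications may still truncate or refuse such long cells.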

Thierry Lathuille
  • The field size limit can be increased but the receiving end might not be capable of ingesting CSV files with very long fields. If your final target is Excel, it is much more restricted than Python in what it can accept (but then probably the world will be a better place for you if you can get rid of Excel). – tripleee Mar 03 '21 at 11:18
  • Thank you for pointing me in the right direction! I used the solution from this link: https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072 to set the new limit. Even though LibreOffice Calc still throws the usual error message: "The data could not be loaded completely because the maximum number of characters per cell was exceeded", the data are indeed there. The length of the list of dicts was checked against the true length and they're the same. – The Hulk Mar 10 '21 at 11:47