348

I have a script that reads in a csv file with very large fields:

# example from http://docs.python.org/3.3/library/csv.html?highlight=csv%20dictreader#examples
import csv
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

However, this throws the following error on some csv files:

_csv.Error: field larger than field limit (131072)

How can I analyze csv files with huge fields? Skipping the lines with huge fields is not an option as the data needs to be analyzed in subsequent steps.

user1251007
  • Even better would be to consider _why_ there are such big fields. Is that expected in your data? Sometimes errors like these are indicative of a different problem. I had some Bad Data in mine that included a random double quote character and thus had to use the QUOTE_NONE option shown in another answer here. – dustmachine Apr 21 '16 at 16:35
  • I updated my question to indicate that in my case huge fields might occur. There is no bad data in the csv file. – user1251007 Apr 21 '16 at 18:53
  • @dustmachine Such things happen because sometimes you find people storing images (or other binary files) in base64 format in database tables. – wintermute Sep 23 '16 at 19:17

8 Answers

493

The csv file might contain very large fields; therefore, increase the field_size_limit:

import sys
import csv

csv.field_size_limit(sys.maxsize)

sys.maxsize works for Python 2.x and 3.x. sys.maxint would only work with Python 2.x (SO: what-is-sys-maxint-in-python-3)

Update

As Geoff pointed out, the code above might result in the following error: OverflowError: Python int too large to convert to C long. To circumvent this, you could use the following quick and dirty code (which should work on every system with Python 2 and Python 3):

import sys
import csv
maxInt = sys.maxsize

while True:
    # Decrease the maxInt value by a factor of 10
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)
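
For reference, a minimal end-to-end sketch that wraps this workaround in a helper and then reads the file from the question (the helper name and the 'some.csv' file name are placeholders of mine):

import csv
import sys

def set_max_field_size_limit():
    # Raise csv's field size limit as high as this platform allows,
    # backing off by a factor of 10 until the C long accepts the value.
    max_int = sys.maxsize
    while True:
        try:
            csv.field_size_limit(max_int)
            return max_int
        except OverflowError:
            max_int = int(max_int / 10)

set_max_field_size_limit()

with open('some.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)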
user1251007
  • On Windows 7 64bit with Python 2.6, `maxInt = sys.maxsize` returns `9223372036854775807L` which consequently results in a `TypeError: limit must be an integer` when calling `csv.field_size_limit(maxInt)`. Interestingly, using `maxInt = int(sys.maxsize)` does not change this. A crude workaround is to simply use `csv.field_size_limit(2147483647)`, which of course causes issues on other platforms. In my case this was adequate to identify the broken value in the CSV, fix the export options in the other application and remove the need for `csv.field_size_limit()`. – roskakori Oct 30 '14 at 15:02
186

This could be because your CSV file has embedded single or double quotes. If your CSV file is tab-delimited, try opening it as:

c = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
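
For context, a short sketch of how that reader might be used in full (the 'some.tsv' file name is just an example):

import csv

# QUOTE_NONE makes the parser treat quote characters as ordinary data,
# so an unbalanced quote can no longer swallow many rows into one field.
with open('some.tsv', newline='') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)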
CSP
  • Thank you!! If you are using csvkit (an excellent python library and command-line csv toolkit) and get the original error because your file uses unbalanced single or double quotes, you can select QUOTE_NONE via the `-u 3` command line option, aka `--quoting 3` – nealmcb Jan 25 '15 at 14:26
  • I had the error `field larger than field limit` because of a single double quote in a badly formatted CSV file. – Augusto Destrero May 17 '22 at 11:15
53

.csv field sizes are controlled via [Python.Docs]: csv.field_size_limit([new_limit]) (emphasis is mine):

Returns the current maximum field size allowed by the parser. If new_limit is given, this becomes the new limit.

It is set by default to 131072 or 0x20000 (128k), which should be enough for any decent .csv:

>>> import csv
>>>
>>>
>>> limit0 = csv.field_size_limit()
>>> limit0
131072
>>> "0x{0:016X}".format(limit0)
'0x0000000000020000'

However, when dealing with a .csv file (with the correct quoting and delimiter) having (at least) one field longer than this size, the error pops up.
To get rid of the error, the size limit should be increased (to avoid any worries, the maximum possible value is attempted).

Behind the scenes (check [GitHub]: python/cpython - (master) cpython/Modules/_csv.c for implementation details), the variable that holds this value is a C long ([Wikipedia]: C data types), whose size varies depending on CPU architecture and OS (data model). The classical difference: for a 64-bit OS (and Python build), the long type size (in bits) is:

  • Nix: 64
  • Win: 32

When attempting to set it, the new value is checked against the long boundaries; that is why in some cases another exception pops up (because sys.maxsize is typically 64 bits wide - encountered on Win):

>>> import sys, ctypes as ct
>>>
>>>
>>> "v{:d}.{:d}.{:d}".format(*sys.version_info[:3]), sys.platform, sys.maxsize, ct.sizeof(ct.c_void_p) * 8, ct.sizeof(ct.c_long) * 8
('v3.9.9', 'win32', 9223372036854775807, 64, 32)
>>>
>>> csv.field_size_limit(sys.maxsize)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long

To avoid running into this problem, set the (maximum possible) limit (LONG_MAX), using an artifice (thanks to [Python.Docs]: ctypes - A foreign function library for Python). It should work on Python 3 and Python 2, on any CPU / OS.

>>> csv.field_size_limit(int(ct.c_ulong(-1).value // 2))
131072
>>> limit1 = csv.field_size_limit()
>>> limit1
2147483647
>>> "0x{0:016X}".format(limit1)
'0x000000007FFFFFFF'

64-bit Python on a Nix-like OS:

>>> import sys, csv, ctypes as ct
>>>
>>>
>>> "v{:d}.{:d}.{:d}".format(*sys.version_info[:3]), sys.platform, sys.maxsize, ct.sizeof(ct.c_void_p) * 8, ct.sizeof(ct.c_long) * 8
('v3.8.10', 'linux', 9223372036854775807, 64, 64)
>>>
>>> csv.field_size_limit()
131072
>>>
>>> csv.field_size_limit(int(ct.c_ulong(-1).value // 2))
131072
>>> limit1 = csv.field_size_limit()
>>> limit1
9223372036854775807
>>> "0x{0:016X}".format(limit1)
'0x7FFFFFFFFFFFFFFF'

For 32-bit Python, things should run smoothly without the artifice (as both sys.maxsize and LONG_MAX are 32 bits wide).
If this maximum value is still not enough, then the .csv would need manual intervention in order to be processed from Python.

Check the resources linked above for more details.
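
Putting the above together, a compact sketch (the helper name is mine) that sets the limit to the platform's LONG_MAX via the ctypes artifice:

import csv
import ctypes as ct

def set_field_size_limit_to_long_max():
    # ULONG_MAX // 2 == LONG_MAX for the platform's C long (32- or 64-bit).
    long_max = int(ct.c_ulong(-1).value // 2)
    csv.field_size_limit(long_max)
    return long_max

print(set_field_size_limit_to_long_max())  # 2147483647 on Win, 9223372036854775807 on 64-bit Nix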

CristiFati
39

Check the current limit:

csv.field_size_limit()

Out[20]: 131072

Increase the limit by adding this to your code:

csv.field_size_limit(100000000)

Check the limit again:

csv.field_size_limit()

Out[22]: 100000000

Now you won't get the error "_csv.Error: field larger than field limit (131072)"

Tad
7

I just had this happen to me on a 'plain' CSV file; some people might call it an invalidly formatted file. It had no escape characters, no double quotes, and the delimiter was a semicolon.

A sample line from this file would look like this:

First cell; Second " Cell with one double quote and leading space;'Partially quoted' cell;Last cell

The single double quote in the second cell would throw the parser off its rails. What worked was:

reader = csv.reader(inputfile, delimiter=';', doublequote=False, quotechar='', quoting=csv.QUOTE_NONE)
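
For illustration, a self-contained sketch (using io.StringIO in place of the actual file) that parses the sample line above; only delimiter and quoting are strictly needed here, since with QUOTE_NONE the quote character is never treated specially:

import csv
import io

# The problematic sample line from above, as an in-memory "file".
sample = ('First cell; Second " Cell with one double quote and leading space;'
          "'Partially quoted' cell;Last cell\n")

# With QUOTE_NONE the stray quote characters are kept as ordinary text.
reader = csv.reader(io.StringIO(sample), delimiter=';', quoting=csv.QUOTE_NONE)
print(next(reader))
# ['First cell', ' Second " Cell with one double quote and leading space',
#  "'Partially quoted' cell", 'Last cell']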
Steffen Winkler
  • `doublequote` is ignored when `quoting=csv.QUOTE_NONE` – 2Toad Oct 22 '20 at 23:55
  • @2Toad that is interesting to know. I thought I needed to specify both, but maybe I never tried `quoting=csv.QUOTE_NONE` before adding `quotechar=''`and `doublequote='False'`. Gonna dig a bit on this during the weekend. – Steffen Winkler Oct 27 '20 at 11:06
2

Sometimes a row contains a column with a double quote. When the csv reader tries to read such a row, it does not find the end of the column and raises this error. A possible solution is below:

reader = csv.reader(cf, quoting=csv.QUOTE_MINIMAL)
Ahmet Erkan ÇELİK
-2

You can use the error_bad_lines option of pd.read_csv to skip these lines.

import pandas as pd

data_df = pd.read_csv('data.csv', error_bad_lines=False)

This works because the "bad lines", as defined in pandas, include lines where one of the fields exceeds the csv field limit.

Be careful: this solution is valid only when the fields in your csv file should not be this long. If you expect to have big field sizes, this will throw away your data.
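
Note that in recent pandas versions (1.3 and later) error_bad_lines is deprecated, and on_bad_lines='skip' is the replacement; a minimal sketch of the equivalent call:

import pandas as pd

# Skip rows pandas considers malformed instead of raising (pandas >= 1.3).
data_df = pd.read_csv('data.csv', on_bad_lines='skip')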

0x01h
  • There is no bad line ... as written in the question: the csv file contains huge fields and this data needs to be analyzed. – user1251007 Oct 10 '19 at 15:12
  • The *bad lines* concept in `pandas` includes rows which exceed the field limit of `csv`. So, if you want to skip these lines and read the other lines successfully, you can use this solution. Otherwise, if you need the huge fields, increasing the field limit via `csv.field_size_limit(100000000)` is appropriate. – 0x01h Oct 11 '19 at 08:26
  • You should explain why you use `error_bad_lines` – dinhanhx May 27 '21 at 09:31
-3

Find the cqlshrc file, usually placed in the .cassandra directory.

In that file, append:

[csv]
field_size_limit = 1000000000
  • entities should not be multiplied without necessity – djh Nov 12 '20 at 12:44
  • There is no Cassandra here. This would be a suitable answer on https://stackoverflow.com/questions/24168235/cassandra-cqlsh-text-field-limit-on-copy-from-csv-field-larger-than-field-limit – tripleee Feb 19 '21 at 06:03