
I have Pandas v0.24+, and I'm looking through: Keeping array type as integer while having a NaN value

I'm getting the usual ValueError when trying to read in integer columns that contain NaN values.

Pandas: ValueError: Integer column has NA values in column 33

This is because integer dtypes cannot hold NA values. The problem is that I don't actually know the datatypes of my csv; I'd still like pandas to 'infer' what they are. Is there a way it can do this while using Int64 by default instead of int64, so that it doesn't halt and complain about NA values in the process?
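To illustrate the kind of failure (with made-up data, not my actual csv), the same error appears when a plain int64 dtype ends up being requested for a column containing a missing value:

import io
import pandas as pd

# toy csv (made up for illustration): column "a" has a missing value in the second row
data = "a,b\n1,x\n,y\n"

# asking for a plain numpy int64 fails, because int64 cannot hold NaN
pd.read_csv(io.StringIO(data), dtype={"a": "int64"})
# ValueError: Integer column has NA values in column 0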

EDIT: This is what happens

df = pd.read_csv(file)

Then

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/christopherturnbull/DATA_SCIENCE/PointTopic/access_test_v3.py", line 18, in <module>
    df = mdb.read_table(rdb_file,'v31a_v8_oct20_point_topic_availability_deliverable_201118')
  File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas_access/__init__.py", line 127, in read_table
    return pd.read_csv(proc.stdout, *args, **kwargs)
  File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas/io/parsers.py", line 460, in _read
    data = parser.read(nrows)
  File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas/io/parsers.py", line 1198, in read
    ret = self._engine.read(nrows)
  File "/Users/christopherturnbull/DATA_SCIENCE/virtualenvs/pointtopic/lib/python3.8/site-packages/pandas/io/parsers.py", line 2157, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1104, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1198, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Integer column has NA values in column 33

But df = pd.read_csv(file, header=None) seems to work, although now I don't have the dtypes.

  • Can you add the input and the code you are using? My question is: why is this error thrown when you read the csv? I believe pandas by default will interpret columns with NaN as float... So my guess is that you are converting the columns afterwards; do you know which columns to convert? – Dani Mesejo Dec 15 '20 at 09:04
  • *"so that it doesn't halt and complain about NA values in the process"* is unclear, you need to post a reproducible snippet of data. Is this happening with `pd.read_csv()`, or which command? Post your code and the error you're getting. – smci Dec 15 '20 at 09:18

1 Answer


As far as I know, you need to specify the dtype when reading the csv. Also, in the documentation of nullable integers for pandas 0.24 (a passage since removed in the stable version), you can find the following:

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series
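For example, a minimal sketch of what "explicitly pass the dtype" looks like:

import pandas as pd

# explicitly request the nullable integer extension dtype
s = pd.Series([1, 2, None], dtype="Int64")
arr = pd.array([1, 2, None], dtype="Int64")

print(s.dtype)    # Int64
print(arr.dtype)  # Int64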

As an alternative, you could use convert_dtypes (available since pandas 1.0):

import pandas as pd
import io

# "nan" in the col column is parsed as a missing value by read_csv
s = """val,col
hello,1
world,nan"""

df = pd.read_csv(io.StringIO(s))
res = df.convert_dtypes()  # re-infer each column, preferring the nullable extension dtypes
print(res.dtypes)

Output

val    string
col     Int64
dtype: object

The documentation of convert_dtypes states:

convert_integer : bool, default True
    Whether, if possible, conversion can be done to integer extension types.

Note that in the example above, the original dtype of col was float64:

print(df.dtypes)

Output (for the df resulting from read_csv)

val     object
col    float64
dtype: object
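As a quick illustration of the convert_integer flag quoted above (continuing the same snippet; the exact floating dtype you get back may vary with the pandas version):

# with integer conversion disabled, col keeps a floating dtype instead of Int64
res_no_int = df.convert_dtypes(convert_integer=False)
print(res_no_int.dtypes)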

UPDATE

It seems that something is throwing off the inference engine, but since the problem is located in column 33, you could specify the dtype for just that column. Try:

df = pd.read_csv(file, dtype={33: pd.Int64Dtype()})
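For instance, a small sketch with a toy csv (the column name number_col and the data are made up for illustration) showing that a nullable Int64 dtype accepts the missing value instead of raising:

import io
import pandas as pd

# toy data: number_col has a missing value in the second row
data = "name,number_col\nfoo,1\nbar,\n"

df = pd.read_csv(io.StringIO(data), dtype={"number_col": "Int64"})
print(df.dtypes)  # number_col comes back as Int64, with <NA> for the missing entry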

The reason that using

df = pd.read_csv(file, header=None)

works is that it makes the header part of the column values, so since those are strings every column is interpreted as dtype object, as in:

import pandas as pd
import io

s = """val,col,bad
hello,1,1.5
world,,2.3"""

# with header=None every column contains at least one string, so all come back as object
df = pd.read_csv(io.StringIO(s), header=None)
print(df)

Output

       0    1    2
0    val  col  bad
1  hello    1  1.5
2  world  NaN  2.3

As can be seen, the headers became the values of the first row.

Dani Mesejo