15

I have a #-separated file with three columns: the first is an integer, the second looks like a float (but isn't), and the third is a string. I tried loading it directly into Python with pandas.read_csv:

In [149]: d = pandas.read_csv('resources/names/fos_names.csv',  sep='#', header=None, names=['int_field', 'floatlike_field', 'str_field'])

In [150]: d
Out[150]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1673 entries, 0 to 1672
Data columns:
int_field          1673  non-null values
floatlike_field    1673  non-null values
str_field          1673  non-null values
dtypes: float64(1), int64(1), object(1)

pandas tries to be smart and automatically converts fields to a useful type. The problem is that I don't actually want it to do so (if I did, I'd have used the converters argument). How can I prevent pandas from converting types automatically?

Cœur
duckworthd
  • It may not be avoidable. Pandas (and NumPy more generally) doesn't support NaN for integers. Since reading from any CSV leaves the possibility for NaN, it might have been a convenience choice to just have it coerce to float always. Also, your types seem backwards from your printout. It surely *won't* convert floatlike input to an int64, though it presumably *will* convert intlike input to float64. – ely Aug 23 '12 at 23:37
  • I don't believe the order of the dtypes listed has anything to do with the order of the columns, but with regard to automatic type coercion, this would be most unfortunate. I was under the impression pandas used NaN anytime information was missing, regardless of type. – duckworthd Aug 23 '12 at 23:46
  • 1
    Yes, it does, but then that column will always be of type Object. You can have an Object column that's mostly Ints. However, there are many cases where the NaN coerces it into a float column instead. If you first make an object column and then fill it with possibly-NaN integers, it will stay as object. I think if you just fill it with whatever's there, leaving Pandas to discern the type, it will choose float for any numeric type than has NaNs present. This is not just a limitation of Pandas, but of NumPy and Python entirely. There is no library I'm aware of offering an Int with NaN support. – ely Aug 23 '12 at 23:52
  • @EMS Don't forget that NumPy also comes with `MaskedArray`s, which can flag data as missing/invalid/whatever... `np.genfromtxt` naturally supports `MaskedArray`s, which can be useful. – Pierre GM Aug 24 '12 at 19:07
  • 1
    @EMS As a second comment, there are talks about adding some NaN-like value to `int` `ndarrays`. – Pierre GM Aug 24 '12 at 19:08
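
A minimal sketch of the coercion discussed in the comments above (not from the thread; the values are made up):

import numpy as np
import pandas

# All-int data stays int64.
s = pandas.Series([1, 2, 3])
print(s.dtype)   # int64

# A single NaN forces float64, since NumPy has no integer NaN.
s = pandas.Series([1, 2, np.nan])
print(s.dtype)   # float64

# An explicit object column keeps its ints as Python objects instead.
s = pandas.Series([1, 2, np.nan], dtype=object)
print(s.dtype)   # object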

3 Answers

12

I'm planning to add explicit column dtypes in the upcoming file parser engine overhaul in pandas 0.10. I can't commit myself 100% to it, but it should be pretty simple with the new infrastructure coming together (http://wesmckinney.com/blog/?p=543).
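
For illustration, a minimal sketch of what explicit per-column dtypes look like with the dtype keyword that read_csv eventually gained (reusing the question's file and column names; not part of the original answer):

import pandas

# Give the parser an explicit dtype per column instead of letting it infer.
d = pandas.read_csv('resources/names/fos_names.csv', sep='#', header=None,
                    names=['int_field', 'floatlike_field', 'str_field'],
                    dtype={'int_field': int,
                           'floatlike_field': str,
                           'str_field': str})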

Wes McKinney
8

I think your best bet is to read the data in as a NumPy record array first, then build the DataFrame from that.

# what you described:
In [15]: import numpy as np
In [16]: import pandas
In [17]: x = pandas.read_csv('weird.csv')

In [19]: x.dtypes
Out[19]: 
int_field            int64
floatlike_field    float64  # what you don't want?
str_field           object

In [20]: datatypes = [('int_field','i4'),('floatlike','S10'),('strfield','S10')]

In [21]: y_np = np.loadtxt('weird.csv', dtype=datatypes, delimiter=',', skiprows=1)

In [22]: y_np
Out[22]: 
array([(1, '2.31', 'one'), (2, '3.12', 'two'), (3, '1.32', 'three ')], 
      dtype=[('int_field', '<i4'), ('floatlike', '|S10'), ('strfield', '|S10')])

In [23]: y_pandas = pandas.DataFrame.from_records(y_np)

In [25]: y_pandas.dtypes
Out[25]: 
int_field     int64
floatlike    object  # better?
strfield     object

Paul H
  • This is a neat idea. Worth adding as a recipe even now in 0.13.1, because `read_csv(dtype, converters, ...)` still has many issues. – smci Apr 18 '14 at 03:37
  • if you still want to read directly with pandas you can likewise do: `datatypes = {'strfield': str, 'int_field': int}; pd.read_csv(path, dtype=datatypes, ...)` – lfvv Oct 11 '20 at 20:51
0

As per the docs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), you can now

Use str or object together with suitable na_values settings to preserve and not interpret dtype.

So, for instance, this keeps every column as the default object dtype and lets you convert them afterwards without auto-inference:

pandas.read_csv(csv_path, dtype=object)
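
If you then want real types for particular columns, one possible follow-up (using the question's file and column names) is to convert them by hand:

import pandas

d = pandas.read_csv('resources/names/fos_names.csv', sep='#', header=None,
                    names=['int_field', 'floatlike_field', 'str_field'],
                    dtype=object)
# Everything arrives as object; convert only the columns you choose.
d['int_field'] = d['int_field'].astype(int)
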
Barnercart