
What exactly happens when Pandas issues this warning? Should I worry about it?

In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139:
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

I assume this means that Pandas is unable to infer a single type from the values in those columns. But if that is the case, what type does Pandas end up using for them?

Also, can the types always be recovered after the fact (i.e., after getting the warning), or are there cases where I may not be able to recover the original information correctly and should pre-specify the type?

Finally, how exactly does low_memory=False fix the problem?

Amelio Vazquez-Reina

2 Answers


Revisiting mbatchkarov's link: low_memory is not deprecated. It is now documented:

low_memory : boolean, default True

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
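In practice that leaves two fixes; a minimal sketch (the file name and column names here are placeholders, not from the question):

import pandas as pd

# Option 1: read the whole file in one pass, so dtypes are inferred from all rows
df = pd.read_csv('my_file.csv', low_memory=False)

# Option 2: pre-specify the dtype, so no inference is needed for those columns
df = pd.read_csv('my_file.csv', dtype={'col_a': str, 'col_b': float})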

I asked what "resulting in mixed type inference" means, and chris-b1 answered:

It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but instead a number of bytes, so whether you get a mixed dtype warning or not can feel a bit random.

So, what type does Pandas end up using for those columns?

This is answered by the following self-contained example:

import pandas as pd
from io import StringIO

# one column holding a million integers followed by a single string
df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

type(df.loc[524287,'0'])
Out[50]: int

type(df.loc[524288,'0'])
Out[51]: str

The first chunk of the CSV data was seen to contain only ints, so it was converted to int; the second chunk also contained a string, so all of its entries were kept as strings.
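For completeness: even though the individual values keep their Python types, the column as a whole ends up with the object dtype, which you can confirm by continuing the session above:

df['0'].dtype
Out[52]: dtype('O')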

Can the types always be recovered after the fact (after getting the warning)?

I guess re-exporting to CSV and re-reading the result with low_memory=False should do the job.
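The conversion can also be done in memory without the CSV round trip; a sketch using pd.to_numeric (note that errors='coerce' turns entries that fail to parse into NaN, so inspect those before trusting the result):

# continuing the example above: convert the mixed column back to numbers
df['0'] = pd.to_numeric(df['0'], errors='coerce')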

How exactly does low_memory=False fix the problem?

It reads the whole file before deciding the types, and therefore needs more memory.
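Rerunning the example above with low_memory=False illustrates this (a sketch, assuming the same imports and data): the column is inferred in a single pass, no warning is raised, and since one value is a string I would expect every entry to come back as str:

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])), low_memory=False)

type(df.loc[524287,'0'])
Out[53]: str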

Robert Pollak
  • I first got the DType warning, which goes away after I use low_memory=False. However, there's this error `Error: C stack usage 528430048 is too close to the limit Error: C stack usage 528429312 is too close to the limit`. This had been there even before the low_memory=False flag. Any fix for this? I changed the R version as mentioned [here](https://stackoverflow.com/questions/14719349/error-c-stack-usage-is-too-close-to-the-limit), it worked just once, but I get the error consistently now. – murphy1310 Jul 19 '18 at 14:40

low_memory is apparently kind of deprecated, so I wouldn't bother with it.

The warning means that some of the values in a column have one dtype (e.g. str), and some have a different dtype (e.g. float). I believe pandas uses the lowest common super type, which in the example I used would be object.

You should check your data, or post some of it here. In particular, look for missing values or inconsistently formatted int/float values. If you are certain your data is correct, then use the dtype parameter to help pandas out.
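A sketch of both suggestions (the file and column names are placeholders): first inspect the raw values that break numeric parsing, then pin the dtype on import.

import pandas as pd

df = pd.read_csv('my_file.csv', low_memory=False)

# show the raw values in a suspect column that fail numeric conversion
col = df['suspect_column']
bad = pd.to_numeric(col, errors='coerce').isna() & col.notna()
print(col[bad].unique())

# once the data is confirmed correct (or cleaned), pre-specify the type
df = pd.read_csv('my_file.csv', dtype={'suspect_column': float})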

mbatchkarov