1

My code:

raw_data = pd.read_csv("C:/my.csv")

After I ran it, the file loaded, but I got this warning:

C:\Users\user\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (0,79,237,239,241,243,245,247,248,249,250,251,252,253,254,255,256,258,260,262,264) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)

Questions:

  1. What exactly does it mean?
  2. How do I fix it?

Sorry, I cannot share the data.

vasili111
  • 1
    Does [this](https://stackoverflow.com/a/27232309) help? have a read on [how to ask good pandas questions](https://stackoverflow.com/q/20109391) as well. – hongsy Jan 21 '20 at 15:00
  • 2
    The warning is telling you that those columns have mixed data types. Meaning, for example, column 79 you might expect to be a `date` format. However, in your file, you might have '01/01/2020' but you also have 43831 in another row. Pandas is trying to determine the type for you, but it's warning you that a consistent type can't be assigned because the data is inconsistent. – gbeaven Jan 21 '20 at 15:00
  • @gbeaven You mean "'01/01/2020' but you also have 43831 in another **column**"? – vasili111 Jan 21 '20 at 15:01
  • no, *row* is correct. pandas has to read the entire file into memory (which can result in OOM). Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file. – hongsy Jan 21 '20 at 15:03
  • 2
    @vasili111 No, I mean row. One is expected to be a `date` type while the other an `int` in the same column. I'm suggesting you have differing types (inconsistent) of data in the same column. – gbeaven Jan 21 '20 at 15:04
  • @hongsy Not completely. I want to understand what will happen if I leave it as is for now and, in the data-management phase, correct the data by looking at each column and fixing inconsistencies (for example, removing all non-dates from a date column). Can I safely ignore the warning until then? Will that resolve this error/warning? – vasili111 Jan 21 '20 at 15:11
  • You can safely ignore [DtypeWarnings](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.DtypeWarning.html#pandas-errors-dtypewarning) if memory is not an issue. pandas will preserve the raw data as `str`s (an object dtype), albeit with more memory – hongsy Jan 21 '20 at 15:16
  • @hongsy Your last comment solves my issue. Thank you. – vasili111 Jan 21 '20 at 15:18
  • If your matter is solved, please mark the answer as accepted so that others can see that your question has been answered. – FredrikHedman Mar 03 '20 at 12:16
  • @FredrikHedman Done. Thank you for the reminder. – vasili111 Mar 03 '20 at 15:22

4 Answers

2

Try this:

raw_data = pd.read_csv("C:/my.csv", low_memory=False)
  • From here: https://stackoverflow.com/a/27232309/1601703 it looks like `low_memory=False` is deprecated. What does it actually do? – vasili111 Jan 21 '20 at 15:03
  • It's not deprecated; the warning itself says so ("Specify dtype option on import or set low_memory=False."), and it has worked for me previously. – venkatadileep Jan 21 '20 at 15:04
  • 1
    @vasili111 it isn't deprecated: if you read the docs it says: `low_memory : bool, default True Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser).` – anky Jan 21 '20 at 15:05
1

pd.read_csv has a number of parameters that will give you control over how to treat the different columns.

Without the data it is hard to be specific, so read up on what the options `dtype` or `converters` can do.

See the pandas manual for more details.

A first try could be

raw_data = pd.read_csv("C:/my.csv", dtype=str)

This should allow you to read the data and figure out how to set the data type on the columns that really matter.
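A sketch of that workflow, using hypothetical data in place of the real file: load everything as strings, then probe a column to find the values that would break a numeric conversion.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV, which wasn't shared.
csv_text = "a,b\n1,x\n2,y\n?,z\n"

# dtype=str loads every column as plain strings, so nothing is coerced yet.
raw_data = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Find the values in column "a" that would not survive a numeric conversion.
as_num = pd.to_numeric(raw_data["a"], errors="coerce")
print(raw_data.loc[as_num.isna(), "a"].tolist())  # ['?']
```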

FredrikHedman
  • Can I fix this by casting the problematic columns to the data type that I expect? For example, a string type on a column that mixes strings and numerics? – vasili111 Jan 21 '20 at 15:16
  • Yes, you can do this. However, it is best to first understand why the data has mixed types in the relevant columns. – FredrikHedman Jan 21 '20 at 15:23
  • Sure, that is what I meant. First I will load the data with `raw_data = pd.read_csv("C:/my.csv")`. After receiving that warning, I will look at the problematic columns and find out why pandas may think there are different data types (for example, strings and numericals in one column). Then I will fix that by changing the data if possible (recode, make NaN, etc.), and finally cast the column to the data type I think is the correct one. – vasili111 Jan 21 '20 at 15:28
  • 1
    Sounds like a good plan. Good luck with your data wrangling :) – FredrikHedman Jan 21 '20 at 16:14
1

Pandas will read all the data into memory. If your CSV is large, this can be a problem.

chunks = []
for chunk in pd.read_csv('desired_file...', chunksize=1000):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

This will read the CSV into memory in chunks instead of all at once.
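Note that chunking alone does not remove the DtypeWarning; combining it with an explicit `dtype` does, since pandas then has no per-chunk type inference to do. A sketch with hypothetical data:

```python
import io
import pandas as pd

# Hypothetical small CSV standing in for a large file.
csv_text = "col\n" + "\n".join(str(i) for i in range(10))

# An explicit dtype removes the need for per-chunk type inference,
# which is what produces the DtypeWarning in the first place.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4, dtype=str)
df = pd.concat(reader, ignore_index=True)
print(len(df))  # 10
```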

Gess123
0

Try using the `dtype` parameter of `pandas.read_csv`.

You can find it here: [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

In my case, I just load all the columns as strings, and after loading the dataset, I convert the columns I need to numbers using

DataFrame[Column] = pandas.to_numeric(DataFrame[Column], errors='coerce')
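For example, on a hypothetical column that mixes numbers with a date string (like the mix described in the comments), `errors='coerce'` turns the non-numeric values into NaN:

```python
import pandas as pd

# Hypothetical mixed column, similar to the date/number mix from the comments.
df = pd.DataFrame({"amount": ["10", "20", "01/01/2020", "30"]})
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
print(df["amount"].tolist())  # [10.0, 20.0, nan, 30.0]
```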