1

Why does read_csv automatically convert all my columns into 'object' dtypes? I want to read a 10 GB CSV (float and int columns) and load it into a pandas DataFrame. I don't run into this issue (where all numeric columns are converted to object) if I read a smaller file (100 MB or less) with either pandas or dask.

CSV sample

I tried to specify the dtypes explicitly as part of read_csv; I still ended up with objects (verified after the read with df.dtypes):

import pandas as pd
import numpy as np

file = 'D:/path/combine.csv'
# note: np.int is just an alias for Python's built-in int (removed in newer NumPy releases)
data_type = {'Lat': np.float32, 'Long': np.float32,
             'HorizontalAccuracy': np.int, 'RSRP': np.int}
data = pd.read_csv(file, low_memory=False, dtype=data_type)
data.dtypes

I tried to read the header row to get the column names, build a dtype dict covering every column, and then read the file with those dtypes; I still ended up with all objects:

file = 'D:/path/combine.csv'
col_names = pd.read_csv(file, nrows=0).columns   # read only the header row
types_dict = data_type                           # start from the dict above
types_dict.update({col: np.int64 for col in col_names if col not in types_dict})
data = pd.read_csv(file, low_memory=False, dtype=types_dict)
data.dtypes

TypeError: Cannot cast array from dtype('O') to dtype('float32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError: could not convert string to float: '\x1a'

I tried read_csv with dask while explicitly specifying the dtypes; I got an error about not being able to convert a string to float:

import numpy as np
import dask.dataframe as dd

file = 'D:/path/combine.csv'
data_type = {'Lat': np.float32, 'Long': np.float32,
             'HorizontalAccuracy': np.int, 'RSRP': np.int}
ddf = dd.read_csv(file, dtype=data_type)
ddf.compute()

TypeError: Cannot cast array from dtype('O') to dtype('float32') according to the rule 'safe'
ValueError: could not convert string to float: 'Latitude'

Cuezy
  • We can't guess. Give us a [mcve] which shows when Pandas uses `object` unnecessarily. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) if you need help with this. – jpp Dec 28 '18 at 20:32
  • Keep in mind that "string"-type columns will always show up as "objects" in pandas. The float and int types are as expected, though. (See the short sketch after these comments.) – webelo Dec 28 '18 at 20:34
  • you should probably include the head of your dataset, the code you are using (just the relevant part) to load the csv to a dataframe, and then your error message. Describing to us with words is less informative than showing code. – d_kennetz Dec 28 '18 at 20:37
  • @jpp I can't reproduce it here since the problem doesn't appear without loading a large file – Cuezy Dec 28 '18 at 21:32
  • @webelo sorry, I didn't have any strings in my input file; I edited my question. Thanks for the info, though. – Cuezy Dec 28 '18 at 21:32
  • @d_kennetz got it, please have another look. – Cuezy Dec 28 '18 at 21:33
  • The variable names in your sample data don't match the variable names in the `data_type` dict. They need to in order for the `dtypes` argument to work. If that doesn't solve it consider posting a plaintext version of your sample data: the excel-style formatting might hide important characters. – webelo Dec 28 '18 at 22:26
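
As a quick side note on webelo's comment about string columns, here is a minimal sketch with hypothetical inline data (not the OP's file) showing that a single non-numeric value is enough to turn a whole column into object:

import pandas as pd
from io import StringIO

# hypothetical data: one stray non-numeric token in the RSRP column
csv_text = "Lat,Long,RSRP\n47.46,-123.05,-115\n47.62,-122.81,bad\n"

df = pd.read_csv(StringIO(csv_text))
print(df.dtypes)
# Lat     float64
# Long    float64
# RSRP     object   <- one string value forces the whole column to object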

2 Answers

0

There are some issues with your code; most importantly, the column names in your CSV file are different from the keys of the dtype dict you are passing. This will cause problems because they need to match exactly. So, using this as my input file (I tried to replicate yours as closely as possible):

Location_Lat,Location_Long,Location_H,Location_Z,QOS_RSRP
47.46058,-123.053,6,98588,-115
47.62989,-122.81,87,98312,-114
47.54865,-122.859,9,98312,-113

I noticed that you had one extra column (Location_Z) that you did not include in your dtypes. I set up a DataFrame using only the specified columns, based on your information above. You can specify everything you want directly in the pd.read_csv() call, as follows:

import pandas as pd
import numpy as np

df = pd.read_csv('fun.csv', header=0,
                 usecols=['Location_Lat',
                          'Location_Long',
                          'Location_H',
                          'QOS_RSRP'],
                 dtype={'Location_Lat': np.float32,
                        'Location_Long': np.float32,
                        'Location_H': np.int,
                        'QOS_RSRP': np.int},
                 low_memory=False)
print(df)
print(df.dtypes)

A couple of things to note. I explicitly set header=0. This ensures that pandas uses the first line of the CSV as the header (so I know what the column names will be). If your header is on a different line, just set header= to that line number.

Second, I told pandas to use only 4 of the 5 columns. Since you did not specify a dtype for the Location_Z column, I did not include it, but if you want it you can add it to the usecols= parameter above and specify its dtype in the dtype= parameter.

Lastly, the dtype dict uses the column names from the DataFrame to assign dtypes. Fortunately, we assigned the columns with header=0, so pandas already "knows" the column names. In pandas, floats and ints have a much lower memory cost than string objects. The likely reason your DataFrame returns object dtypes for everything is that the header is being read in as the first data row, in which case pandas assigns each whole column the object dtype. The result of the code above, printed to screen, is:

[dkennetz@hpc02  fun_temp]$ python3.5 pandas_fun.py
   Location_Lat  Location_Long  Location_H  QOS_RSRP
0     47.460579    -123.053001           6      -115
1     47.629890    -122.809998          87      -114
2     47.548649    -122.859001           9      -113
Location_Lat     float32
Location_Long    float32
Location_H         int64
QOS_RSRP           int64
dtype: object

So we can see that only 4 out of the 5 columns are loaded into the DataFrame, and that the dtypes are indeed float32 and int64 as assigned. The dtype: object at the bottom refers to the Series returned by df.dtypes itself (its values are dtype objects); it does not mean your data is stored as objects.
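
To illustrate the header point above, here is a minimal sketch with hypothetical inline data (not the OP's file): if the header line ends up being parsed as data, every column contains at least one string and pandas falls back to object for all of them.

import pandas as pd
from io import StringIO

csv_text = "Location_Lat,Location_Long\n47.46058,-123.053\n47.62989,-122.81\n"

# header parsed as data: every column picks up a string value -> object dtype
bad = pd.read_csv(StringIO(csv_text), header=None)
print(bad.dtypes)   # both columns: object

# header handled correctly: numeric dtypes are inferred as expected
good = pd.read_csv(StringIO(csv_text), header=0)
print(good.dtypes)  # both columns: float64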

I hope this clears up any questions you may have had!

d_kennetz
  • Thanks so much for your time. I finally figured out the problem: the placement of "header=0" actually matters! In my code I had dd.read_csv(file, usecols=twr_coln, dtype=data_type, header=0), and dask must have tried to process the whole file BEFORE seeing that header=0, resulting in a long processing time. I needed the code like dd.read_csv(file, header=0, usecols=twr_coln, dtype=data_type); then the processing time was drastically reduced. Your comment helped me troubleshoot! – Cuezy Jan 02 '19 at 17:21
  • That is good! The order does indeed matter because of the size of your dataset. If I do the same thing with a 3-column CSV of 51 GB (passing header=0 last), pandas throws an out-of-memory error, but if I pass it first the in-memory size of the dataset is drastically reduced because of the datatypes. And I am talking about pandas, not dask. Dask inherits the pandas API, so this could be where your issue is occurring. It has to do with how pandas reads in information from large files (it looks at something like the first 100 lines) and tries to guess the datatype based on those lines. – d_kennetz Jan 02 '19 at 17:47
0

My problem of timeouts, and of the entire large CSV being read as object despite it containing only numbers, was caused by not specifying "header=0". More importantly, where "header=0" is placed in your read_csv call determined whether it worked.

Wrong code:
dd.read_csv(file, usecols=twr_coln, dtype=data_type, header=0)

Right code:
dd.read_csv(file, header=0, usecols=twr_coln, dtype=data_type)
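
For completeness, here is a self-contained sketch of the working call. The path, twr_coln, and data_type below are stand-ins based on the question, so adjust them to the actual header names in your file:

import numpy as np
import dask.dataframe as dd

file = 'D:/path/combine.csv'                                # path from the question
twr_coln = ['Lat', 'Long', 'HorizontalAccuracy', 'RSRP']    # assumed column names
data_type = {'Lat': np.float32, 'Long': np.float32,
             'HorizontalAccuracy': np.int64, 'RSRP': np.int64}

# header=0 marks the first line as the header row so it is not parsed as data,
# and the explicit dtype dict stops dask from guessing dtypes from a sample.
ddf = dd.read_csv(file, header=0, usecols=twr_coln, dtype=data_type)
print(ddf.dtypes)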

Cuezy