
The pandas function read_csv() reads a .csv file. Its documentation is here.

According to the documentation:

dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (unsupported with engine='python')

and

converters : dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

When using this function, I can call either pandas.read_csv('file', dtype=object) or pandas.read_csv('file', converters=object). Obviously the name converters suggests that the data will be converted, but what exactly happens in the case of dtype?
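For concreteness, a minimal sketch of the two call styles (io.StringIO just stands in for a file here; note that converters expects a dict of column -> function rather than a bare type):

```python
import io
import pandas as pd

t = "a,b\n001,2.5"

# dtype=object keeps every column as the raw string
df1 = pd.read_csv(io.StringIO(t), dtype=object)
print(df1['a'][0])   # '001' stays zero-padded

# converters maps column labels to functions applied to each cell
df2 = pd.read_csv(io.StringIO(t), converters={'a': int})
print(df2['a'][0])   # 1
```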

smci
Bryan

3 Answers


The semantic difference is that dtype allows you to specify how to treat the values, for example, either as numeric or as string type.

converters allows you to parse your input data to convert it to a desired dtype using a conversion function, e.g. parsing a string value to datetime or to some other desired dtype.

Here we see that pandas tries to sniff the types:

In [2]:
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""
df = pd.read_csv(io.StringIO(t))
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null int64
float    1 non-null float64
date     1 non-null object
str      1 non-null int64
dtypes: float64(1), int64(2), object(1)
memory usage: 40.0+ bytes

You can see from the above that 001 and 005 are treated as int64 (losing the leading zeros) but the date column stays as a plain string (object).

If we say everything is object then essentially everything is str:

In [3]:
df = pd.read_csv(io.StringIO(t), dtype=object)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null object
float    1 non-null object
date     1 non-null object
str      1 non-null object
dtypes: object(4)
memory usage: 40.0+ bytes

Here we force the int column to str and tell read_csv to parse the date column using parse_dates:

In [6]:
pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null object
float    1 non-null float64
date     1 non-null datetime64[ns]
str      1 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 40.0+ bytes

Similarly, we could've passed the to_datetime function as a converter to parse the dates:

In [5]:
pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
int      1 non-null int64
float    1 non-null float64
date     1 non-null datetime64[ns]
str      1 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 40.0 bytes
EdChum
  • I should point out for others that both parameters can be provided in the same `read_csv` invocation, although I haven't tested for overlapping columns referenced in both `dtypes` and the `converters` dictionaries if using that API. – jxramos Jul 27 '18 at 21:58
  • Is there a way to make your own sniffer that converts dtypes? I feel this would be useful with excel files with many columns. – Maxim Sep 25 '18 at 16:23
  • That won't work; Excel sheets are imported using a 3rd-party module, so the dtypes are provided via that module. If it were a csv file then you could define your own function, pass it as a converter, and have it applied on every column; that would work – EdChum Sep 25 '18 at 16:49
  • 1
    In reply to @jxramos comment, ``converters`` takes precedence over ``dtypes`` in case the same column is referenced in both arguments, at least on pandas 1.3.1 – Lionel Hamayon Aug 24 '21 at 16:45
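A quick sketch checking that precedence (on recent pandas versions, read_csv also emits a ParserWarning saying only the converter will be used when both target the same column):

```python
import io
import warnings
import pandas as pd

t = "a\n001"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the "only the converter will be used" warning
    df = pd.read_csv(io.StringIO(t),
                     dtype={'a': 'int64'},
                     converters={'a': str})

print(df['a'][0])   # '001' – the converter ran, the dtype was ignored
```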

I would say that the main purpose of converters is to manipulate the values of a column, not the datatype. The answer shared by @EdChum focuses on the idea of dtypes and uses the pd.to_datetime function as a converter.

Within this article https://medium.com/analytics-vidhya/make-the-most-out-of-your-pandas-read-csv-1531c71893b5 , in the section about converters, you will see an example of changing a csv column with values such as "185 lbs." into one with the "lbs." stripped from the text. This is more the idea behind the read_csv converters parameter.

What the .csv looks like (if the image doesn't show up, please see the article): a csv file with 6 columns, where Weight is a column with entries like "145 lbs.".

# creating functions to clean the columns
w = lambda x: x.replace('lbs.', '')
r = lambda x: x.replace('"', '')
# using converters to apply the functions to the columns
fighter = pd.read_csv('raw_fighter_details.csv',
                      converters={'Weight': w, 'Reach': r},
                      header=0,
                      usecols=[0, 1, 2, 3])
fighter.head(15)

The DataFrame after using converters on the Weight column.
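If you also want the cleaned Weight column to be numeric rather than a string, the converter can strip the unit and cast in one step; a self-contained sketch, where the inline CSV stands in for the article's raw_fighter_details.csv:

```python
import io
import pandas as pd

# inline stand-in for the article's csv, with values like '155 lbs.'
t = "Weight\n155 lbs.\n185 lbs."

# strip the unit and convert to float in one converter
to_pounds = lambda x: float(x.replace('lbs.', ''))

fighter = pd.read_csv(io.StringIO(t), converters={'Weight': to_pounds})
print(fighter['Weight'].tolist())   # [155.0, 185.0]
```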

VISQL
  • Note that the `r` lambda function was used on the `Reach` column, and not on the Height column. – VISQL Dec 07 '21 at 15:15

We use converters to change the values of particular cells.

We can write a function for a particular column inside converters and it will run on every cell in that column.

Please see the example below; this is a DataFrame and we are trying to change the highlighted values.


import pandas as pd

def convert_people_cell(cell):
    if cell == 'n.a.':
        return 'Sam Walton'
    else:
        return cell   # if the cell value is not 'n.a.' it returns the original value of the cell

def convert_eps_cell(cell):
    if cell == 'not available':
        return None
    else:
        return cell

df = pd.read_csv('https://raw.githubusercontent.com/codebasics/py/master/pandas/4_read_write_to_excel/stock_data.csv',
                 converters={
                     'people': convert_people_cell,
                     'eps': convert_eps_cell,
                 })
df

The converters argument is like a Python dictionary: we can select particular columns and apply a particular function to each. For a column like people, it is going to call the function convert_people_cell for every single cell in that column. The DataFrame after using converters on the people column:

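A self-contained version of the same idea, with an inline CSV standing in for the remote file (the column names follow the example above):

```python
import io
import pandas as pd

t = """tickers,eps,revenue,people
WMT,4.61,484,n.a.
MSFT,not available,85,bill gates"""

def convert_people_cell(cell):
    return 'Sam Walton' if cell == 'n.a.' else cell

def convert_eps_cell(cell):
    # returning None turns the cell into a missing value
    return None if cell == 'not available' else cell

df = pd.read_csv(io.StringIO(t),
                 converters={'people': convert_people_cell,
                             'eps': convert_eps_cell})

print(df['people'].tolist())   # ['Sam Walton', 'bill gates']
```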