
I'm using Pandas to read a bunch of CSVs. I'm passing a dict to the dtype parameter to tell pandas which columns to read as strings instead of the default:

dtype_dic = {'service_id': str, 'end_date': str, ...}
feedArray = pd.read_csv(feedfile, dtype=dtype_dic)

In my scenario, all the columns except a few specific ones are to be read as strings. So instead of defining several columns as str in dtype_dic, I'd like to set just my chosen few as int or float. Is there a way to do that?

It's a loop cycling through various CSVs with differing columns, so a direct column conversion after reading the whole csv as strings (dtype=str) would not be easy, as I would not immediately know which columns that csv has. (I'd rather spend that effort defining all the columns in the dtype dict!)

Edit: if there's a way to process the list of column names to be converted to numbers without erroring out when a column isn't present in that csv, then yes, that would be a valid solution, if there's no other way to do this at the csv-reading stage itself.

Note: this sounds like a previously asked question, but the answers there went down a very different path (bool-related) that doesn't apply to this question. Please don't mark as duplicate!

Nikhil VJ

4 Answers


For Pandas 1.5.0+, there's an easy way to do this. If you use a defaultdict instead of a normal dict for the dtype argument, any columns which aren't explicitly listed in the dictionary will use the default as their type. E.g.

from collections import defaultdict
types = defaultdict(lambda: str, A="int", B="float")  # the factory is called to produce the default, so use lambda: str (a plain defaultdict(str, ...) factory would yield "", which is not a valid dtype)
df = pd.read_csv("/path/to/file.csv", dtype=types, keep_default_na=False)

(I haven't tested this, but I assume you still need keep_default_na=False)
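As a quick self-contained sanity check of the idea, here is a sketch assuming pandas >= 1.5, with inline data via StringIO and placeholder column names:

from collections import defaultdict
from io import StringIO
import pandas as pd

csv_data = "A,B,C,D\n1,2.5,x,\n3,4.5,y,z\n"
types = defaultdict(lambda: str, A="int", B="float")
df = pd.read_csv(StringIO(csv_data), dtype=types, keep_default_na=False)

print(df.dtypes)         # A is an integer dtype, B float64, C and D object (str)
print(df["D"].tolist())  # ['', 'z'] -- the blank field stays a string, not NaN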


For older versions of Pandas:

You can read the entire csv as strings then convert your desired columns to other types afterwards like this:

df = pd.read_csv('/path/to/file.csv', dtype=str, keep_default_na=False)
# example df; yours will be from pd.read_csv() above
df = pd.DataFrame({'A': ['1', '3', '5'], 'B': ['2', '4', '6'], 'C': ['x', 'y', 'z']})
types_dict = {'A': int, 'B': float}
for col, col_type in types_dict.items():
    df[col] = df[col].astype(col_type)

keep_default_na=False is necessary if some of the columns contain empty strings or values like NA, which pandas converts to NaN of type float by default; that would leave you with a mixed str/float dtype in those columns.
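To see that pitfall concretely, a minimal sketch with inline data:

from io import StringIO
import pandas as pd

data = "A,B\nx,\ny,NA\n"
print(pd.read_csv(StringIO(data), dtype=str)["B"].tolist())
# [nan, nan]  <- floats, despite dtype=str
print(pd.read_csv(StringIO(data), dtype=str, keep_default_na=False)["B"].tolist())
# ['', 'NA']  <- genuinely strings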

Another approach, if you really want to specify the proper types for all columns when reading the file in and not change them afterwards: read in just the column names (no rows), then use those to fill in which columns should be strings:

col_names = pd.read_csv('file.csv', nrows=0).columns
types_dict = {'A': int, 'B': float}
types_dict.update({col: str for col in col_names if col not in types_dict})
pd.read_csv('file.csv', dtype=types_dict)
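For the question's looping scenario, where each csv may contain a different subset of columns, the same trick adapts by building the dict from the columns each file actually has, so an absent column can never cause an error. A sketch (the file list and column names are hypothetical):

import pandas as pd

numeric_types = {'A': int, 'B': float}  # the chosen few, applied only when present

frames = []
for feedfile in ['feed1.csv', 'feed2.csv']:  # hypothetical file list
    col_names = pd.read_csv(feedfile, nrows=0).columns
    # default every column to str, overriding only the numeric ones this file has
    types_dict = {col: numeric_types.get(col, str) for col in col_names}
    frames.append(pd.read_csv(feedfile, dtype=types_dict, keep_default_na=False))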
Nathan
  • 1. This will error out if the said cols aren't present in that CSV. Pls see the question. 2. I want to by default cast ALL cols as string, except some chosen ones. Pls see the question. – Nikhil VJ Apr 06 '18 at 03:57
  • awesome! Sorry I didn't see your update back then.. funny I thought I'd get some alert if anything changed. I particularly like the second approach.. best of both worlds. – Nikhil VJ Jul 04 '18 at 12:36
  • This wouldn't work when you want to specify a decimal separator in the read_csv function. It will cast these numbers as str with the wrong decimal separator and thereafter you will not be able to convert it to float directly. – Michael H. Sep 04 '19 at 10:31
  • You can do `str.replace` first to convert to "." as a decimal separator or write your own type-casting function instead of using `float`/`int` (e.g. `lambda x: float(x.replace(",", "."))`). The latter is probably slower – Nathan Sep 04 '19 at 18:45
  • Actually, if you're using the second approach here, I don't see any reason that specifying a decimal separator wouldn't work directly; the above comment only matters for the first approach used. – Nathan Oct 15 '19 at 15:30
  • Just noticed that for the 1.5 pandas release: "Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed." – nerff Sep 30 '22 at 21:51
  • Careful, the first solution will output empty columns as `NA` floats rather than `str` despite explicit `dtype=str`. Adding `keep_default_na=False` fixes this, so that everything is _really_ read in as str first. – Cornelius Roemer Feb 07 '23 at 16:36

I recently encountered the same issue, though I only have one csv file so I don't need to loop over files. I think this solution can be adapted into a loop as well.

Here I present a solution I used. Pandas' read_csv has a parameter called converters which overrides dtype, so you may take advantage of this feature.

An example: assume that our data.csv file contains all float64 columns except A and B, which are string columns. You can read this file using:

df = pd.read_csv('data.csv', dtype='float64', converters={'A': str, 'B': str})

Pandas warns that the converters override dtype for the two columns A and B, and the result is as desired.

Regarding looping over several csv files, all one needs to do is figure out which columns will be exceptions to put in converters. This is easy if the files share a similar pattern of column names; otherwise it gets tedious.
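One way to sketch that step (file and column names are placeholders): peek at the header first, then build converters only from the string columns the file actually contains, so you never reference a column the file doesn't have.

import pandas as pd

string_cols = {'A', 'B'}  # columns to keep as strings, when present

header = pd.read_csv('data.csv', nrows=0).columns
converters = {col: str for col in header if col in string_cols}
df = pd.read_csv('data.csv', dtype='float64', converters=converters)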

MECoskun
  • you can specify just converters for one or more columns, without specifying dtype for other columns. It is very useful when you have just several columns you need to specify format for, and you don't want to specify format for all columns as in the answers above. – Sergey Zaitsev Jun 10 '21 at 18:59

You can do the following:

pd.read_csv(self._LOCAL_FILE_PATH,
            index_col=0,
            encoding="utf-8",
            dtype={'customer_id': 'int32',
                   'product_id': 'int32',
                   'subcategory_id': 'int16',
                   'category_id': 'int16',
                   'gender': 'int8',
                   'views': 'int8',
                   'purchased': 'int8',
                   'added': 'int8',
                   'time_on_page': 'float16'})
Shady Smaoui

Extending @MECoskun's answer using converters while simultaneously stripping leading and trailing whitespace, making converters more versatile:

df = pd.read_csv('data.csv', dtype='float64', converters={'A': str.strip, 'B': str.strip})

There are also lstrip and rstrip that can be used instead of strip if needed. Note: pass strip itself, not strip(), since a converter must be a callable rather than the result of calling one. And of course, only strip columns that contain strings.
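Since a converter is just any callable, it can also combine cleanup with a cast. A small sketch with hypothetical columns, where B holds numbers padded with spaces:

import pandas as pd

df = pd.read_csv('data.csv', converters={
    'A': str.strip,                            # keep as string, just trimmed
    'B': lambda x: float(x.strip() or 'nan'),  # trim, then cast; blank fields become NaN
})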

silicon23