With pandas >= 1.0 there is now a dedicated string datatype.
You can convert a DataFrame (or a single column) to this pandas string datatype using .astype('string'):
df = df.astype('string')
This is different from using str, which sets the pandas 'object' datatype:
df = df.astype(str)
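If only one column should be converted rather than the whole DataFrame, the same call works on a single column. The following is a minimal sketch; the DataFrame and its column names (`zipcode`, `visits`) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"zipcode": ["90210", "90211"], "visits": [3, 5]})

# Convert just one column to the dedicated string dtype.
df["zipcode"] = df["zipcode"].astype("string")

# Check the result programmatically instead of reading df.info().
is_string = isinstance(df["zipcode"].dtype, pd.StringDtype)
print(df.dtypes)
```

The other columns keep their original dtypes, so `visits` stays an integer column.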
You can see the difference in datatypes when you look at the info of the dataframe:
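import pandas as pd

df = pd.DataFrame({
    'zipcode_str': [90210, 90211],
    'zipcode_string': [90210, 90211],
})
# str gives the generic 'object' dtype
df['zipcode_str'] = df['zipcode_str'].astype(str)
# 'string' gives the dedicated pandas string dtype
df['zipcode_string'] = df['zipcode_str'].astype('string')
df.info()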
# you can see that the first column has dtype object
# while the second column has the new dtype string
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   zipcode_str     2 non-null      object
 1   zipcode_string  2 non-null      string
dtypes: object(1), string(1)
From the docs:
The 'string' extension type solves several issues with object-dtype
NumPy arrays:
1) You can accidentally store a mixture of strings and non-strings in an
object dtype array. A StringArray can only store strings.
2) object dtype breaks dtype-specific operations like
DataFrame.select_dtypes(). There isn’t a clear way to select just text
while excluding non-text, but still object-dtype columns.
3) When reading code, the contents of an object dtype array is less clear
than string.
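Points 1) and 2) from the docs can be illustrated with a small sketch. The DataFrame below is hypothetical (the column names `zipcode`, `mixed`, and `visits` are made up for this example), and it assumes a pandas version in which `select_dtypes` accepts `'string'` as a selector:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "zipcode": pd.array(["90210", "90211"], dtype="string"),
    "mixed": pd.Series(["a", 1], dtype=object),  # object dtype accepts anything
    "visits": [3, 5],
})

# Point 1: the object column silently holds a str and an int side by side.
mixed_types = [type(value).__name__ for value in df["mixed"]]

# Point 2: selecting on 'string' picks up the text column only,
# without dragging in other object-dtype columns.
string_cols = list(df.select_dtypes(include="string").columns)

print(mixed_types)
print(string_cols)
```

Selecting on `'object'` instead would return the `mixed` column, whether or not it actually contains text.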
Information about pandas 1.0 can be found here:
https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html