
Numpy seems to make a distinction between the str and object types. For instance, I can do:

>>> import pandas as pd
>>> import numpy as np
>>> np.dtype(str)
dtype('S')
>>> np.dtype(object)
dtype('O')

Here dtype('S') and dtype('O') correspond to str and object, respectively.

However, pandas seems to lack that distinction and coerces str to object.

>>> df = pd.DataFrame({'a': np.arange(5)})
>>> df.a.dtype
dtype('int64')
>>> df.a.astype(str).dtype
dtype('O')
>>> df.a.astype(object).dtype
dtype('O')

Forcing the type to dtype('S') does not help either.

>>> df.a.astype(np.dtype(str)).dtype
dtype('O')
>>> df.a.astype(np.dtype('S')).dtype
dtype('O')

Is there any explanation for this behavior?

Meitham
    As a very brief explanation that isn't a full answer: If you use a string dtype in `numpy`, it's fundamentally a fixed-width c-like string. In `pandas`, they're "normal" python strings, thus the object type. – Joe Kington Jan 19 '16 at 15:55
    This might address your question - http://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object - basically pandas stores an object ndarray, not strings in an ndarray. However, I agree that they could have been clearer when it comes to distinguishing types - for example, having the ability to distinguish 'str' from 'mixed' columns, which are also reported as 'O'. – Sereger Jan 19 '16 at 16:00

2 Answers


Numpy's string dtypes aren't python strings.

Therefore, pandas deliberately uses native python strings, which require an object dtype.

First off, let me demonstrate a bit of what I mean by numpy's strings being different:

In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now, x is an array with a numpy string dtype (fixed-width, c-like strings), and y is an array of native python strings.
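
You can see the fixed width directly in the dtypes (a small aside; 8 here is just the pointer size on a 64-bit build, since the python strings live outside the array):

x.dtype.itemsize   # 7 -- every element occupies exactly 7 bytes inside the array
y.dtype.itemsize   # 8 -- only a pointer is stored per element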

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype versions will be truncated:

In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
      dtype='|S7')

While the object dtype versions can be arbitrary length:

In [6]: y[1] = 'a really really really long'

In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)

Next, the |S dtype strings can't hold unicode properly, though there is a fixed-length unicode string dtype (|U) as well; the quick sketch below shows the difference.
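
On Python 3, for instance, the bytes-backed |S dtype rejects non-ascii input outright, while |U stores it fine (a minimal sketch; the exact error text depends on your numpy version):

np.array(['café'], dtype='|S4')   # raises UnicodeEncodeError -- |S holds raw bytes
np.array(['café'], dtype='|U4')   # fine: array(['café'], dtype='<U4')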

Finally, numpy's strings are actually mutable, while Python strings are not. For example:

In [8]: z = x.view(np.uint8)   # reinterpret x's buffer as raw bytes (no copy)
In [9]: z += 1                 # increment every byte in place, mutating x
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
      dtype='|S7')

For all of these reasons, pandas chose not to ever allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a python string into a fixed-width numpy string won't work in pandas. Instead, it always uses native python strings, which behave in a more intuitive way for most users.

Joe Kington

Read on if you arrived here looking to read about the difference between the 'string' and object dtypes in pandas. As of pandas 1.5.3, there are three main differences between the two dtypes.

1. Null handling

object dtype can store not only strings but also mixed data types, so if you want to cast the values into strings, astype(str) is the prescribed method. However, this casts all values into strings; even NaNs become literal 'nan' strings. 'string' is a nullable dtype, so casting as 'string' preserves NaNs as null values.

x = pd.Series(['a', float('nan'), 1], dtype=object)
x.astype(str).tolist()          # ['a', 'nan', '1']
x.astype('string').tolist()     # ['a', <NA>, '1']

A consequence of this is that string operations (e.g. counting characters, comparison) performed on object dtype columns return numpy.int or numpy.bool etc., whereas the same operations performed on 'string' dtype columns return nullable pd.Int64 or pd.Boolean dtypes. In particular, NaN comparisons return False (because NaN is not equal to any value) when performed on object dtype, while pd.NA remains pd.NA when performed on 'string' dtype.

x = pd.Series(['a', float('nan'), 'b'], dtype=object)
x == 'a'

0     True
1    False
2    False
dtype: bool
    
    
y = pd.Series(['a', float('nan'), 'b'], dtype='string')
y == 'a'

0     True
1     <NA>
2    False
dtype: boolean
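
The same asymmetry shows up when counting characters (a quick sketch reusing x and y from above; the NaN forces the object-dtype result to float):

x.str.len()                     # [1.0, nan, 1.0], numpy float64
y.str.len()                     # [1, <NA>, 1], nullable Int64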

So with the 'string' dtype, null handling is more flexible, because you can call fillna() etc. to handle null values however you want to.¹
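
For example, the <NA>s from the comparison above can be resolved explicitly (a minimal sketch):

(y == 'a').fillna(False).tolist()   # [True, False, False]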

2. string dtype is clearer

If a pandas column is object dtype, values in it can be replaced with anything. For example, a string in it can be replaced by an integer, and that's OK (e.g. x below). That might have unwanted consequences later if you expect every value in it to be a string. The string dtype does not have that problem, because a string can only be replaced by another string (e.g. y below).

x = pd.Series(['a', 'b'], dtype=str)
y = pd.Series(['a', 'b'], dtype='string')
x[1] = 3                        # OK
y[1] = 3                        # ValueError
y[1] = '3'                      # OK

This has the advantage that you can use select_dtypes() to select only string columns. In other words, with object dtype there is no way to identify string columns, but with the 'string' dtype there is.

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [[1], [2,3], [4,5]]}).astype({'A': 'string'})
df.select_dtypes('string')      # only selects the string column


    A
0   a
1   b
2   c



df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [[1], [2,3], [4,5]]})
df.select_dtypes('object')      # selects the mixed dtype column as well


    A   B
0   a   [1]
1   b   [2, 3]
2   c   [4, 5]

3. Memory efficiency

The 'string' dtype has storage options (python and pyarrow), and if the strings are short, pyarrow is very memory-efficient. Look at the following example:

lst = np.random.default_rng().integers(1000000, size=1000).astype(str).tolist()

x = pd.Series(lst, dtype=object)
y = pd.Series(lst, dtype='string[pyarrow]')
x.memory_usage(deep=True)       # 63041
y.memory_usage(deep=True)       # 10041

As you can see, if the strings are short (at most 6 characters in the example above), pyarrow consumes over 6 times less memory. However, as the following example shows, if the strings are long, there's barely any difference.

z = x * 1000
w = (y.astype(str) * 1000).astype('string[pyarrow]')
z.memory_usage(deep=True)       # 5970128
w.memory_usage(deep=True)       # 5917128
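
An existing object column can also be converted with a plain astype (assuming pyarrow is installed):

x_pa = x.astype('string[pyarrow]')  # convert the object-dtype series from above
x_pa.dtype                          # string[pyarrow]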

¹ Similar intuition already exists for str.contains and str.match, for example.

x = pd.Series(['a', float('nan'), 'b'], dtype=object)
x.str.match('a', na=np.nan)

0     True
1      NaN
2    False
dtype: object
cottontail