
I'm trying to convert a UInt8 pandas series into the new StringDtype.

I can do the following, covered in this question, which predates the new string dtype:

import pandas as pd
int_series = pd.Series(range(20), dtype="UInt8")
obj_series = int_series.apply(str)

Which gives me a series of Object dtype containing strings.

But if I try to convert the series to the new string dtype, I get an error:

>>> string_series = int_series.astype("string")
...
TypeError: data type not understood

Note that first converting the series to object dtype and then to string dtype works:

int_series.apply(str).astype("string")

How can I convert the int series to string directly?

I'm using pandas version 1.0.3 on Python 3.7.6


Update: I've found this open issue on the pandas GitHub page that describes the exact same problem.

A comment in the issue above points to another open issue which covers the desired but currently not available functionality of converting between different ExtensionArray types.

So the answer is that the direct conversion cannot be done now, but likely will be possible in the future.

foglerit
  • I always thought that `pandas` only has `object` as dtype for string values. Interesting. – Quang Hoang Apr 11 '20 at 13:58
  • according to [this doc](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html), `int_series.astype('string')` should work, yet it doesn't. – Quang Hoang Apr 11 '20 at 14:04
  • @QuangHoang: Yes, the `string` dtype is new in version 1.0.0 – foglerit Apr 11 '20 at 14:07
  • If you don't use "UInt8" but regular int, the error is more explicit: `ValueError: StringArray requires a sequence of strings or pandas.NA` – Ben.T Apr 11 '20 at 14:09

1 Answer


This is explained in the docs, in the example section:

Unlike object dtype arrays, StringArray doesn’t allow non-string values

Where the following example is shown:

pd.array(['1', 1], dtype="string")

Traceback (most recent call last):
...
ValueError: StringArray requires an object-dtype ndarray of strings.

The only solution seems to be casting to Object dtype as you were doing and then to string.
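The two-step route can be wrapped in a small helper. This is only a sketch: `to_string_dtype` is a name made up here, not a pandas function. It also assumes you want missing values to stay missing, whereas I'd expect `apply(str)` to turn them into the literal text `<NA>`.

```python
import pandas as pd


def to_string_dtype(series: pd.Series) -> pd.Series:
    """Hypothetical helper: reach the "string" dtype via an object-dtype intermediate."""
    # Step 1: cast to object dtype, which can hold arbitrary Python objects.
    as_object = series.astype(object)
    # Step 2: turn every non-missing value into a str; na_action="ignore"
    # skips missing values instead of stringifying them.
    as_text = as_object.map(str, na_action="ignore")
    # Step 3: the values are now only str or missing, which is what the
    # string dtype validates for.
    return as_text.astype("string")


int_series = pd.Series([1, None, 3], dtype="UInt8")
string_series = to_string_dtype(int_series)
print(string_series.dtype)
print(string_series.tolist())
```

As far as I know, later pandas releases (1.1 and up) accept `int_series.astype("string")` directly, so check your version before reaching for a helper like this.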

This is also clearly stated in the source code of StringArray, where right at the top you'll see the warning:

   .. warning::
       Currently, this expects an object-dtype ndarray
       where the elements are Python strings or :attr:`pandas.NA`.
       This may change without warning in the future. Use
       :meth:`pandas.array` with ``dtype="string"`` for a stable way of
       creating a `StringArray` from any sequence.

If you check the validation step in _validate, you'll see how this will fail for arrays of non-strings:

def _validate(self):
    """Validate that we only store NA or strings."""
    if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
        raise ValueError("StringArray requires a sequence of strings or pandas.NA")
    if self._ndarray.dtype != "object":
        raise ValueError(
            "StringArray requires a sequence of strings or pandas.NA. Got "
            f"'{self._ndarray.dtype}' dtype instead."
        )

For the example in the question:

import numpy as np
from pandas._libs import lib

lib.is_string_array(np.array(range(20)), skipna=True)
# False
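For contrast, here is the same check on an integer array and on an object array holding the same values as strings. Keep in mind that `pandas._libs.lib` is a private module, so this is a sketch for poking around and the function may move or change between versions. The variable names are just for illustration.

```python
import numpy as np
from pandas._libs import lib  # private pandas module, not a stable API

# Integer ndarray, like the values behind the series in the question
ints = np.array(range(20))
# The same values as Python strings in an object-dtype ndarray
strs = np.array([str(i) for i in range(20)], dtype=object)

ints_ok = lib.is_string_array(ints, skipna=True)
strs_ok = lib.is_string_array(strs, skipna=True)
print(ints_ok, strs_ok)
```

Only the second array passes the check that `_validate` relies on, which matches going through object dtype first.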
yatu
  • I understand this point for array creation, not conversion. This `pd.array([1, "2"], dtype="UInt8")` fails but this succeeds `pd.array([1, "2"], dtype="object").astype("UInt8")`. So although `UInt8` does not accept a string, it can still convert a string using `astype` – foglerit Apr 11 '20 at 14:20
  • Yes, because previously you've cast to object. And afaik casting to another dtype with `astype` is the same as creating a dataframe or series anew, note that it creates a copy @foglerit – yatu Apr 11 '20 at 14:26
  • It seems there's a bug. The warning you quote is for the `values` parameter for the `StringArray` `__init__` method. It states that this should work `pd.array(int_series, dtype="string")` but it does not. – foglerit Apr 11 '20 at 14:27
  • In any case, your answer is very insightful. I'll open an issue on GitHub and accept your answer when they confirm your point. – foglerit Apr 11 '20 at 14:29
  • Where is this stated? The warning seems quite clear in that non-string values cannot be directly converted @foglerit – yatu Apr 11 '20 at 14:29
  • You may be right. I interpreted the statement: Use `:meth:pandas.array` with `dtype="string"` for a stable way of creating a `StringArray` from *any* sequence as in opposition to using the class constructor, and that the "any array" implied the ability to convert types. – foglerit Apr 11 '20 at 14:32