
Explanation of my issue

Take this DataFrame for example:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.array([
        ['A', '1'],
        ['B', '2'],
        ['C', 'False'],
    ])
)

Is there a good way to appropriately set the second column's element type to either float or boolean?

I am simply given that DataFrame, where all values are initially strings. In reality, I have tons of rows and each DataFrame is different, so the columns that need to be set to floats and bools change. Therefore, I cannot create a default dtype 'template' to refer to.

Solutions I've Explored

  • pandas does have the `pd.to_numeric()` function, but with `errors='coerce'` it turns every non-numeric value (like 'False') into NaN, so this doesn't work. `df.astype()` has a similar issue.
  • I could loop through each element, trying casts until one sticks, but this isn't elegant, so I feel like there's a better way.
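To illustrate the first point, here is a quick sketch of what `errors='coerce'` does to the boolean-like string in the example column:

```python
import pandas as pd

s = pd.Series(['1', '2', 'False'])

# coerce turns anything non-numeric into NaN, silently losing the boolean
print(pd.to_numeric(s, errors='coerce').tolist())  # [1.0, 2.0, nan]
```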

Summary

Essentially, given a series object where elements are initially of type string, I need to cast the appropriate elements to either type float or bool. Is there an elegant way of doing this without looping through each element and casting either float or bool? Is there a pandas function that I'm missing?

Thanks in advance for any help!

  • I'm not sure what you are expecting; your question would be helped by providing the exact expected output given the example input (thanks for that, though). Typically series have *a single data type*. If you want to be able to use either float or bool, you'd probably have to use `dtype=object`. At that point, there's nothing really inelegant about looping through the items and trying each to see which sticks. – juanpa.arrivillaga Mar 02 '22 at 00:14
  • Assuming I understand your task correctly, the most error-proof way of accomplishing the task is to write a custom function and `apply` it to the series. The series can be of dtype `object` but unlike numpy arrays, the elements within the series can be mixed type. You could try to chain together calls to `df['col']=df['col'].astype(xxx, errors='ignore')` but with mixed types that will get ugly quickly – G. Anderson Mar 02 '22 at 00:23
  • @juanpa.arrivillaga I was not aware that series had to be all one type, that would explain that pandas does not have a function that would solve this issue. My expected output would then be a series object of numpy.dtypes of either floats or booleans (ie '1.2' -> float, 'False' -> bool). Not sure if my terminology here is correct but I hope you understand my intentions. In that case, I would need to iterate through the elements to create the correct numpy.dtype – Benjamin Fogiel Mar 02 '22 at 00:23
  • @BenjaminFogiel no, I do not understand. Again, this is why it is really helpful if you *provide exactly the output you mean*. Again, you say you want "a series object of numpy.dtypes of either floats or booleans" but series *don't have dtypes*. They have *a single dtype*. It just sounds like you are asking for the exact same thing again. – juanpa.arrivillaga Mar 02 '22 at 00:28
  • @G.Anderson I like that thought, however, it did not work assuming I understood you correctly. I tried applying ```df[1] = df.apply(float, errors='ignore').apply(bool, errors='ignore')``` to the df I provided above, however, it converted everything to type boolean which makes sense given that Series cannot be of multiple types as @juanpa.arrivillaga points out – Benjamin Fogiel Mar 02 '22 at 00:32
  • There seems to be a little confusion. A _series_ will have a single dtype. Within a series, the individual elements can have _different_ dtypes. However, when a series contain mixed types, the `dtype` of the _series_ will be `object`. Because of this, vectorized methods will not give you much performance boost compared to just iterating, which is what I believe @juanpa.arrivillaga was originally trying to convey. – G. Anderson Mar 02 '22 at 00:38
  • 1
    The second point of confusion was, I believe, when I said to apply a custom function. I meant, as in the question [How to apply a custom function to a pandas dataframe row by row](https://stackoverflow.com/questions/40353519/how-to-apply-custom-function-to-pandas-data-frame-for-each-row) where you actually define a function with arguments, then apply that function to the elements in the series – G. Anderson Mar 02 '22 at 00:41
  • Thank you @G.Anderson and @juanpa.arrivillaga, I have updated the post to reflect a working solution – Benjamin Fogiel Mar 02 '22 at 00:59
  • 1
    I'm glad you were able to find a solution @BenjaminFogiel Please use the answer field below your question to "Post Your Answer". This lets the system know that this question has been answered in a way that editing the solution into the question body cannot. – Henry Ecker Mar 02 '22 at 03:31
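As the comments note, a Series holding mixed Python types falls back to dtype `object`, while each element keeps its own type. A quick check:

```python
import pandas as pd

# Mixing a float and a bool gives the *series* dtype object,
# but the individual elements remain a float and a bool
s = pd.Series([1.0, False])
print(s.dtype)                   # object
print(type(s[0]), type(s[1]))
```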

1 Answer


Solution

Here is what worked:

given:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.array([
        ['A', '1'],
        ['B', '2'],
        ['C', 'False'],
    ])
)

Assuming that all boolean variables within the DataFrame will be exactly "False" or "True", and all other values are valid floats, we can apply a lambda over the column's elements to cast types. (Note that `bool('False')` is `True` because any non-empty string is truthy, so the string has to be compared against 'True' instead of passed to `bool()`.)

df[1] = df[1].apply(lambda v: v == 'True' if v in ('True', 'False') else float(v))

which results in the desired output:

>>> type(df[1][0])
<class 'float'>
>>> type(df[1][1])
<class 'float'>
>>> type(df[1][2])
<class 'bool'>
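If a column might also contain strings that are neither 'True'/'False' nor valid floats, a slightly more defensive variant (a sketch; the helper name `coerce` is my own, not from the original post) wraps the float cast in try/except and leaves unparseable values unchanged:

```python
import numpy as np
import pandas as pd

def coerce(v):
    """Cast 'True'/'False' to bool, numeric strings to float,
    and leave anything else unchanged."""
    if v in ('True', 'False'):
        return v == 'True'
    try:
        return float(v)
    except (TypeError, ValueError):
        return v

df = pd.DataFrame(data=np.array([['A', '1'], ['B', '2'], ['C', 'False']]))
df[1] = df[1].apply(coerce)
print([type(x).__name__ for x in df[1]])  # ['float', 'float', 'bool']
```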