
I have a dataframe of shape (2061, 5) and the following line:

df[6] = df.groupby(df.index)[6].transform(lambda x: ' '.join(x))

..causes the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-27721ddd8064> in <module>
----> 1 df.groupby(df.index)[6].transform(lambda x: ' '.join(x))

~/.pyenv/versions/miniconda3-latest/lib/python3.7/site-packages/pandas/core/groupby/generic.py in transform(self, func, *args, **kwargs)
    463 
    464         if not isinstance(func, str):
--> 465             return self._transform_general(func, *args, **kwargs)
    466 
    467         elif func not in base.transform_kernel_whitelist:

~/.pyenv/versions/miniconda3-latest/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
    487         for name, group in self:
    488             object.__setattr__(group, "name", name)
--> 489             res = func(group, *args, **kwargs)
    490 
    491             if isinstance(res, (ABCDataFrame, ABCSeries)):

<ipython-input-19-27721ddd8064> in <lambda>(x)
----> 1 df.groupby(df.index)[6].transform(lambda x: ' '.join(x))

TypeError: sequence item 0: expected str instance, float found

I developed that code on a subset of the dataframe and it seemed to be doing exactly what I wanted to the data. So now if I for example do this:

df = df.head(50)

..and run the code, the error message goes away again.

I think a type cast is happening somewhere, except that on one of the rows pandas decides to do something else. How can I efficiently find which row in the df is causing this, without manually reading through the whole two-thousand-row column or resorting to trial and error with .head() of different sizes?

cardamom
  • TypeError: sequence item 0: expected str instance, **float found** ... `df[6] = df.groupby(level=0)[6].transform(lambda x: ' '.join(str(x)))`? or `df[6] = df[6].astype(str).groupby(level=0).transform(' '.join)` – ansev Apr 01 '20 at 20:17
  • `.join(str(x))` seems to prevent the error; I noticed that earlier and should have mentioned it. `.astype(str)` does not fix it. I'm not sure what your `level=0` in the groupby is supposed to do; isn't that just for a MultiIndex frame? Why doesn't pandas say in its error message which row it tripped on without the cast to string, and how can you extract that? – cardamom Apr 01 '20 at 20:42
  • You can check whether a Series holds strings manually: `df[6].map(type)==str`. Or check your whole dataframe: `df.applymap(type)==str`. `level` always works, and here it groups by the only index the frame has. I'm not sure why `Series.astype` doesn't work – ansev Apr 01 '20 at 20:55
  • Sorry, your `Series.astype` does fix it; I just tried more carefully. I will probably use `df[6] = df[6].astype(str)` one line earlier for clarity. – cardamom Apr 01 '20 at 21:41
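
The cast-first approach settled on in the comments above can be sketched as follows. The toy data and index values are made up for illustration; only the `astype(str)` and `groupby(level=0)` calls come from the thread.

```python
import pandas as pd

# Hypothetical toy data: one float hiding among the strings in column 6.
df = pd.DataFrame(
    {6: ["a", "b", "c", 3.5]},
    index=["01.05.2017", "01.05.2017", "04.05.2017", "04.05.2017"],
)

# Cast once, up front, so every element handed to join is already a str.
df[6] = df[6].astype(str)

# level=0 groups by the only index level, the same grouping as groupby(df.index) here.
df[6] = df.groupby(level=0)[6].transform(" ".join)
print(df)
```

Note that the cast hides the offending rows rather than locating them, and depending on the pandas version a missing value may end up as the literal text `nan`, so it is worth inspecting the column before relying on this.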

1 Answer


EDITED: Build a mask that keeps only the rows where the column in question holds a float, then check the index of the first match, i.e.:

mask = df['column_in_q'].apply(lambda x: type(x) == float)
# This returns a Boolean Series that can be used to keep only the True rows
float_df = df[mask]  # Subset of the DataFrame that meets the condition
print(float_df.head())  # Inspect the offending rows and their index
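
As a minimal, self-contained sketch of the same diagnostic, assuming column `6` is meant to hold only strings (the toy data below is invented for illustration), one can flag every value that is not a `str` instead of testing for `float` specifically:

```python
import pandas as pd

# Hypothetical toy data: column 6 is mostly strings, with two floats mixed in.
df = pd.DataFrame(
    {6: ["a", "b", 3.5, "c", 7.0]},
    index=["01.05.2017", "01.05.2017", "04.05.2017", "04.05.2017", "06.02.2018"],
)

# Map each value to its Python type, then compare the whole Series at once.
not_str = df[6].map(type) != str

# Rows that would make ' '.join(...) raise a TypeError.
offenders = df[not_str]
print(offenders)

# Integer positions of those rows, handy when the index contains duplicates.
positions = [i for i, flag in enumerate(not_str) if flag]
print(positions)
```

Testing for "not a string" rather than "is a float" should also catch other stray types, such as integers or `None`.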

I think this is because the Groupby method returns a groupby object, not a dataframe. You have to specify aggregation methods, which you could then subset. That is:

df[6] = df.groupby(df.index).sum()[6].transform(lambda x: ' '.join(x))

See here for more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

whege
  • The thing is, the code works perfectly on the first 50 or however many lines as I wrote, so don't think it can be a fundamental thing like this – cardamom Apr 01 '20 at 20:27
  • What kind of data is in the dataframe? An example would be helpful. The TypeError suggests that it's trying to do a string join but encounters a float. In your lambda function, try casting x to a string first. – whege Apr 01 '20 at 20:33
  • The index looks like this: `Index(['01.05.2017', '04.05.2017'... '04.02.2018', '06.02.2018'], dtype='object', name=0, length=253)` Yes, casting to string first seems to fix it, but really the question was to find which row is causing it to fail **without** a cast to string. If I put `.join(str(x))` it goes away but the output looks subtly different, would prefer to know more finely where it is tripping up. – cardamom Apr 01 '20 at 20:39
  • Ah I see, I misunderstood. I don't know if there is a way to explicitly check this in the groupby, but one way would be to mask your dataframe on dtype and keep only floats in the column in question, then check the index of that first row. – whege Apr 01 '20 at 20:40
  • 'mask your dataframe on dtype and keep only floats in the column in question' sounds good as a diagnostic - how do you do that? Is probably the most efficient way. Can you add that to your answer.. – cardamom Apr 01 '20 at 20:45
  • It is better to use `apply(type)` and then compare the entire Series, because `apply` is a very slow pandas method and doing the comparison inside it slows the function down unnecessarily. See https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code – ansev Apr 01 '20 at 20:59
  • @LiamFiddler thanks, with your answer, just changing 'float' to 'str' in the mask, I could establish that there were 2 out of 2061 points where the mask was `False`, places where a row contained a string of only numbers, with no characters or commas. So that answers it. – cardamom Apr 01 '20 at 21:34
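
On the remark in the comments that `' '.join(str(x))` silences the error but gives subtly different output: a likely explanation is that `str(x)` converts the whole group Series to its printed representation, and `join` then iterates over the characters of that one string. A small sketch with an invented group:

```python
import pandas as pd

# Hypothetical group, standing in for what transform passes to the lambda.
group = pd.Series(["a", 3.5], index=["01.05.2017", "01.05.2017"])

# str(group) is the printed representation of the whole Series as one string.
# Joining a string iterates over its characters, so no TypeError is raised,
# but the result is that representation with a space between every character.
wrong = " ".join(str(group))

# Converting each element separately keeps the intended values.
right = " ".join(str(v) for v in group)

print(repr(wrong[:20]))
print(right)
```

If the cast has to live inside the lambda, `lambda x: ' '.join(map(str, x))` is the per-element form.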