I'm new to Python and am trying to troubleshoot an error.

I have a dataframe (reprex):

import pandas as pd
    df
    Out[29]: 
            Id  ServiceSubCodeKey   PrintDate
    0  1895650                  2  2018-07-27
    1  1895650                  4  2018-08-13
    2  1896355                  2  2018-08-10
    3  1897675                  9  2018-08-13
    4  1897843                  2  2018-08-10
    5  2178737                  3  2019-06-14
    6  2178737                  4  2019-06-14
    7  2178737                  7  2019-06-14
    8  2178737                  1  2019-06-14
    9  2178750                699  2019-06-14

columns = (
    pd.get_dummies(df["ServiceSubCodeKey"])
    .reindex(range(df.ServiceSubCodeKey.min(),
        df.ServiceSubCodeKey.max()+1), axis=1, fill_value=0)
    # now it has all digits
    .astype(str)
    )
codes = pd.Series(
    [int(''.join(row)) for row in columns.itertuples(index=False)],
    index=df.index)

codes = (
    codes.groupby(df.Id).transform('sum').astype('str')
    .str.pad(width=columns.shape[1], fillchar='0')
    .str.rstrip('0') # this will remove trailing 0's
    )

print(codes)

df = df.assign(one_hot_ssc=codes)

OverflowError: int too large to convert to float

When I tried to troubleshoot it, I found that the error occurs in this part:

codes = pd.Series(
    [int(''.join(row)) for row in columns.itertuples(index=False)],
    index=df.index)

If I change the last service subcode from 699 to 60 or lower, the error goes away. Is there a solution to this error? I want it to work even for a 5-digit code, and I'm looking for a permanent solution.

  • https://stackoverflow.com/questions/16174399/overflowerror-long-int-too-large-to-convert-to-float-in-python – BENY Jul 29 '20 at 01:03
  • I suspect you're creating a number with 699 digits. – Barmar Jul 29 '20 at 01:04
  • Try printing the df without converting to `int()` and see what the numbers are. – Barmar Jul 29 '20 at 01:07
  • Python has no problem with numbers with 5 digits. Floating point maxes out at hundreds of digits. – Barmar Jul 29 '20 at 01:07
  • https://stackoverflow.com/questions/1835787/what-is-the-range-of-values-a-float-can-have-in-python – Barmar Jul 29 '20 at 01:08
  • @Barmar One issue is that the integers are too long to convert to float. However, there's no reason for any of the integers to be converted to float. For example, `pd.Series` tries to cast the integers as floats, as does summing the `list` of `ints`. Do you know why `pandas` keeps trying to cast the `ints` to `floats`? – Trenton McKinney Jul 29 '20 at 02:31
  • I think it's just the default dtype for numbers, but you can override it. I'm not really a pandas expert. – Barmar Jul 29 '20 at 04:06
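
The float limit raised in the comments above can be checked in plain Python, without pandas. This is a minimal sketch: the variable names are illustrative, and the 699-digit value only mimics the size of the joined one-hot row from the question.

```python
import sys

# A 699-digit integer, comparable in size to the joined one-hot row
# produced when ServiceSubCodeKey goes up to 699 (illustrative value).
big = int("1" + "0" * 698)

print(len(str(big)))        # number of decimal digits in the integer
print(sys.float_info.max)   # largest finite float, roughly 1.8e308

# Python ints have arbitrary precision, so building `big` is fine.
# Converting it to a float is what fails.
try:
    float(big)
except OverflowError as exc:
    message = str(exc)
    print(message)          # int too large to convert to float
```

A 5-digit key is therefore not a problem by itself. The trouble starts when a row of several hundred one-hot digits is turned into a single integer and something later asks for a float.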

1 Answer

  • The culprit seems to be that pandas tries to cast the values to floats.
    • `[int(''.join(row)) for row in columns.itertuples(index=False)]` works on its own, but passing the result to `pd.Series` does not.
    • I don't know why pandas tries to cast the ints to floats.
  • The workaround is to handle the numbers in a way that never gives pandas the opportunity to cast the ints to floats.
  • `dfg[0]` is a column in which each entry is a list of `int`.
  • The following code also works with 'ServiceSubCodeKey' equal to 99999:
import pandas as pd

# build the integer codes; `columns` and `df` are the objects defined in the question
codes_values = [int(''.join(r)) for r in columns.itertuples(index=False)]
codes = pd.Series({'test': codes_values}).explode()
codes.index = df.index

# groupby and aggregate the values into lists
dfg = codes.groupby(df.Id).agg(list).reset_index()

# sum the lists; doing this with a pandas function also does not work, so no .sum or .apply
summed_lists = list()
for r, v in dfg.iterrows():
    summed_lists.append(str(sum(v[0])))

# assign the list of strings to a column
dfg['sums'] = summed_lists

# perform the remainder of the functions on the sums column
dfg['final'] = dfg.sums.str.pad(width=columns.shape[1], fillchar='0').str.rstrip('0')

# display(dfg.final)
0                                                 0101
1                                                   01
2                                            000000001
3                                                   01
4                                              1011001
5    0000000000000000000000000000000000000000000000...
Name: final, dtype: object

# merge df and dfg.final
dfm = pd.merge(df, dfg[['Id', 'final']], on='Id')

# display(dfm)
        Id  ServiceSubCodeKey   PrintDate         final
0  1895650                  2  2018-07-27          0101
1  1895650                  4  2018-08-13          0101
2  1896355                  2  2018-08-10            01
3  1897675                  9  2018-08-13     000000001
4  1897843                  2  2018-08-10            01
5  2178737                  3  2019-06-14       1011001
6  2178737                  4  2019-06-14       1011001
7  2178737                  7  2019-06-14       1011001
8  2178737                  1  2019-06-14       1011001
9  2178750              99999  2019-06-14  ...000000001
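
A different route, sketched here as an alternative to the workaround above, is to never build the large integers at all and to assemble the flag string for each Id directly. This assumes that no Id repeats a ServiceSubCodeKey (the sum-based code would produce a digit `2` in that case, while this sketch would still write `1`). The helper name `flags` and the column name `one_hot_ssc` reuse or extend the question's naming and are illustrative; the sample frame re-creates the question's data without `PrintDate`.

```python
import pandas as pd

# Sample data re-created from the question (PrintDate omitted).
df = pd.DataFrame({
    "Id": [1895650, 1895650, 1896355, 1897675, 1897843,
           2178737, 2178737, 2178737, 2178737, 2178750],
    "ServiceSubCodeKey": [2, 4, 2, 9, 2, 3, 4, 7, 1, 699],
})

# Smallest key in the whole frame; position 0 of every string refers to it.
low = int(df["ServiceSubCodeKey"].min())


def flags(keys):
    """Hypothetical helper: '1' where a key is present, '0' elsewhere.

    The string stops at the largest key of the group, which matches the
    effect of stripping trailing zeros in the original code.
    """
    present = {int(k) for k in keys}
    return "".join("1" if k in present else "0"
                   for k in range(low, max(present) + 1))


# One string per Id, built with plain Python so no numeric cast can occur.
per_id = {id_: flags(group)
          for id_, group in df.groupby("Id")["ServiceSubCodeKey"]}

df["one_hot_ssc"] = df["Id"].map(per_id)
print(df)
```

Because every value stays a string, the length of the result grows with the largest key but nothing is ever converted to a float.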
Trenton McKinney