I have a dataframe similar to the following, which we'll call "df":

id    value    time
a      1        1
a      1.5      2
a      2        3
a      2.5      4
b      1        1
b      1.5      2
b      2        3
b      2.5      4

I am running various regressions by "id" in Python on this dataframe. Generally, this requires a grouping by "id" and then applying a function to those groupings that calculates the regression.

I am working with two similar regression techniques in SciPy's stats library:

  1. Theil-Sen estimator:

    (https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.mstats.theilslopes.html)

  2. Siegel estimator:

    (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.siegelslopes.html).

Both of these take the same type of input. Therefore, the function that calculates them should be the same apart from the estimator used.

For Theil-Sen, I wrote the following function and the groupby statement that applies it:

def theil_reg(df, xcol, ycol):
   model = stats.theilslopes(ycol,xcol)
   return pd.Series(model)

out = df.groupby('id').apply(theil_reg, xcol='time', ycol='value')

However, I get the following error, which I've been having the hardest time understanding how to address:

ValueError: could not convert string to float: 'time'

The actual variable time is a numpy float object, so it isn't a string. This makes me believe that the stats.theilslopes function is not recognizing that time is a column in the dataframe and is instead using 'time' as a string input into the function.

However, if that's the case, then this seems to be a bug in the stats.theilslopes function, and would need to be addressed by SciPy. I believe this because the exact same function as above, but using stats.siegelslopes instead, works perfectly fine and gives the output I'm expecting, and the two are essentially the same estimation with the same inputs.

Doing the following on Siegel:

def siegel_reg(df, xcol, ycol):
   model = stats.siegelslopes(ycol,xcol)
   return pd.Series(model)

out = df.groupby('id').apply(siegel_reg, xcol='time',ycol='value')

Does not create any errors about the time variable and conducts the regression as needed.

Does anyone have thoughts on whether I'm missing something here? If so I would appreciate any thoughts, or if not, any thoughts on how to address this with Scipy.

Edit: here is the full error message that shows up when I run this script:

ValueError Traceback (most recent call last)
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
    688 try:
--> 689 result = self._python_apply_general(f)
    690 except Exception:

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
    706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 

C:\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(g)
    678                     with np.errstate(all='ignore'):
--> 679                         return func(g, *args, **kwargs)
    680             else:

<ipython-input-506-0a1696f0aecd> in theil_reg(df, xcol, ycol)
      1 def theil_reg(df, xcol, ycol):
----> 2     model = stats.theilslopes(ycol,xcol)
      3     return pd.Series(model)

C:\Anaconda\lib\site-packages\scipy\stats\_stats_mstats_common.py in theilslopes(y, x, alpha)
    221     else:
--> 222         x = np.array(x, dtype=float).flatten()
    223         if len(x) != len(y):

ValueError: could not convert string to float: 'time'

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
<ipython-input-507-9a199e0ce924> in <module>
----> 1 df_accel_correct.groupby('chart').apply(theil_reg, xcol='time', ycol='value')

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
    699 
    700                 with _group_selection_context(self):
--> 701                     return self._python_apply_general(f)
    702 
    703         return result

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
    705     def _python_apply_general(self, f):
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 
    709         return self._wrap_applied_output(

C:\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
    188             # group might be modified
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):
    192                 mutated = True

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(g)
    677                 def f(g):
    678                     with np.errstate(all='ignore'):
--> 679                         return func(g, *args, **kwargs)
    680             else:
    681                 raise ValueError('func must be a callable if args or '

<ipython-input-506-0a1696f0aecd> in theil_reg(df, xcol, ycol)
      1 def theil_reg(df, xcol, ycol):
----> 2     model = stats.theilslopes(ycol,xcol)
      3     return pd.Series(model)

C:\Anaconda\lib\site-packages\scipy\stats\_stats_mstats_common.py in theilslopes(y, x, alpha)
    220         x = np.arange(len(y), dtype=float)
    221     else:
--> 222         x = np.array(x, dtype=float).flatten()
    223         if len(x) != len(y):
    224             raise ValueError("Incompatible lengths ! (%s<>%s)" % (len(y), len(x)))

ValueError: could not convert string to float: 'time'

Update 2: after referencing df inside the function (i.e. passing df[ycol] and df[xcol]), I received the following error message:

ValueError                                Traceback (most recent call last)
C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
    688             try:
--> 689                 result = self._python_apply_general(f)
    690             except Exception:

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 

C:\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(g)
    678                     with np.errstate(all='ignore'):
--> 679                         return func(g, *args, **kwargs)
    680             else:

<ipython-input-563-5db69048f347> in theil_reg(df, xcol, ycol)
      1 def theil_reg(df, xcol, ycol):
----> 2     model = stats.theilslopes(df[ycol],df[xcol])
      3     return pd.Series(model)

C:\Anaconda\lib\site-packages\scipy\stats\_stats_mstats_common.py in theilslopes(y, x, alpha)
    248     sigma = np.sqrt(sigsq)
--> 249     Ru = min(int(np.round((nt - z*sigma)/2.)), len(slopes)-1)
    250     Rl = max(int(np.round((nt + z*sigma)/2.)) - 1, 0)

ValueError: cannot convert float NaN to integer

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-564-d7794bd1d495> in <module>
----> 1 correct_theil = df_accel_correct.groupby('chart').apply(theil_reg, xcol='time', ycol='value')

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
    699 
    700                 with _group_selection_context(self):
--> 701                     return self._python_apply_general(f)
    702 
    703         return result

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f)
    705     def _python_apply_general(self, f):
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708 
    709         return self._wrap_applied_output(

C:\Anaconda\lib\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
    188             # group might be modified
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):
    192                 mutated = True

C:\Anaconda\lib\site-packages\pandas\core\groupby\groupby.py in f(g)
    677                 def f(g):
    678                     with np.errstate(all='ignore'):
--> 679                         return func(g, *args, **kwargs)
    680             else:
    681                 raise ValueError('func must be a callable if args or '

<ipython-input-563-5db69048f347> in theil_reg(df, xcol, ycol)
       1 def theil_reg(df, xcol, ycol):
 ----> 2     model = stats.theilslopes(df[ycol],df[xcol])
       3     return pd.Series(model)

C:\Anaconda\lib\site-packages\scipy\stats\_stats_mstats_common.py in theilslopes(y, x, alpha)
    247     # Find the confidence interval indices in `slopes`
    248     sigma = np.sqrt(sigsq)
--> 249     Ru = min(int(np.round((nt - z*sigma)/2.)), len(slopes)-1)
    250     Rl = max(int(np.round((nt + z*sigma)/2.)) - 1, 0)
    251     delta = slopes[[Rl, Ru]]

ValueError: cannot convert float NaN to integer

However, I have no null values in either column, and both columns are floats. Any suggestions on this error?

CSlater
  • What happens when you change the column name from `time` to `foo-bar`? Does it run? – MattR Jun 07 '19 at 19:13
  • @MattR it has the same error: ValueError: could not convert string to float: 'foo-bar' – CSlater Jun 07 '19 at 19:28
  • then at least we know that it isn't passing `time` as a keyword :) – MattR Jun 07 '19 at 19:29
  • Yes and no (I believe) - I tried just passing foo-bar into the function without changing the column name, and it raised that error. It also raised the same error when I changed the column name to foo-bar – CSlater Jun 07 '19 at 19:34
  • Whenever you report a Python error, include the *complete* traceback (i.e. the complete error message) in the question. There is useful information in there, including exactly which line triggered the error. – Warren Weckesser Jun 08 '19 at 18:28
  • In general, scipy functions are not designed to handle Pandas objects (i.e. `DataFrame` or `Series` objects). It might work, but usually it is safer to pass numpy arrays. You can get a numpy array from a Pandas object by using the `.values` attribute. – Warren Weckesser Jun 08 '19 at 18:30
  • One more note: it would be much easier for someone to help you if you provided a [minimal and reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). Can you create a small, self-contained example that anyone can just copy and run (without editing) to reproduce the problem? – Warren Weckesser Jun 08 '19 at 18:33
  • @WarrenWeckesser I appreciate the feedback. I noticed that scipy functions tend to handle numpy arrays better, but am still noticing the same error when I try passing numpy arrays. I have included the full error message in my original post, at the bottom under the "Edit" section. Please let me know if that provides better context for you. – CSlater Jun 10 '19 at 13:27

1 Answer

Essentially, you are passing the column names as strings (not the column values) into the functions, but the slopes functions require numpy arrays (or pandas Series that can be coerced into arrays). Specifically, your call is equivalent to the following, with no reference to df, hence your error:

model = stats.theilslopes('value', 'time')

Simply reference df in the calls:

model = stats.theilslopes(df['value'], df['time'])

model = stats.theilslopes(df[ycol], df[xcol])
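
Putting this together, a minimal self-contained sketch of the corrected group-wise call might look like the following. It uses the sample data from the question and converts each column to a NumPy array with `.to_numpy()`, in line with the suggestion in the comments to pass arrays to SciPy. The output labels (`slope`, `intercept`, `low_slope`, `high_slope`) are names chosen here for illustration, and selecting the two columns before `apply` is an assumption made so the sketch behaves the same across pandas versions.

```python
import pandas as pd
from scipy import stats

# Sample data from the question
df = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 4,
    "value": [1.0, 1.5, 2.0, 2.5] * 2,
    "time": [1.0, 2.0, 3.0, 4.0] * 2,
})

def theil_reg(group, xcol, ycol):
    # xcol and ycol are column *names*; index the group to get the data
    y = group[ycol].to_numpy(dtype=float)
    x = group[xcol].to_numpy(dtype=float)
    # theilslopes takes y first, then x
    slope, intercept, low_slope, high_slope = stats.theilslopes(y, x)
    return pd.Series({
        "slope": slope,
        "intercept": intercept,
        "low_slope": low_slope,
        "high_slope": high_slope,
    })

# One row of estimates per id
out = df.groupby("id")[["time", "value"]].apply(
    theil_reg, xcol="time", ycol="value"
)
print(out)
```

For the sample data, every pair of points in a group has the same slope, so both groups should come out with identical estimates.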

Different behavior between two functions does not mean there is a bug in SciPy; they run different implementations. Read the docs carefully to see how each one should be called. Possibly, the other function you refer to accepts a data argument in the call, with the strings naming columns of that data, like below:

slopes_call(y='y_string', x='x_string', data=df)

In general, Python requires explicit references to objects and does not infer them from context: a string such as 'time' stays a string unless you use it to look up a column.

Parfait
  • Thanks @Parfait. I tried this solution but received another error, which can be found in an update to my question above under "Update 2". Any thoughts on what's causing this? There are no NaN values in either column and they're both floats. – CSlater Jun 10 '19 at 16:05
  • With your posted data, I do not receive such an error. So issue may be data-specific. Please post enough to [reproduce](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) your error. – Parfait Jun 10 '19 at 16:33
  • Honestly, I'm not entirely sure how to create a reproducible example here. I've run this on a subset of the data, and it worked. So it seems that at some point, theilslopes cannot be calculated based on the data available and it throws an error as shown in Update 2. I've been trying to figure out times in which this error is thrown, and none of them seem to be relevant to my current issue (as they mostly relate to having NaN values in the dataset). So I guess if you have any advice on troubleshooting that error, that would be helpful for me. – CSlater Jun 10 '19 at 18:56
  • Hallelujah I figured out a solution. I used a try-except framework in the function and passed situations in which the ValueError arose, and it turns out that those situations should be excluded from the analysis anyways. Since you led me down the path to this solution, I'll give you creds for the right answer. Thanks @Parfait ! – CSlater Jun 10 '19 at 19:31
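
The try-except approach described in the last comment could be sketched roughly as follows. This is not the code actually used in the thread, which was never posted; the function name `safe_theil_reg`, the `FIELDS` labels, and the `estimator` parameter are hypothetical and introduced only for illustration. The sketch returns NaNs for any group where the estimator raises a `ValueError`, so those groups can be dropped afterwards.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Labels chosen for illustration
FIELDS = ["slope", "intercept", "low_slope", "high_slope"]

def safe_theil_reg(group, xcol, ycol, estimator=stats.theilslopes):
    """Fit one group; return NaNs if the estimator raises ValueError.

    `estimator` is a hypothetical hook and must return four values,
    as stats.theilslopes does.
    """
    y = group[ycol].to_numpy(dtype=float)
    x = group[xcol].to_numpy(dtype=float)
    try:
        slope, intercept, low_slope, high_slope = estimator(y, x)
    except ValueError:
        # Mark this group as unusable instead of aborting the whole apply
        return pd.Series(np.nan, index=FIELDS)
    return pd.Series([slope, intercept, low_slope, high_slope], index=FIELDS)

# Sample data from the question
df = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 4,
    "value": [1.0, 1.5, 2.0, 2.5] * 2,
    "time": [1.0, 2.0, 3.0, 4.0] * 2,
})

out = df.groupby("id")[["time", "value"]].apply(
    safe_theil_reg, xcol="time", ycol="value"
)
# Drop groups for which the fit failed
out = out.dropna(how="all")
print(out)
```

Catching only `ValueError` keeps unrelated problems visible. It is still worth inspecting the skipped groups, since a silent skip can hide a data issue.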