
I am trying to bin a Pandas dataframe in Python 3 in order to have more efficient grouping over a large dataset. Currently the performance bottleneck is in iterating over the dataframe using the .apply() method.
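
For reference, here is roughly the kind of .apply() call I am using now (the column names and the bin count are just placeholders standing in for my real ones):

import pandas as pd

__NUM_BINS__ = 256  # placeholder bin count

# Toy version of the data: a column of hex strings
dataframe = pd.DataFrame({'val_to_bin': ['ff', '1a2b', 'ffffffffff']})

# Current approach: convert each hex string with int(x, 16), then bin with modulo
# (int(x, 16) accepts hex strings with or without a leading 0x)
dataframe['binned_value'] = dataframe['val_to_bin'].apply(lambda x: int(x, 16)) % __NUM_BINS__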

All entries within the column are in hex, so it seems like the pd.to_numeric function should do exactly what I want.

I've tried a variety of options, but so far nothing has worked.

# With errors='coerce' this sets every value to np.nan; with errors='raise' it fails with 'Unable to parse string'.
dataframe[bin] = pd.to_numeric(dataframe[to_bin], errors='coerce') % __NUM_BINS__ 

# Gives me "int() Cannot convert non-string with explicit base"
dataframe[bin] = int(dataframe[to_bin].astype(str), 16) % __NUM_BINS__

# ValueError: invalid literal for int() with base 10: 'ffffffffff'
dataframe[bin] = dataframe.astype(np.int64) % __NUM_BINS__ 
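
To make the failure modes concrete, here is a small standalone check (the values are made up, but they have the same shape as my real column):

import pandas as pd

s = pd.Series(['ff', '1a2b', 'ffffffffff'])  # toy version of the hex column

pd.to_numeric(s, errors='coerce')   # all NaN: to_numeric does not parse hex
# int(s, 16)                        # TypeError: int() needs a single string, not a Series
[int(x, 16) for x in s]             # [255, 6699, 1099511627775]: works element by element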

Any suggestions? This seems like something that people must have tackled in the past.

ZSmyth
  • Have you tried this solution to convert the column entries to string: http://stackoverflow.com/a/22231860/1607105? You may need to prepend the `0x` prefix after string conversion so that `int(x,16)` knows it has a HEX value too. – dblclik Sep 20 '16 at 15:58
  • I just tried converting to string and prepending 0x to the value, but still get the "int() Cannot convert non-string with explicit base" error. I checked my dataframe and the actual values are definitely strings (0xfffff as an example). – ZSmyth Sep 20 '16 at 16:48
  • You could still get some performance improvement by using a `list-comprehension` and creating the series from that. Like: `[int(x, 16) for x in ser.tolist()]` – Nickil Maveli Sep 20 '16 at 18:18
  • The list comprehension style does appear to have made some performance improvements. Any ideas on whether there is an even faster way to do this? It seems like something that should be parallelizable, much like adding a constant to columns. – ZSmyth Sep 23 '16 at 12:00

1 Answer


With some assists from the comments above: a faster way to accomplish this is to use a generator function. That way it can also deal with any values that cannot be converted from hex.

def bin_vals(lst):
    for item in lst:
        try:
            yield int(item, 16) % __NUM_BINS__
        except (TypeError, ValueError):
            yield __ERROR_BIN__  # whatever bin you store weird items in

Then in your conversion portion you would do the following:

dataframe['binned_value'] = list(bin_vals(dataframe['val_to_bin'].tolist()))

That gave a substantial speedup over iterating through each row, and it was also faster than the .apply() method I had used originally.
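
If you want to check the difference on your own data, this is roughly how I compared the two approaches. The data here is made up, __NUM_BINS__ and __ERROR_BIN__ are placeholders, bin_vals is the generator defined above, and the actual timings will depend on your dataset:

import random
import timeit

import pandas as pd

__NUM_BINS__ = 256    # placeholder bin count
__ERROR_BIN__ = -1    # placeholder bin for unparseable values, used by bin_vals

# A million made-up hex strings, roughly the shape of my data
values = ['%x' % random.getrandbits(40) for _ in range(1000000)]
df = pd.DataFrame({'val_to_bin': values})

apply_time = timeit.timeit(
    lambda: df['val_to_bin'].apply(lambda x: int(x, 16)) % __NUM_BINS__, number=3)
gen_time = timeit.timeit(
    lambda: list(bin_vals(df['val_to_bin'].tolist())), number=3)
print('apply: {:.2f}s, generator: {:.2f}s'.format(apply_time, gen_time))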

ZSmyth