I have written the following piece of code, which assigns tuples to segments. A segment is a container of tuples and spans a certain time interval, whereas a tuple has just one timestamp.

However, since my data has ~30 000 tuples and this step is repeated quite often, the program spends a lot of time in this method.

Is there a more efficient way to handle this?

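# Body of the method: give each row to the one segment whose interval contains it
for timestamp, row in tuples.iterrows():
    # Linear scan over all segments for every row
    this_seg = [s for s in segments if s.can_have(timestamp)]
    assert len(this_seg) <= 1  # segments are expected not to overlap
    for s in this_seg:
        s.append(row)
return segments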

Here is some more context:

A segment is a class of type Segment, and has a constructor as follows:

def __init__(self, ts_max, ts_min):
    self._df = pd.DataFrame({})
    self._ts_max = ts_max
    self._ts_min = ts_min

The method can_have checks whether the given timestamp could be part of the segment, i.e. whether the timestamp lies between ts_min and ts_max.

tuples is a pandas DataFrame, which has timestamps as its index and some other features as columns.
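For readers who want to reproduce the setup, here is a minimal self-contained sketch of it. Only the constructor comes from the question. The half-open bounds in `can_have`, the body of `append`, and the sample data are assumptions made for illustration and may differ from the real class.

```python
import pandas as pd


class Segment:
    """Illustrative stand-in for the Segment class described in the question."""

    def __init__(self, ts_max, ts_min):
        self._df = pd.DataFrame({})
        self._ts_max = ts_max
        self._ts_min = ts_min

    def can_have(self, timestamp):
        # Assumption: half-open interval [ts_min, ts_max), so that
        # adjacent segments never both claim a boundary timestamp.
        return self._ts_min <= timestamp < self._ts_max

    def append(self, row):
        # Assumption: each row (a Series named by its timestamp)
        # becomes one line of the segment's DataFrame.
        new_line = row.to_frame().T
        if self._df.empty:
            self._df = new_line
        else:
            self._df = pd.concat([self._df, new_line])


# Made-up sample data: three rows, two one-hour segments.
tuples = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.to_datetime(
        ["2018-12-17 00:10", "2018-12-17 01:20", "2018-12-17 01:40"]
    ),
)
segments = [
    Segment(ts_max=pd.Timestamp("2018-12-17 01:00"),
            ts_min=pd.Timestamp("2018-12-17 00:00")),
    Segment(ts_max=pd.Timestamp("2018-12-17 02:00"),
            ts_min=pd.Timestamp("2018-12-17 01:00")),
]

# Same loop as in the question, run on the sample data.
for timestamp, row in tuples.iterrows():
    this_seg = [s for s in segments if s.can_have(timestamp)]
    assert len(this_seg) <= 1
    for s in this_seg:
        s.append(row)

segment_sizes = [len(s._df) for s in segments]
```

With this sample data the first segment receives one row and the second receives two.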

Thiebout
  • What's a segment? Can you provide a [mcve]? – jpp Dec 17 '18 at 13:48
  • Same here: what is `segments`, what is `can_have`, and why is it applied to the index and not the value? The code is really unclear. – Uri Goren Dec 17 '18 at 13:50
  • @UriGoren I have updated my question with some more context. Please tell me if not clear. – Thiebout Dec 17 '18 at 14:04
  • @TomVerstraete, Looks like you have a class. Take the time to create a **minimal** and **self-contained** example. Your code doesn't run for any of us trying to replicate your issue. – jpp Dec 17 '18 at 14:07
  • I don't understand why you are using classes to iterate on date intervals. If you post a sample of your data and your desired output, I can help you further. I can recommend [this post](https://stackoverflow.com/questions/29370057/select-dataframe-rows-between-two-dates) and `searchsorted` method (see [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.searchsorted.html)) for something that is way faster than your lookup. `searchsorted` should work in your case, since given your `assert` statement, date intervals are disjoint. – Tarifazo Dec 17 '18 at 14:27

1 Answer

`iterrows` is one of the slowest ways to iterate over a DataFrame in pandas. It's not clear from your question what you're trying to do, but this tutorial offers several faster replacements for `iterrows`:

https://realpython.com/fast-flexible-pandas/
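As a rough illustration of what a replacement could look like here, the sketch below follows the `searchsorted` suggestion from the comments. It is not the asker's method: the function name `assign_to_segments` and the sample data are made up, and it assumes the segments are disjoint (which the `assert` in the question implies), sorted by start time, and half-open, `[start, end)`.

```python
import numpy as np
import pandas as pd


def assign_to_segments(tuples, starts, ends):
    """Return, for each row of `tuples`, the position of its segment, or -1.

    Assumptions: segments are disjoint, sorted by start time, and
    half-open, i.e. a timestamp t belongs to segment i when
    starts[i] <= t < ends[i].
    """
    ts = tuples.index.to_numpy().astype("datetime64[ns]")
    starts = pd.DatetimeIndex(starts).to_numpy().astype("datetime64[ns]")
    ends = pd.DatetimeIndex(ends).to_numpy().astype("datetime64[ns]")

    # Position of the last segment whose start is <= the timestamp
    # (-1 when the timestamp precedes every segment).
    idx = np.searchsorted(starts, ts, side="right") - 1

    # The candidate only counts if the timestamp is before its end.
    valid = (idx >= 0) & (ts < ends[np.clip(idx, 0, None)])
    return np.where(valid, idx, -1)


# Made-up sample data: the last row falls outside both segments.
tuples = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime(
        ["2018-12-17 00:10", "2018-12-17 01:20",
         "2018-12-17 01:40", "2018-12-17 03:00"]
    ),
)
starts = pd.to_datetime(["2018-12-17 00:00", "2018-12-17 01:00"])
ends = pd.to_datetime(["2018-12-17 01:00", "2018-12-17 02:00"])

seg_id = assign_to_segments(tuples, starts, ends)

# Hand each segment all of its rows in one go instead of row by row.
matched = seg_id >= 0
rows_per_segment = {
    int(i): chunk for i, chunk in tuples[matched].groupby(seg_id[matched])
}
```

This does one binary search per row instead of scanning every segment for every row, and each segment can then receive its rows as a single DataFrame rather than through repeated appends.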

Allen