1

I am relatively new to Python and I am currently facing some problems in implementing a conceptually simple algorithm in an efficient way. I have been able to do it in pandas (but it is quite slow to execute).

I have a ndarray composed by n rows and 3 columns:

--------------------
  "A"  |  1  |  12
--------------------
  "B"  |  2  |  34
--------------------
  "S"  |  3  |  1
--------------------
  "B"  |  4  |  145
--------------------
  "A"  |  5  |  132
--------------------
  "B"  |  6  |  234
--------------------
  "E"  |  7  |  1
--------------------
  "B"  |  8  |  15
--------------------

The first column represents an id, the second column a timestamp and the third a value. I have to filter the ndarray taking only the rows with timestamps included between the id "S" (Start) timestamp and the id "E" (End) timestamp. It is possible o have more than one pair of "S" and "E" in the same ndarray. In case of non-consecutive "S" and "E" pair, I need the shortest subarray. In other words, no id "S" or "E" should appear in output.

So the output should be:

--------------------
  "B"  |  4  |  145
--------------------
  "A"  |  5  |  132
--------------------
  "B"  |  6  |  234
--------------------

As already said, I have obtained this result using pandas, but the function is really long and complicated and it executes really slowly. So I am sure that with numpy it is possible to obtain a better and most efficient algorithm.

Do you have any ideas?

Thanks in advance.

EDIT

Here it is a code snippet which obtains expected results using pandas. The execution time is about 0.015 seconds on a "Intel Core i7-6820 @ 2.70GHz" processor.

df = pd.DataFrame({'id': ['A', 'B', 'C', 'S', 'C', 'C', 'A', 'B', 'E', 'A', 'C', 'B', 'B', 'S', 'C', 'A', 'E', 'B'],
                   't': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18],
                   'v': [145, 543, 12, 1, 14, 553, 65, 657, 1, 32, 54, 22, 11, 1, 6, 22, 1, 4]})
print(df)
res = pd.DataFrame()
id = "id"
t = "t"
v = "v"
id_bit_start = "S"
id_bit_end = "E"

# taking only "S" and "E" from df (when their value is 1)
df_data_bit = df.loc[
    ((df[id] == id_bit_start) |
     (df[id] == id_bit_end)) &
    (df[v] == 1.0)
    ]

# do something only if at least one "S" is present
if id_bit_start in df_data_bit[id].values:
    # creating empty list of time windows
    t_windows = list()
    t_wind_temp = [None, None]

    # for each bit "S" or "E"
    for index, bit in df_data_bit.iterrows():
        # if it is a "S"
        if bit[id] == id_bit_start:
            # set the start of the time window
            t_wind_temp[0] = bit[t]
        # if it is a "E" and the "S" has already been processed
        elif t_wind_temp[0] is not None:
            # set the end of the time window
            t_wind_temp[1] = bit[t]
            # append the current time window to our list
            t_windows.append(t_wind_temp)
            # reset the current time window
            t_wind_temp = [None, None]

    # taking everything but "S" and "E"
    df_data = df.loc[
        ~((df[id] == id_bit_start) |
          (df[id] == id_bit_end))
    ]

    # for each created time window
    for t_window in t_windows:
        # take only data with timestamps between the time window
        result = df_data.loc[
            (df_data[t] >= t_window[0])
            &
            (df_data[t] <= t_window[1])
            ]
        # append to the final result
        res = pd.concat([res, result])

print(res)
Ehsan
  • 12,072
  • 2
  • 20
  • 33
Luigi
  • 13
  • 3
  • 1
    u mind sharing a code to generate the numpy, or maybe a glimpse of what u did in pandas, as pandas is quite good when it comes to date manipulation – sammywemmy May 28 '20 at 09:19
  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on how to ask a good question may also be useful. – yatu May 28 '20 at 09:41
  • Thanks for your tips, I have updated my qustion – Luigi May 28 '20 at 10:35
  • @Luigi Do you want to S and E pairs be consecutive pairs? – Ehsan May 28 '20 at 10:42
  • @Ehsan I have the certainity that the data are ordered by ascending timestamp but I cannot know if S and E will always be consecutive, sometimes could happen that i have S, then a missing E, and then another S and E (that is why there are some check in the "iterrows()" part of the code) – Luigi May 28 '20 at 10:45
  • @Luigi Thank you for clarifying. And in the case you mentioned (S-S-E), do you want the shorter S-E window or longer S-E window? – Ehsan May 28 '20 at 10:46
  • @Luigi Please find my solution to your question. regardless of if you want shorter or longer S-Es, you can easily change that to fit your desire. The solution assumes smallest S-E pairs. – Ehsan May 28 '20 at 11:26

1 Answers1

0

This takes care of your uncertainty of consecutiveness of S and E:
Assuming your timestamps are in ascending order:

import re

a = df.to_records(index=False)
idx = [m.span() for m in re.finditer('S[^{SE}]*?E', ''.join(a['id']))]
indexer = np.r_[tuple([np.s_[i+1:j-1] for (i,j) in idx])]
a_filtered = a[indexer]

Explanation:

There are some tricks in fast calculating this:

  1. convert your dataframe to a structured array
  2. convert all id characters into a string
  3. look for non-greedy matches of S.*?E (note you can change S and E to any substrings if your id is not single letter)
  4. get the indices of start and end of substrings you find
  5. create a list of all indices between those indices in 4
  6. filter your array using the indices in 5
Ehsan
  • 12,072
  • 2
  • 20
  • 33
  • Thank you for clarifying the problem. I just saw the update. Will update the answer accordingly. – Ehsan May 28 '20 at 10:39
  • yes timestamp could have gaps, but always ordered in ascending way. – Luigi May 28 '20 at 11:26
  • modifying the dataframe with 2 consecutive "S" this does not work :( – Luigi May 28 '20 at 11:35
  • @Luigi my bad, this assumes the longer S-E pair. Did you want shorter one? – Ehsan May 28 '20 at 11:36
  • In case of 2 consecutive S it is OK to have the longer one (while, in case of 2 consecutive E, should be the shorter one) but in the filtered data not S nor E should appear. – Luigi May 28 '20 at 11:39
  • Your statement is in contradict with your desired output. if in case of 2 cons S, longer is allowed, then output will have S. I think what you are looking for is shortest S-E. fixing regex for it should be easy. – Ehsan May 28 '20 at 11:41
  • @Luigi please checkout the edit. I suggest in your future postings, clearly state all concerns of question at the beginning. Thank you. I tested on both S-S-E and S-E-E pairs and it works. – Ehsan May 28 '20 at 11:45