6

I'm trying to build a tool that'll help simplify a research effort and seemingly would need to detect when I have incrementing sequences in the data in one column and asc/desc sequences in another.

Is there a clean way to check if there are sequences in the rows without having to write a state-machine that iterates over the rows like this one https://stackoverflow.com/a/52679427/5045375 ? Writing such a snippet of code one would have to check if the values in the one column are incrementing (no gaps) and the other is either asc/desc (no gaps). I'm perfectly able to do this, I just wonder if there is something in my pandas toolbox that I'm missing.

Here are some examples to clarify my intent,

import pandas as pd 
from collections import namedtuple

QUERY_SEGMENT_ID_COLUMN = 'Query Segment Id'
REFERENCE_SEGMENT_ID_COLUMN = 'Reference Segment Id'

def dataframe(data):
    columns = [QUERY_SEGMENT_ID_COLUMN, REFERENCE_SEGMENT_ID_COLUMN]
    return pd.DataFrame(data, columns=columns)

# No sequence in either column. No results
data_without_pattern = [[1, 2], [7, 0], [3, 6]]

# Sequence in first column, but no sequence in second column. No results
data_with_pseodo_pattern_query = [[1, 2], [2, 0], [3, 6]]

# Sequence in second column, but no sequence in first column. No results
data_with_pseudo_pattern_reference = [[1, 2], [7, 3], [3, 4]]

# Broken sequence in first column, sequence in second column. No results
data_with_pseudo_pattern_query_broken = [[1, 2], [3, 3], [7, 4]]

# Sequence occurs in both columns, asc. Expect results
data_with_pattern_asc = [[1, 2], [2, 3], [3, 4]]

# Sequence occurs in both columns, desc. Expect results
data_with_pattern_desc = [[1, 4], [2, 3], [3, 2]]

# There is a sequence, and some noise. Expect results
data_with_pattern_and_noise = [[1, 0], [1, 4], [1, 2], [1, 3], [2, 3], [3, 4]]

In the first example, there are no patterns whatsoever,

print(dataframe(data_without_pattern))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 7                     0
2                 3                     6

The second example has a ascending sequence of ids in the query column, but none in the reference column,

print(dataframe(data_with_pseodo_pattern_query))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 2                     0
2                 3                     6

The third example is the same as the previous one, but for the reference column.

print(dataframe(data_with_pseudo_pattern_reference))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 7                     3
2                 3                     4

Here, there is a ascending sequence in the reference column and while the ids in the query column are also in ascending order there are gaps and so no results are expected

print(dataframe(data_with_pseudo_pattern_query_broken))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 3                     3
2                 7                     4

Here are two "perfect" examples, where the query column is in ascending order, unbroken and the reference column is in descending and ascending order respectively. Results are expected.

print(dataframe(data_with_pattern_asc))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 2                     3
2                 3                     4

print(dataframe(data_with_pattern_desc))
   Query Segment Id  Reference Segment Id
0                 1                     4
1                 2                     3
2                 3                     2

And ultimately, a case with less clean data but nevertheless the desired patterns are there

print(dataframe(data_with_pattern_and_noise))
   Query Segment Id  Reference Segment Id
0                 1                     0
1                 1                     4
2                 1                     2
3                 1                     3
4                 2                     3
5                 3                     4

This latest case might demand further explanation. My aim here would be to return something akin to q=(1, 3), r=(2, 4), e.g. the (start, end) values from the respective columns (not indexes).

I'm thinking if there is a possibility to do a nice sequence of group_by operations but I'm not seeing the woods for the trees.

I consider the question to have community value because I couldn't find a similar question where someone is looking for patterns in rows in multiple columns.

Edit: case from comments (@code-different)

For the dataframe,

data_with_multiple_contiguous_sequences = [[1, 1], [2, 2], [3, 3], [0, 4], [1, 5], [2, 6], [3, 7], [4, 8]]

   Query Segment Id  Reference Segment Id
0                 1                     1
1                 2                     2
2                 3                     3
3                 0                     4
4                 1                     5
5                 2                     6
6                 3                     7
7                 4                     8

The goal would be to identifying both sequences. Meaning we'd want to yield q1=(1, 3), r1=(1, 3), q2=(0, 4), r2=(4, 8).

Code Different
  • 90,614
  • 16
  • 144
  • 163
Filip Allberg
  • 3,941
  • 3
  • 20
  • 37
  • In the last example, what do `q=(1, 3)` and `r=(2, 4)` represent? The star and end of the increasing/decreasing range for each column? And what if there are multiple contiguous sequences in the same column, like `1 2 3 0 1 2 3 4`, what should the return value be in this case? – Code Different Apr 12 '20 at 19:08
  • @code-different Good questions. I've updated the question to clarify that the tuples are indeed the start and end of the ranges. I'm not sure I understand your second example. I'd say it depends on the content in the other column. But I'll edit the question to include that. – Filip Allberg Apr 13 '20 at 05:51

1 Answers1

3

If I understand you correctly, your problem is a variant of the island-and-gap problem. Every monotonic (increasing or decreasing) subsequence with a acceptable gap will form an island. For example, given a series s:

s   island
--  ------
0   1
0   1
1   1
3   2        # gap > 1, form new island
4   2
2   3        # stop increasing, form new island
1   3 
0   3

To generalize: whenever the gap between the current and previous row is outside of the [-1, 1] range, a new island is formed.

Applying this gap-and-island algorithm on Query Segment Id and Reference Segment Id:

Query Segment Id  Q Island    Reference Segment Id  R Island    Q-R Intersection
----------------  --------    --------------------  --------    ----------------
1                 1           1                     1           (1, 1)
2                 1           2                     1           (1, 1)
3                 1           3                     1           (1, 1)
0                 2           4                     1           (2, 1)
1                 2           5                     1           (2, 1)
2                 2           6                     1           (2, 1)
3                 2           7                     1           (2, 1)
4                 2           8                     1           (2, 1)
0                 3           9                     1           (3, 1)

The q and r ranges you are looking for are now the Query Segment Id and Reference Segment Id at the beginning and end of each Q-R Intersection. One final caveat though: ignore intersection of length 1 (like the last intersection).


Code:

columns = ['Query Segment Id', 'Reference Segment Id']
df = pd.DataFrame(data_with_multiple_contiguous_sequences, columns=columns)

def get_island(col):
    return (~col.diff().between(-1,1)).cumsum()

df[['Q Island', 'R Island']] = df[['Query Segment Id', 'Reference Segment Id']].apply(get_island)

result = df.groupby(['Q Island', 'R Island']) \
            .agg(**{
                'Q Start': ('Query Segment Id', 'first'),
                'Q End': ('Query Segment Id', 'last'),
                'R Start': ('Reference Segment Id', 'first'),
                'R End': ('Reference Segment Id', 'last'),
                'Count': ('Query Segment Id', 'count')
            }) \
            .replace({'Count': 1}, {'Count': np.nan}) \
            .dropna()
result['Q'] = result[['Q Start', 'Q End']].apply(tuple, axis=1)
result['R'] = result[['R Start', 'R End']].apply(tuple, axis=1)

Result:

                   Q Start  Q End  R Start  R End  Count       Q       R
Q Island R Island                                                       
1        1               1      3        1      3      3  (1, 3)  (1, 3)
2        1               0      4        4      8      5  (0, 4)  (4, 8)
Code Different
  • 90,614
  • 16
  • 144
  • 163
  • Beautiful, and correct solution. Thank you. This "unblocks" me. I can now further process the islands to tell the story that the data is telling. – Filip Allberg Apr 21 '20 at 08:57