I'm trying to build a tool to simplify a research effort, and it seems I need to detect when one column of the data contains an incrementing sequence while another contains an ascending or descending one.
Is there a clean way to check for such sequences across rows without writing a state machine that iterates over them, like this one https://stackoverflow.com/a/52679427/5045375 ? Such code would have to check that the values in one column are incrementing (no gaps) while the other column is either ascending or descending (no gaps). I'm perfectly able to do this; I just wonder if there is something in my pandas toolbox that I'm missing.
Here are some examples to clarify my intent,
import pandas as pd
from collections import namedtuple
QUERY_SEGMENT_ID_COLUMN = 'Query Segment Id'
REFERENCE_SEGMENT_ID_COLUMN = 'Reference Segment Id'
def dataframe(data):
    columns = [QUERY_SEGMENT_ID_COLUMN, REFERENCE_SEGMENT_ID_COLUMN]
    return pd.DataFrame(data, columns=columns)
# No sequence in either column. No results
data_without_pattern = [[1, 2], [7, 0], [3, 6]]
# Sequence in first column, but no sequence in second column. No results
data_with_pseodo_pattern_query = [[1, 2], [2, 0], [3, 6]]
# Sequence in second column, but no sequence in first column. No results
data_with_pseudo_pattern_reference = [[1, 2], [7, 3], [3, 4]]
# Broken sequence in first column, sequence in second column. No results
data_with_pseudo_pattern_query_broken = [[1, 2], [3, 3], [7, 4]]
# Sequence occurs in both columns, asc. Expect results
data_with_pattern_asc = [[1, 2], [2, 3], [3, 4]]
# Sequence occurs in both columns, desc. Expect results
data_with_pattern_desc = [[1, 4], [2, 3], [3, 2]]
# There is a sequence, and some noise. Expect results
data_with_pattern_and_noise = [[1, 0], [1, 4], [1, 2], [1, 3], [2, 3], [3, 4]]
In the first example, there are no patterns whatsoever,
print(dataframe(data_without_pattern))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 7                     0
2                 3                     6
The second example has an ascending sequence of ids in the query column, but none in the reference column,
print(dataframe(data_with_pseodo_pattern_query))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 2                     0
2                 3                     6
The third example is the same as the previous one, but for the reference column.
print(dataframe(data_with_pseudo_pattern_reference))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 7                     3
2                 3                     4
Here, there is an ascending sequence in the reference column, and while the ids in the query column are also in ascending order, there are gaps, so no results are expected,
print(dataframe(data_with_pseudo_pattern_query_broken))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 3                     3
2                 7                     4
Here are two "perfect" examples, where the query column is in unbroken ascending order and the reference column is in ascending and descending order, respectively. Results are expected.
print(dataframe(data_with_pattern_asc))
   Query Segment Id  Reference Segment Id
0                 1                     2
1                 2                     3
2                 3                     4
print(dataframe(data_with_pattern_desc))
   Query Segment Id  Reference Segment Id
0                 1                     4
1                 2                     3
2                 3                     2
Finally, here is a case with less clean data where the desired pattern is nevertheless present,
print(dataframe(data_with_pattern_and_noise))
   Query Segment Id  Reference Segment Id
0                 1                     0
1                 1                     4
2                 1                     2
3                 1                     3
4                 2                     3
5                 3                     4
This last case might demand further explanation. My aim here would be to return something akin to q=(1, 3), r=(2, 4), i.e. the (start, end) values from the respective columns (not indexes).
I'm wondering whether a nice sequence of groupby operations could do this, but I can't see the wood for the trees.
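One idea in that direction (just a sketch of my own, not an established recipe) is the usual diff/cumsum run-labelling trick: flag every row that continues a run, then take the cumulative sum of the breaks as a group key for groupby. The helper name, the min_len parameter and the short column names below are placeholders of mine. Note it only finds runs of consecutive rows, so the noisy example above would first need its duplicate query ids filtered out.

```python
import pandas as pd

def contiguous_runs(df, qcol, rcol, direction=1, min_len=3):
    """Find maximal runs of consecutive rows where `qcol` increments by
    exactly 1 per row and `rcol` changes by exactly `direction` per row
    (+1 ascending, -1 descending). Returns a list of
    (q_start, q_end, r_start, r_end) tuples."""
    dq = df[qcol].diff()
    dr = df[rcol].diff()
    # a row "continues" a run if both deltas match the required steps
    ok = dq.eq(1) & dr.eq(direction)
    # every break starts a new group; cumsum turns breaks into group ids
    run_id = (~ok).cumsum()
    runs = []
    for _, run in df.groupby(run_id):
        if len(run) >= min_len:
            q, r = run[qcol], run[rcol]
            runs.append((q.iloc[0], q.iloc[-1], r.iloc[0], r.iloc[-1]))
    return runs
```

For the clean ascending example this yields [(1, 3, 2, 4)], and passing direction=-1 covers the descending case; whether this generalizes to the noisy data is exactly what I'm unsure about.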
I consider the question to have community value because I couldn't find a similar question where someone is looking for patterns across rows in multiple columns.
Edit: case from comments (@code-different)
For the dataframe,
data_with_multiple_contiguous_sequences = [[1, 1], [2, 2], [3, 3], [0, 4], [1, 5], [2, 6], [3, 7], [4, 8]]
   Query Segment Id  Reference Segment Id
0                 1                     1
1                 2                     2
2                 3                     3
3                 0                     4
4                 1                     5
5                 2                     6
6                 3                     7
7                 4                     8
The goal would be to identify both sequences, meaning we'd want to yield q1=(1, 3), r1=(1, 3), q2=(0, 4), r2=(4, 8).
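To make that concrete, the same break-labelling sketch from above (again just an illustration of mine, with the columns shortened to q and r) does separate these two runs, because the query id dropping from 3 to 0 at row 3 breaks the grouping:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1], [2, 2], [3, 3], [0, 4], [1, 5], [2, 6], [3, 7], [4, 8]],
    columns=['q', 'r'],  # shorthand for the two segment-id columns
)
# a row continues an ascending run if both ids step by exactly 1
ok = df['q'].diff().eq(1) & df['r'].diff().eq(1)
run_id = (~ok).cumsum()  # rows 0-2 share one group id, rows 3-7 another
bounds = [
    (g['q'].iloc[0], g['q'].iloc[-1], g['r'].iloc[0], g['r'].iloc[-1])
    for _, g in df.groupby(run_id)
]
# bounds now holds (q_start, q_end, r_start, r_end) per run,
# i.e. q1=(1, 3), r1=(1, 3) and q2=(0, 4), r2=(4, 8)
```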