
I have the following DataFrame (1.2 million rows):

df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})

Now I am trying to find sequences. Each "beginn" should be matched with the first "end" whose distance, based on column B, is at least 40. For the provided DataFrame that would mean: (image omitted)

The problem is that ... Your help is highly appreciated.

user8495738
  • What is your expected output? Indexes? What have you tried/investigated and why didn't that fulfill your requirements? – Jondiedoop Oct 06 '18 at 12:46
  • Is there only begin and end in the dataframe or something else? Should extra begin/end entries be ignored? Please post what you have tried already – 576i Oct 06 '18 at 12:57
  • Group number 2 only has a distance of 20, something is wrong with your example. – 576i Oct 06 '18 at 12:59
  • To be honest, I am absolutely clueless. This is part of a project which is due today, and I have been working for around 12 h (programming the UI etc.), so my head is pretty empty. – user8495738 Oct 06 '18 at 13:00

1 Answer


I will assume that the output you want is a list of sequences, each with its starting and ending value. The second sequence that you identify in your picture has a distance lower than 40, so I also assumed that was an error.

import pandas as pd
from collections import namedtuple
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})

sequence_list = []
Sequence = namedtuple('Sequence', ['beginn', 'end'])

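# beginn_flag is True while a "beginn" is waiting for its matching "end"
beginn_flag = False
beginn_value = 0
for i, row in df_test_2.iterrows():
    state = row['A']
    value = row['B']

    if not beginn_flag and state == 'beginn':
        # Open a new sequence; further "beginn" rows are ignored until it closes
        beginn_flag = True
        beginn_value = value
    elif beginn_flag and state == 'end':
        # Close the sequence at the first "end" that is at least 40 away
        if value >= beginn_value + 40:
            new_seq = Sequence(beginn_value, value)
            sequence_list.append(new_seq)
            beginn_flag = False

print(sequence_list)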

This code outputs the following:

[Sequence(beginn=10, end=50), Sequence(beginn=70, end=110)]

Two sequences, one starting at 10 and ending at 50 and the other one starting at 70 and ending at 110.
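Since the question mentions about 1.2 million rows, it may be worth avoiding `iterrows()`, which builds a Series object for every row. The following is only a sketch of the same matching logic wrapped in a function that walks two plain iterables with `zip`; the function name `match_sequences` and the parameter `min_distance` are made up for this illustration and are not part of pandas.

```python
from collections import namedtuple

Sequence = namedtuple("Sequence", ["beginn", "end"])


def match_sequences(states, values, min_distance=40):
    """Pair each open 'beginn' with the first 'end' at least min_distance away.

    Hypothetical helper for illustration; same rules as the loop above.
    """
    sequences = []
    beginn_value = None  # None means no sequence is currently open
    for state, value in zip(states, values):
        if beginn_value is None and state == "beginn":
            beginn_value = value
        elif beginn_value is not None and state == "end":
            if value >= beginn_value + min_distance:
                sequences.append(Sequence(beginn_value, value))
                beginn_value = None
    return sequences


# Same data as df_test_2, written as plain lists so the sketch runs on its own
states = ["end", "beginn", "end", "end", "beginn", "beginn",
          "end", "end", "end", "beginn", "end"]
values = [1, 10, 50, 60, 70, 80, 90, 100, 110, 111, 112]

result = match_sequences(states, values)
print(result)
```

With the DataFrame from the question, the call would presumably be `match_sequences(df_test_2["A"].tolist(), df_test_2["B"].tolist())`.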

  • That is amazing. Thanks a lot – user8495738 Oct 06 '18 at 13:44
  • Thanks a lot again. Besides the solution to my problem, I learned a new type, "namedtuple" – user8495738 Oct 06 '18 at 16:01
  • Yeah, named tuples are great for creating a quick printable object: https://docs.python.org/3.6/library/collections.html#collections.namedtuple. Later on you can access its attributes as you would on a class instance. Example: `new_seq = Sequence(beginn=10, end=50); print(new_seq.beginn)` – Fernando Irarrázaval G Oct 06 '18 at 16:27
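
As a small follow-up to the named tuple comments, here is a sketch of a few ways the resulting list could be used afterwards. The list is hard-coded to the answer's output, and the variable names `lengths` and `as_dicts` are chosen only for this example.

```python
from collections import namedtuple

Sequence = namedtuple("Sequence", ["beginn", "end"])

# Hard-coded copy of the answer's output, so the sketch runs on its own
sequence_list = [Sequence(10, 50), Sequence(70, 110)]

# Attribute access: distance covered by each sequence
lengths = [seq.end - seq.beginn for seq in sequence_list]

# A named tuple still unpacks like a regular tuple
beginn, end = sequence_list[0]

# _asdict() turns each entry into a mapping of field name to value
as_dicts = [seq._asdict() for seq in sequence_list]

print(lengths)
print(beginn, end)
```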