
Say I have the df:

Name         Sequence
Bob             IN,IN
Marley         OUT,IN
Jack     IN,IN,OUT,IN
Harlow               

The df has names and comma-separated sequences of 'IN'/'OUT' values. The Sequence column can contain blank values. How can I apply these two functions to the Sequence column efficiently? Something like this pseudocode:

df['Sequence'] = converter(sequencer(df['Sequence']))

# takes a string of IN/OUT, converts it to bits (IN=1, OUT=0), returns a bitstring. 'IN,OUT,IN' -> '101'
def sequencer(seq):
    # 'IN,IN' -> ['IN', 'IN']
    seq = seq.split(',')
    # get sequence up to 3 unique digits. [0,0,1,1,0] = sequence 010
    seq = [1 if x == 'IN' else 0 for x in seq]
    a = seq[0]
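    try:
        # position of the first value that differs from a
        b = seq.index(1-a, 1)
    except ValueError:
        return str(a)
    # slice (not index) so we search everything after position b
    if a not in seq[b+1:]:
        return str(a) + str(1-a)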

    return str(a) + str(1-a) + str(a)

# converts bitstring back into in/out format
def converter(seq):
    return '-'.join(['IN' if x == '1' else 'OUT' for x in seq])

to result in this dataframe?

Name         Sequence
Bob                IN
Marley         OUT-IN
Jack        IN-OUT-IN
Harlow  

I glanced at this post here, and the comments say not to use apply because it's inefficient. I need efficiency since I'm working with a large dataset.

  • I don't think there's a huge issue here to use `apply()`. This is specifically what it was designed for - apply a function to a column of a dataframe. – NotAName Mar 23 '21 at 23:39
  • How large is your dataset? If it's a few million rows it shouldn't take too long. Try to speed up your function, because it doesn't seem trivial to vectorize the change you want to make. – politinsa Mar 23 '21 at 23:42
  • @pavel ok i'll just go with apply. how would i do that? – Jerry Stackhouse Mar 23 '21 at 23:51
  • Why are you doing all these shenanigans just to do a replace comma with dash operation? `df.Sequence.fillna('').str.replace(',', '-')` – piRSquared Mar 23 '21 at 23:53
  • @piRSquared that's what I was thinking, but row with "Jack" as the name, the input goes from "IN,IN,OUT,IN" to "IN-OUT-IN" (dropping the first or second "IN") – Cameron Riddell Mar 23 '21 at 23:55
  • it's not just replacing the char. it is finding the sequence and reducing it to the first 3 'unique' in/outs @piRSquared i fixed the resulting df to accurately depict the logic, sorry for the confusion – Jerry Stackhouse Mar 23 '21 at 23:55
  • Alright, I see now. – piRSquared Mar 23 '21 at 23:58

1 Answer


itertools

  • use `groupby` to collapse each run of consecutive repeats into a single item
  • use `islice` to take the first 3.

from itertools import islice, groupby

def f(s):
    return '-'.join([k for k, _ in islice(groupby(s.split(',')), 3)])

df.assign(Sequence=[*map(f, df.Sequence.fillna(''))])

     Name   Sequence
0     Bob         IN
1  Marley     OUT-IN
2    Jack  IN-OUT-IN
3  Harlow           
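
To see what each piece of `f` contributes, here is a minimal step-by-step sketch on Jack's value from the question. The intermediate names (`raw`, `tokens`, `runs`, `first_three`) are illustrative only and do not appear in the answer's code.

```python
from itertools import groupby, islice

raw = 'IN,IN,OUT,IN'

# Split the comma-separated string into individual tokens.
tokens = raw.split(',')                  # ['IN', 'IN', 'OUT', 'IN']

# groupby yields one (key, group) pair per run of equal neighbours,
# so keeping only the keys collapses consecutive repeats.
runs = [k for k, _ in groupby(tokens)]   # ['IN', 'OUT', 'IN']

# islice caps the result at the first 3 runs.
first_three = list(islice(runs, 3))

result = '-'.join(first_three)           # 'IN-OUT-IN'

# A blank value (after fillna('')) splits into [''], which joins back to ''.
blank_result = '-'.join(k for k, _ in groupby(''.split(',')))
```

Note that `groupby` only merges adjacent duplicates, which is why the trailing `IN` survives as a separate item.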

Variation using a closure, so the number of items, the splitter, and the joiner can all be changed later.

from itertools import islice, groupby

def get_f(n, splitter=',', joiner='-'):
    def f(s):
        return joiner.join([k for k, _ in islice(groupby(s.split(splitter)), n)])
    return f

df.assign(Sequence=[*map(get_f(3), df.Sequence.fillna(''))])
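
As a rough illustration of what the closure buys you, the sketch below calls `get_f` with other settings on a plain list standing in for `df.Sequence.fillna('')`. The list `values` and the `'>'` joiner are assumptions made up for this example.

```python
from itertools import groupby, islice

def get_f(n, splitter=',', joiner='-'):
    def f(s):
        return joiner.join([k for k, _ in islice(groupby(s.split(splitter)), n)])
    return f

# Stand-in for the Sequence column after fillna('') (hypothetical data holder).
values = ['IN,IN', 'OUT,IN', 'IN,IN,OUT,IN', '']

# Keep only the first two runs.
first_two = [*map(get_f(2), values)]
# ['IN', 'OUT-IN', 'IN-OUT', '']

# Keep three runs but join with a different separator.
arrows = [*map(get_f(3, joiner='>'), values)]
# ['IN', 'OUT>IN', 'IN>OUT>IN', '']
```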

Another variation that makes it more obvious what I'm doing (less obnoxious Python bling)

from itertools import islice, groupby

def get_f(n, splitter=',', joiner='-'):
    def f(s):
        return joiner.join([k for k, _ in islice(groupby(s.split(splitter)), n)])
    return f

f = get_f(3)
df['Sequence-InOut'] = [f(s) for s in df.Sequence.fillna('')]
df

     Name      Sequence Sequence-InOut
0     Bob         IN,IN             IN
1  Marley        OUT,IN         OUT-IN
2    Jack  IN,IN,OUT,IN      IN-OUT-IN
3  Harlow          None               
piRSquared
  • thanks! so just by playing with your first example i can figure out that 'Sequence=' sets the name of the column and if I change it to something like Sequence123, it will create a new column. is there a way to set the column name to a string like 'Sequence-InOut'? i tried passing in a str var but that just writes out the actual name of the var. also, can you explain the arguments in the brackets a bit? it seems like it's mapping f to everything under df.Sequence, but what is the '*' for? – Jerry Stackhouse Mar 24 '21 at 00:21
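
On the two points raised in that comment, a small sketch may help; it is not part of the original answer. It assumes the example DataFrame from the question, with the blank value stored as `None`. `[*map(f, xs)]` unpacks the `map` iterator into a list, which is the same as `list(map(f, xs))`. Because `assign` takes column names as keyword arguments, a name that is not a valid identifier (such as `'Sequence-InOut'`) can be passed by unpacking a dict with `**`.

```python
from itertools import groupby, islice

import pandas as pd

def f(s):
    return '-'.join([k for k, _ in islice(groupby(s.split(',')), 3)])

# Example frame rebuilt from the question (assumed: blank stored as None).
df = pd.DataFrame({
    'Name': ['Bob', 'Marley', 'Jack', 'Harlow'],
    'Sequence': ['IN,IN', 'OUT,IN', 'IN,IN,OUT,IN', None],
})

# The star unpacks the map iterator into the list literal.
unpacked = [*map(f, df.Sequence.fillna(''))]
same = list(map(f, df.Sequence.fillna('')))

# A column name held in a string variable, passed via dict unpacking.
col = 'Sequence-InOut'
out = df.assign(**{col: unpacked})
```

Plain item assignment, `df[col] = unpacked`, also accepts any string as the column name, but it modifies `df` in place rather than returning a new frame.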