Pandas : Find the first occurence of a string then take all characters thereafter up until anther string

Question

I'm looking to update a column with the ISIN or CUSIP portion of a string which is contained in another column:

my_DestSystemNote1_string =  'ISIN=XS1906311763|CUSIP=         |CalTyp=1'
dfDest = [('DestSystemNote1', ['ISIN=XS1906311763|CUSIP=         |CalTyp=1', 
                               'ISIN=XS0736418962|CUSIP=         |CalTyp=1', 
                               'ISIN=XS1533910508|CUSIP=         |CalTyp=1', 
                               'ISIN=US404280AS86|CUSIP=404280AS8|CalTyp=1', 
                               'ISIN=US404280BW89|CUSIP=404280BW8|CalTyp=21',
                               'ISIN=US06738EBC84|CUSIP=06738EBC8|CalTyp=21',
                               'ISIN=XS0736418962|CUSIP=         |CalTyp=1',]),
         ]
# create pandas df
dfDest = pd.DataFrame.from_items(dfDest)
display(dfDest)
print("")

DestSystemNote1 contains the source string from which either ISIN or CUSIP need to be extracted:

DestSystemNote1                                Found_ISIN    Found_CUSIP
ISIN=XS1906311763|CUSIP=         |CalTyp=1     XS1906311763
ISIN=XS0736418962|CUSIP=         |CalTyp=1     XS0736418962
ISIN=XS1533910508|CUSIP=         |CalTyp=1     XS1533910508
ISIN=US404280AS86|CUSIP=404280AS8|CalTyp=1     US404280AS86  404280AS8
ISIN=US404280BW89|CUSIP=404280BW8|CalTyp=21    US404280BW89  404280BW8
ISIN=US06738EBC84|CUSIP=06738EBC8|CalTyp=21    US06738EBC84  06738EBC8
ISIN=XS0736418962|CUSIP=         |CalTyp=1     XS0736418962

ISIN's will always be preceeded by "ISIN=" and end a character before "|"

CUSIPS's will always be preceeded by "CUSIP=" and end a character before "|"

My attempt is as follows:

my_DestSystemNote1_string =  'ISIN=XS1906311763|CUSIP=         |CalTyp=1'
code = my_DestSystemNote1_string.split("ISIN=",1)[1]
code = code[:12]
print(code)

XS1906311763

So I'm getting there but would like to paramaterise it to find the nth occurence of a passed string (strStart) then take all characters after it's end char + 1 and up to, but not including; the nth occurence of another string (strEnd).

Pete

Could you give us more details about your expeted output and the given input ? — Charles R, Nov 09 '18 at 08:53
This might help you https://stackoverflow.com/questions/1883980/find-the-nth-occurrence-of-substring-in-a-string — Karl, Nov 09 '18 at 09:00

Karl · Accepted Answer · 2018-11-09T21:39:16.980

1

Working from this answer (Find the nth occurrence of substring in a string):

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)

you could do something along the following lines:

def split_between(input_string, start_str, start_occurence, end_str, end_occurence):
    start_index = findnth(input_string, start_str, start_occurence-1) + len(start_str)
    end_index = findnth(input_string, end_str, end_occurence-1)
    return input_string[start_index:end_index]

input_string="ISIN=111111|ISIN=222222|333333|ISIN=444444"

split_between(input_string, "ISIN=", 2, "|", 2)
# returns "222222"

edited Nov 09 '18 at 21:39

answered Nov 09 '18 at 09:15

Karl

5,573
8
50
73

Thanks @Karl, that's awsome. – Peter Lucas Nov 09 '18 at 09:18
No worries. Please give the answer I used in order to build up mine an upvote. That is definitely the more difficult part! – Karl Nov 09 '18 at 09:21
Hey @Karl, just trying to implement this for the ISN column. I'm getting no records being returned. This my call: – Peter Lucas Nov 09 '18 at 10:01
dfDest['FOUND_ISIN'] = "" for index, row in dfDest.iterrows(): try: print(row.DestSystemNote1) row.FOUND_ISIN2 = split_between(dfDest.DestSystemNote1, "ISIN=", 1, "|", 1) print ('DestSystemNote1=' + row.DestSystemNote1 + " " + 'FOUND_ISIN= ' + dfDest.FOUND_ISIN) except: pass # doing nothing on exception display(dfDest) – Peter Lucas Nov 09 '18 at 10:02
Surely it should be ...split_between(row.DestSystemNote1, "ISIN=", 1, "|", 1)...? I.e. replace "dfDest" with "row" – Karl Nov 09 '18 at 10:26
Ah yes, thanks @Karl. : row.FOUND_ISIN = split_between(row.DestSystemNote1, "ISIN=", 1, "|", 1) – Peter Lucas Nov 09 '18 at 10:55

Pandas : Find the first occurence of a string then take all characters thereafter up until anther string

1 Answers1