Extracting ISIN, Cusip and other patterns from text in Python

Question

An ISIN, or International Securities Identification Number, is a 12-character code consisting of digits and letters that uniquely identifies a security.

An example of an ISIN is: LU1234567890

If I have a long text string, how can I detect the ISIN pattern and extract all the ISINs from it in Python?

You could use some sort of regular expression, but you need to clarify the actual format. – Jan, Apr 07 '20 at 22:18
Is the ISIN format two letters followed by ten digits? So 123456789012 and ABCDE1234567 are not ISINs? – Gilseung Ahn, Apr 07 '20 at 22:19

score 1 · Answer 1 · answered Apr 07 '20 at 22:22


You can use Python's regular expressions module (re), which is designed for this kind of pattern matching. Here's a tutorial: https://www.guru99.com/python-regular-expressions-complete-tutorial.html

answered Apr 07 '20 at 22:22

martin


That explains how to extract a pattern in general, but two letters followed by 10 digits is quite a specific pattern. – WJA Apr 07 '20 at 22:23

Gilseung Ahn · Answer 2 · 2020-04-07T22:39:08.777

Use the string methods isalpha() and isdigit() as follows.

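def find_ISIN(S):
    ISIN_list = []
    # len(S) - 11 so that a candidate ending at the last character is also checked
    for i in range(len(S) - 11):
        # two letters followed by ten digits
        if S[i:i+2].isalpha() and S[i+2:i+12].isdigit():
            ISIN_list.append(S[i:i+12])
    return ISIN_list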

Test code:

S = 'LU1234567890-SU1234567890f$3121'
print(find_ISIN(S))

The output is ['LU1234567890', 'SU1234567890'].

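Edit: If S[i:i+12] matches the ISIN format (two letters followed by ten digits), then S[i+k:i+k+12] for 1 <= k <= 11 cannot match it, because such a window would start either with a digit or with a letter followed by a digit. So I edited the code to skip those positions.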

def find_ISIN(S):
    ISIN_list = []
    i = 0
    while i <= len(S) - 12:
        if S[i:i+2].isalpha() and S[i+2:i+12].isdigit():
            ISIN_list.append(S[i:i+12])
            i += 12
        else:
            i += 1
    return ISIN_list

score 1 · Accepted Answer · answered Apr 07 '20 at 22:30


You could use:

\b[A-Z]{2}\d{10}\b

See a demo on regex101.com.

In Python:

import re

for number in re.finditer(r'\b[A-Z]{2}\d{10}\b', your_actual_string):
    print(number.group(0))
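
As a quick check, here is a minimal sketch that wraps the same pattern in a function and runs it on the test string from the isalpha()/isdigit() answer. The function name extract_isins is only an illustrative choice. Note that the \b anchors make this stricter than the slicing approach: the second candidate is rejected because it is immediately followed by the letter f.

```python
import re

# Same pattern as above: two uppercase letters, ten digits, word boundaries on both sides.
ISIN_LIKE = re.compile(r'\b[A-Z]{2}\d{10}\b')

def extract_isins(text):
    """Return all non-overlapping matches of the simplified ISIN pattern."""
    return ISIN_LIKE.findall(text)

S = 'LU1234567890-SU1234567890f$3121'
print(extract_isins(S))  # ['LU1234567890']
```

If identifiers glued to other letters or digits should also be found, dropping the \b anchors is one option, at the cost of more false positives.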

answered Apr 07 '20 at 22:30

Jan


score 1 · Answer 4 · answered Apr 07 '20 at 23:00

The ISIN format is defined by ISO 6166. The last character is a single check digit, so in theory a regex alone is not enough.

One option is to iterate over all 12-character [A-Z]{2}[A-Z0-9]{9}[0-9] sequences and verify the checksum. In any normal text, that should be enough.

# isin_regex and check_isin are placeholders: the pattern above and a checksum validator
for isin in re.findall(isin_regex, text):
    if check_isin(isin):
        print('ISIN found: %s' % isin)
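
To make the loop above concrete, here is a minimal sketch of what check_isin could look like. It assumes the usual ISIN check: letters are expanded to numbers (A=10 ... Z=35) and the Luhn algorithm is applied to the resulting digit string. The names ISIN_RE and find_valid_isins, and the \b anchors, are my own additions for illustration.

```python
import re

# Structural pattern from the text: 2 letters, 9 alphanumerics, 1 check digit.
# The \b anchors are an added assumption to avoid matching inside longer tokens.
ISIN_RE = re.compile(r'\b[A-Z]{2}[A-Z0-9]{9}[0-9]\b')

def check_isin(isin):
    """Return True if the candidate passes the Luhn check on its expanded digits."""
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = ''.join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Walk from the right and double every second digit,
    # starting with the digit just left of the check digit.
    for pos, ch in enumerate(reversed(digits)):
        d = int(ch)
        if pos % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_valid_isins(text):
    """Return pattern matches that also pass the checksum."""
    return [c for c in ISIN_RE.findall(text) if check_isin(c)]

text = 'Real: US0378331005, made up: LU1234567890.'
print(find_valid_isins(text))  # ['US0378331005']
```

The made-up example from the question, LU1234567890, has the right shape but fails the checksum, which is exactly the case the check is meant to filter out.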

One could even argue that the probability of a wrong ISIN in a common text is very small and we could do without the check.

For the sake of the discussion, assuming the text is an arbitrary sequence of data, re.findall is no longer an option because it only finds non-overlapping sequences, which means a wrong ISIN could hide a real one. This question has already been answered elsewhere.
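
One common way to get overlapping matches with the re module is to put the pattern inside a zero-width lookahead with a capturing group. The sketch below only illustrates that idea and is not necessarily the approach from the linked answer; the function name and the sample string are made up.

```python
import re

PLAIN = re.compile(r'[A-Z]{2}[A-Z0-9]{9}[0-9]')
# The lookahead consumes no characters, so the scan advances one position at a time
# and every start position gets a chance to match.
OVERLAPPING = re.compile(r'(?=([A-Z]{2}[A-Z0-9]{9}[0-9]))')

def overlapping_candidates(text):
    """Return every 12-character window that has the ISIN shape, overlaps included."""
    return [m.group(1) for m in OVERLAPPING.finditer(text)]

data = 'XXUS0378331005'
print(PLAIN.findall(data))           # ['XXUS03783310']
print(overlapping_candidates(data))  # ['XXUS03783310', 'XUS037833100', 'US0378331005']
```

Here the plain pattern consumes the first 12 characters and never sees US0378331005, while the lookahead version returns it as a candidate. Each candidate would still need the checksum test.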

If performance is an issue and you are in a complex case, it should be possible to implement a DFA-like algorithm to find them in near-linear time.

score 0 · Answer 5 · answered Apr 07 '20 at 22:26

I would use regular expressions with:

import re

And a pattern similar to the one laid out here:

RegEx for ISIN with at least 1 number

If your string has a very large number of consecutive 12-character letter/digit combinations that look like ISINs and would be erroneously picked up by a naive parser, one option would be to download a list of all ISINs/CUSIPs/etc., put them in a hash table, and use this as a relatively quick additional filter.
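
A rough sketch of that filter, assuming the reference list has already been loaded into a Python set (which is hash-based). KNOWN_ISINS below is a placeholder with a single entry for illustration, and the pattern and function name are assumptions, not something prescribed above.

```python
import re

# Placeholder reference data: in practice this set would be built from a
# downloaded master list of identifiers.
KNOWN_ISINS = {'US0378331005'}

# Assumed candidate shape: 2 letters, 9 alphanumerics, 1 digit.
CANDIDATE_RE = re.compile(r'\b[A-Z]{2}[A-Z0-9]{9}[0-9]\b')

def filter_known_isins(text, known=KNOWN_ISINS):
    """Return candidates from text that also appear in the reference set."""
    # Set membership is an average O(1) lookup, so the filter stays cheap.
    return [c for c in CANDIDATE_RE.findall(text) if c in known]

print(filter_known_isins('Holdings: US0378331005, LU1234567890'))  # ['US0378331005']
```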
