0

An ISIN code, or international securities identification number, is a 12 digit code consisting of numbers and letters that distinctly identify securities.

An example of an ISIN is: LU1234567890

If I have a long text string, how can I detect an ISIN-pattern and extract all the isins from it in Python?

WJA
  • 6,676
  • 16
  • 85
  • 152

5 Answers5

1

You can use the Regular Expressions module, it is specific for that. Here's the doc: https://www.guru99.com/python-regular-expressions-complete-tutorial.html

martin
  • 887
  • 7
  • 23
  • That is to extract a pattern, but a 2 digit followed by 10 numbers is quite a specific pattern. – WJA Apr 07 '20 at 22:23
1

Use string method isalpha() and isdigit() as follows.

def find_ISIN(S):
    ISIN_list = []
    for i in range(len(S) - 12):
        if S[i:i+2].isalpha() and S[i+2:i+12].isdigit():
            ISIN_list.append(S[i:i+12])    
    return ISIN_list

Code test.

S = 'LU1234567890-SU1234567890f$3121'
print(find_ISIN(S))

The output is ['LU1234567890', 'SU1234567890'].

Edit. If S[i:i+12] is ISIN format, then S[i+k:i+k+12] for 1<=k<=11 cannot be ISIN. Thus, I edited the code to avoid to find S[i+k:i+k+12].

def find_ISIN(S):
    ISIN_list = []
    i = 0
    while i <= len(S) - 12:
        if S[i:i+2].isalpha() and S[i+2:i+12].isdigit():
            ISIN_list.append(S[i:i+12])
            i += 12
        else:
            i += 1
    return ISIN_list
Gilseung Ahn
  • 2,598
  • 1
  • 4
  • 11
1

You could use:

\b[A-Z]{2}\d{10}\b

See a demo on regex101.com.


In Python:
import re

for number in re.finditer(r'\b[A-Z]{2}\d{10}\b', your_actual_string):
    print(number.group(0))
Jan
  • 42,290
  • 8
  • 54
  • 79
1

ISIN format is defined by ISO 6166. The last character is a single check digit. So regex is not enough in theory.

One option is to iterate on all 12 character [A-Z]{2}[A-Z0-9]{9}[0-9]sequences and verify the checksum. In any normal text, that should be enough.

 for isin in re.findall(isin_regex, text):
     If check_isin(isin):
         print('ISIN found: %s' % isin)

One could even argue that the probability of a wrong ISIN in a common text is very small and we could do without the check.

For the sake of the discussion, assuming the text is any sequence of data, re.findall is no longer an option because it only finds non overlapping sequences. Which means a wrong ISIN could hide a real one. This question has already been answered elsewhere.

If performance is an issue and you are in a complex case, it should be possible to implement a DFA like algorithm to find them in near linear complexity.

Michael Doubez
  • 5,937
  • 25
  • 39
0

I would use regular expressions with:

import re

And a query similar to laid out here:

RegEx for ISIN with at least 1 number

If you string has a very large number of consecutive 12 character/digit combinations that look like ISINs and would be erroneously picked up by a naive parser, an option would be to downloaded list of all ISINs/CUSIPs/etc, throw them in a hash table and use this as a relatively quick additional filter.

Alan
  • 49
  • 5