Extracting text after specific character set from a text file using regex in python

Question

Hi, I have text in the format below, from which I want to save each name (ex: 2ND ACADEMY OF NATURAL SCIENCES) and its a.k.a. names, along with the original name, in a dictionary in the following format.

I tried to do it using the following code, but I was not able to extract the pattern:

re.findall(r'[a-z A-z 0-9 /n/-]+', ^[a.k.a.][a-z A-z 0-9 /n/-]+', textData)
re.findall(r'a.k.a. : (\S+)', textData)

I am completely confused about how to go about it. Can someone help with this?

#Expected Output

"2ND COMPLEX OF NEURAL SCIENCES":["2ND COMPLEX OF NATURAL NEURAL", "ACADEMY OF NEURAL 
SCIENCES", "CHE 2 CHAON KWAHAK-WON", "KUKPAN KAHAK-WON", "SECOND COMPLEX OF NEURAL SCIENCES 
RESEARCH INSTITUTE"]

"LOSTIK VE HAVAIK HIZMETLARI LTD":["LOSTIK VE HAVAIK HIZMETLARI LTD"]

"7 KARNES":["7 KARNES"]

"SWING OF TIR":["7TH OF TIR COMPLEX", "7TH OF TIR INDUSTRIAL COMPLEX", "7TH OF TIR 
INDUSTRIES", "7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN", "MOJTAMAE SANATE HAFTOME TIR" etc]

#textData.txt

2ND COMPLEX OF NEURAL SCIENCES (a.k.a. ACADEMY OF NEURAL 
SCIENCES; a.k.a. CHE 2 CHAON KAHAK-WON; a.k.a. CHE 2 CHAYON KAHAK-WON;
a.k.a. KUKPAN KAHAK-WON; a.k.a. NATIONAL DEFENSE ACADEMY; a.k.a.
SANSRI; a.k.a. SECOND COMPLEX OF NEURAL SCIENCES; a.k.a. SECOND
COMPLEX OF NEURAL SCIENCES RESEARCH INSTITUTE), Pyongyang, Korea,
North; Secondary sanctions risk: North Korea Sanctions Regulations,
sections 510.201 and 510.210; Transactions Prohibited For Persons
Owned or Controlled By U.S. Financial Institutions: North Korea
Sanctions Regulations section 510.214.

LOSTIK VE HAVAIK HIZMETLARI LTD., No. 3/182 Antepe
Bagdat Cad. Istasyon Yolu Sok., Istanbul 34840, Turkey; Additional
Sanctions Information - Subject to Secondary Sanctions.
[IFSR] (Linked To: MAHAN AIR).

7 KARNES, Avenida Ciudad de Cali No. 15A-91, Local A06-07, Bogota,
Colombia; Matricula Mercantil No 1978075 (Colombia).

SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR INDUSTRIAL
COMPLEX; a.k.a. 7TH OF TIR INDUSTRIES; a.k.a. 7TH OF TIR INDUSTRIES
OF ISFAHAN/ESFAHAN; a.k.a. MOJTAMAE SANATE HAFTOME TIR; a.k.a.
SANAYE HAFTOME TIR; a.k.a. SEVENTH OF TIR), Mobarakeh Road Km 45,
Isfahan, Iran; P.O. Box 81465-478, Isfahan, Iran; Additional
Sanctions Information - Subject to Secondary Sanctions.

The fourth bird · Answer 1 · 2021-08-12T08:51:21.720

You could use 2 capture groups, and split the value of group 2 on (?:;\s)?a\.k\.a\.\s to get the separate values.

Using re.findall will return the capture group values

^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?

The pattern matches

^ Start of string (or start of each line when re.M is used)
( Capture group 1
- [A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b Match uppercase chars, digits and spaces, ending on a word character and not on a space
) Close group 1
(?: Non capture group
- A space and \( Match a space followed by (
- ( Capture group 2
  - a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)* Match repeating parts that start with a.k.a. followed by matching any char except for ( and )
- ) Close group 2
- \) Match )
)? Close non capture group and make it optional


For example

import re
import pprint

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

with open('textData.txt') as f:
    textData = f.read()
    d = {}
    for t in re.findall(pattern, textData, re.M):
        parts = [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", t[1]) if p]
        parts.insert(0, (t[0]))
        d[t[0]] = parts

    pprint.pprint(d)

Output

{'2ND COMPLEX OF NEURAL SCIENCES': ['2ND COMPLEX OF NEURAL SCIENCES',
                                    'ACADEMY OF NEURAL \nSCIENCES',
                                    'CHE 2 CHAON KAHAK-WON',
                                    'CHE 2 CHAYON KAHAK-WON',
                                    'KUKPAN KAHAK-WON',
                                    'NATIONAL DEFENSE ACADEMY',
                                    'SANSRI',
                                    'SECOND COMPLEX OF NEURAL SCIENCES',
                                    'SECOND\n'
                                    'COMPLEX OF NEURAL SCIENCES RESEARCH '
                                    'INSTITUTE'],
 '7 KARNES': ['7 KARNES'],
 'LOSTIK VE HAVAIK HIZMETLARI LTD': ['LOSTIK VE HAVAIK HIZMETLARI LTD'],
 'SWING OF TIR': ['SWING OF TIR',
                  '7TH OF TIR COMPLEX',
                  '7TH OF TIR INDUSTRIAL\nCOMPLEX',
                  '7TH OF TIR INDUSTRIES',
                  '7TH OF TIR INDUSTRIES\nOF ISFAHAN/ESFAHAN',
                  'MOJTAMAE SANATE HAFTOME TIR',
                  'SANAYE HAFTOME TIR',
                  'SEVENTH OF TIR']}
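
Note that names which wrap across lines in the file keep their embedded newline in the output above (for example '7TH OF TIR INDUSTRIAL\nCOMPLEX'). A minimal sketch of one way to collapse those, assuming any run of whitespace inside a name can be replaced by a single space; the `normalize` helper and the inline `sample` string are illustrative stand-ins, not part of the original answer:

```python
import re

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

# Inline sample standing in for the contents of textData.txt
sample = (
    "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR INDUSTRIAL\n"
    "COMPLEX; a.k.a.\n"
    "SANAYE HAFTOME TIR), Mobarakeh Road Km 45,\n"
    "Isfahan, Iran.\n"
)

def normalize(name):
    """Hypothetical helper: collapse newlines and repeated spaces into one space."""
    return " ".join(name.split())

aliases = {}
for primary, aka_block in re.findall(pattern, sample, re.M):
    # Split group 2 on the a.k.a. markers and drop the empty leading piece
    parts = [normalize(p) for p in re.split(r"(?:;\s)?a\.k\.a\.\s", aka_block) if p]
    aliases[primary] = [primary] + parts

print(aliases)
```

The regex and the split pattern are unchanged; only the post-processing of each piece differs.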

I have to avoid a.k.a. and take only the text after that...and save all the texts after a.k.a in list — Sherlock, Aug 12 '21 at 08:35
I am getting an error TypeError: expected string or bytes-like object — Sherlock, Aug 12 '21 at 08:51
Yaa I checked it is exact but I have thousands of these in a .txt file and I am loading it using open() and later converted it using str(testDoc) but still getting the error — Sherlock, Aug 12 '21 at 08:56
@Sherlock I have updated the code using `textData = f.read()` to read the whole file. — The fourth bird, Aug 12 '21 at 08:56
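
For readers hitting the same TypeError: the `re` functions need the file's contents as a string, not the file object, and calling `str()` on a file object only returns its repr. A small sketch of the difference, using `io.StringIO` as a stand-in for the opened file and a simplified illustrative pattern (both are assumptions, not the code from the thread):

```python
import io
import re

# Stand-in for: handle = open('textData.txt')
handle = io.StringIO("SANSRI (a.k.a. KUKPAN KAHAK-WON)\n")

# Passing the file object itself to re raises TypeError
try:
    re.findall(r"a\.k\.a\. ([^;)]+)", handle)
    raised = False
except TypeError:
    raised = True

# str() on the object gives its repr, not the text it contains
as_repr = str(handle)

# .read() returns the actual contents, which re can search
text = handle.read()
found = re.findall(r"a\.k\.a\. ([^;)]+)", text)

print(raised, as_repr, found)
```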

tripleee · Accepted Answer · 2021-08-12T08:48:03.477


You seem to be confused about the meaning of square brackets. Perhaps review "What is the difference between square brackets and parentheses in a regex?"
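
A short illustration of that difference, as a sketch using strings made up for this example: square brackets define a character class that matches one character from a set, while parentheses group a sequence. It also shows two side effects relevant to the attempted pattern: the range `A-z` covers some punctuation between `Z` and `a`, and an unescaped `.` matches any character.

```python
import re

# [a.k.a.] is a character class: it matches ONE character that is 'a', '.' or 'k'
single_chars = re.findall(r"[a.k.a.]", "a.k.a.")

# (a\.k\.a\.) is a group: it matches the literal sequence a.k.a. as a whole
groups = re.findall(r"(a\.k\.a\.)", "a.k.a.")

# The range A-z also includes the ASCII characters between 'Z' and 'a',
# such as '[' and '_'; use A-Za-z for letters only
range_hits = re.findall(r"[A-z]", "A_z[")

# Outside a character class, an unescaped dot matches any character
loose = re.findall(r"a.k.a.", "aXkYaZ")

print(single_chars, groups, range_hits, loose)
```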

Your requirements seem rather unclear, but something like this?

import re

with open('textData.txt', 'r') as lines:
    text = lines.read()

for segment in text.split('\n\n'):
    para = ' '.join(segment.splitlines())
    if para:
        name = re.match(r'^[^,()]+(?=, | \()', para)
        if name:
            akas = [name.group(0)]
            akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
            print('"%s": ["%s"]' % (name.group(0), '", "'.join(akas)))

This assumes that each record is separated from every other record by an empty line, and that the file is small enough to fit into memory.
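
If the file were too large to read in one go, the same idea could be applied while streaming it line by line. The following is only a sketch under that assumption: `iter_records` and `parse_record` are hypothetical helper names, and `io.StringIO` stands in for the opened file. Unlike the code above, it also strips each line before joining, so trailing spaces do not produce double spaces.

```python
import io
import re

def iter_records(lines):
    """Yield each blank-line-separated record joined into a single line."""
    buffer = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            buffer.append(stripped)
        elif buffer:
            # A blank line ends the current record
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # Last record when the file does not end with a blank line
        yield " ".join(buffer)

def parse_record(para):
    """Return (name, [name, aka, ...]) or None if no name is found."""
    name = re.match(r'^[^,()]+(?=, | \()', para)
    if not name:
        return None
    akas = [name.group(0)]
    akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
    return name.group(0), akas

# Stand-in for: open('textData.txt', 'r')
sample = io.StringIO(
    "7 KARNES, Avenida Ciudad de Cali No. 15A-91, Local A06-07, Bogota,\n"
    "Colombia; Matricula Mercantil No 1978075 (Colombia).\n"
    "\n"
    "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a.\n"
    "SEVENTH OF TIR), Mobarakeh Road Km 45,\n"
    "Isfahan, Iran.\n"
)

records = [parse_record(p) for p in iter_records(sample)]
print(records)
```

Only one record is held in memory at a time while iterating; the list comprehension at the end is just for the demonstration.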

edited Aug 12 '21 at 08:48

answered Aug 12 '21 at 08:19

tripleee


they are in different lines and each section of the text is again divided by a new empty line...which I should use to separate the text – Sherlock Aug 12 '21 at 08:39
Refactored to collect the lines into a paragraph. It made the code slightly more complex but it should not be hard to follow. – tripleee Aug 12 '21 at 08:48
It is creating duplicate keys, so I am not able to use the output further because it is not a valid dictionary. Can you suggest how I can avoid creating duplicate keys like the ones below? – Sherlock Aug 16 '21 at 09:01
"ABDRABBA": ["ABDRABBA", "ABDRABBA, Ghunia", "ABD'RABBAH, Ghuma", "ABDURABBA, Ghunia", “ABD'RABBAH”, “ABU JAMIL”], "ABDRABBA": ["ABDRABBA", "ABDRABBA, Ghoma", "ABD'RABBAH, Ghuma", "ABDURABBA, Ghunia", “ABD'RABBAH”, “ABU JAMIL”], "'ABDU": ["'ABDU"], "'ABDU": ["'ABDU"], – Sherlock Aug 16 '21 at 09:02
As you have probably discovered, there is no sane way to post properly formatted information in comments. If you have additional requirements which were not in your question, probably (accept one of the answers here, or post an answer of your own and accept that, and) ask a new question with your _actual_ requirements, and more details than the current one. – tripleee Aug 16 '21 at 09:03
Finding if you have already processed an entry should be trivial as such; keep a set or a dictionary of the ones you have already seen, and skip the ones you see again. Whether you want to check whether the extracted values are the same or not is a separate question, which again is probably significant enough to warrant a new, separate post with _exactly_ the input you want to process, and _exactly_ the expected output. In brief, if you want to collect information from multiple inputs, you need to keep them in memory, and only print them at the end. – tripleee Aug 16 '21 at 09:06
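
A minimal sketch of the "keep them in memory, and only print them at the end" suggestion. It goes one step beyond skipping repeats and merges the alias lists of repeated names, which is an assumption about what is wanted; `merge_records` is a hypothetical helper and the sample data is abbreviated from the comment above.

```python
import json

def merge_records(records):
    """Collect (name, aliases) pairs into one dict, merging repeated names."""
    merged = {}
    for name, akas in records:
        seen = merged.setdefault(name, [])
        for aka in akas:
            # Keep first-seen order and skip aliases already recorded
            if aka not in seen:
                seen.append(aka)
    return merged

# Abbreviated sample with repeated keys
records = [
    ("ABDRABBA", ["ABDRABBA", "ABDRABBA, Ghunia"]),
    ("ABDRABBA", ["ABDRABBA", "ABDRABBA, Ghoma"]),
    ("'ABDU", ["'ABDU"]),
    ("'ABDU", ["'ABDU"]),
]

merged = merge_records(records)

# Print once, at the end, as a valid JSON object with unique keys
print(json.dumps(merged, indent=2))
```

To skip repeated names entirely instead of merging, the inner loop could be replaced by a check such as `if name in merged: continue` before storing the list.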
