Regex/Beautifulsoup HTML parsing

Question

Context: I have a large HTML document containing business data that I would like to extract. I've opted to use a regex, but I'm open to Beautiful Soup if anyone wants to provide BS logic that addresses the problem. Below is a snippet of the document. The document consists of a series of repeated HTML portions that follow the pattern shown. The regex targets I want to extract are values such as the transaction description (ISSUEMO) and the transaction ID (1MOI-00237). Also below is a snippet of the Python script I started in an attempt to extract the transaction descriptions, which are the first field in the snippet (ISSUEMO).

The first function scans the document for transaction descriptions and prints the index location of each match, for example:

String match "ISSUEMO" at 15102:15109

In the second function, I would like to extract and print the transaction ID, which always follows the transaction description found by the first function (here, 1MOI-00237).

HTML Snippet

<tr class="style_12" valign="top" align="left">
                                            <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
                                                    <div class="style_51" style=" text-align:left;">ISSUEMO</div>
                                                </td>
                                            <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
                                                    <div class="style_51" style=" text-align:left;">1MOI-00237</div>
                                            ...
                                            ...
                                            ...

                                            <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
                                                    <div class="style_97" style=" text-align:right;">12.86</div>
                                                </td>
                                            <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
                                                    <div class="style_98" style=" text-align:right;">-64.30</div>
                                                </td>
                                            </tr>

Python

import re

def find_transaction_desc():
    # html_doc is assumed to be a string holding the full HTML document
    regex_pattern = re.compile(r'ADJQTY|ADJCST|ISSUEPAO|TRNFLOC|RCPTMISC|ISSUEPO|TRNFPAO|RESVN|ISSUEMO|RCPTMO|ADJSCRAP|TRNFRCPT|TRNFINSP|PO|RETVEND|TRNFMRB|PHYSCNT|REQ|SO|MO|APLYPOINFO|GENPO|STDCSTTVAR')

    # Report every match together with its start and end offsets
    for match in regex_pattern.finditer(html_doc):
        start = match.start()
        end = match.end()
        print('String match "%s" at %d:%d' % (match.group(), start, end))

find_transaction_desc()

#def extract_transaction_ids():

#extract_transaction_ids()

Question: I'm not a Python expert. Can someone please provide pointers, either a new regex pattern or BS logic, for capturing and printing the IDs?

Use Beautiful Soup; you won't regret it or summon [ZA̡͊͠͝LGΌ](https://stackoverflow.com/a/1758162/4541045) — ti7, Apr 26 '21 at 22:29
If what you want to find were all inside one tag you might have a chance using regex, but.. it isn’t. Go for BS. — DisappointedByUnaccountableMod, Apr 26 '21 at 22:31
For the people suggesting I use BS can you provide some sample code to address the question please? — emalcolmb, Apr 26 '21 at 22:33
There are many questions on here about scraping using BS - just search for, err, `BeautifulSoup` maybe? — DisappointedByUnaccountableMod, Apr 27 '21 at 06:30

score 0 · Accepted Answer · answered Apr 26 '21 at 23:31

If I understand you correctly, this is how it can be done with Beautiful Soup, at least with the sample HTML in your question (it may or may not work with your actual file):

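from bs4 import BeautifulSoup as bs

soup = bs(html_doc, 'html.parser')
for item in soup.select('td'):
    # Find the cell that holds the transaction description
    if 'ISSUEMO' in item.text:
        # The transaction ID sits in the next <td> of the same row
        target = item.find_next_sibling('td')
        print(target.text.strip())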
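
The loop above handles one hard-coded description. As a rough sketch of how it might be extended to every transaction description listed in the question's regex, the following collects (description, ID) pairs. The names `TRANSACTION_CODES` and `extract_transactions` are illustrative, the second sample row is made-up data, and the code assumes each description sits alone in its own `td`, as in the snippet:

```python
from bs4 import BeautifulSoup

# Codes copied from the regex in the question.
TRANSACTION_CODES = {
    "ADJQTY", "ADJCST", "ISSUEPAO", "TRNFLOC", "RCPTMISC", "ISSUEPO",
    "TRNFPAO", "RESVN", "ISSUEMO", "RCPTMO", "ADJSCRAP", "TRNFRCPT",
    "TRNFINSP", "PO", "RETVEND", "TRNFMRB", "PHYSCNT", "REQ", "SO", "MO",
    "APLYPOINFO", "GENPO", "STDCSTTVAR",
}


def extract_transactions(html_doc):
    """Return a list of (description, transaction_id) tuples."""
    soup = BeautifulSoup(html_doc, "html.parser")
    pairs = []
    for cell in soup.find_all("td"):
        # Compare the whole cell text, so "MO" does not match "ISSUEMO".
        desc = cell.get_text(strip=True)
        if desc not in TRANSACTION_CODES:
            continue
        # Assumption: the ID is in the <td> right after the description.
        id_cell = cell.find_next_sibling("td")
        if id_cell is not None:
            pairs.append((desc, id_cell.get_text(strip=True)))
    return pairs


# Hypothetical, simplified sample; the second row is invented for illustration.
sample = """
<table>
<tr><td><div>ISSUEMO</div></td><td><div>1MOI-00237</div></td><td><div>12.86</div></td></tr>
<tr><td><div>RCPTMO</div></td><td><div>1MOR-00012</div></td><td><div>3.10</div></td></tr>
</table>
"""

for desc, trans_id in extract_transactions(sample):
    print(desc, trans_id)
```

Comparing the full cell text against a set, rather than testing with `in`, avoids short codes such as `PO` or `MO` matching inside longer ones.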

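It's actually easier to do with lxml and XPath:

import lxml.html as lh
doc = lh.fromstring(html_doc)
# A bare string predicate such as td["ISSUEMO"] is always true and selects every td,
# so compare the cell's text explicitly and take only the first following sibling
target = doc.xpath('//td[normalize-space()="ISSUEMO"]/following-sibling::td[1]')
print(target[0].text_content().strip())

or

# Same lookup, but select the div inside the following cell
target = doc.xpath('//td[normalize-space()="ISSUEMO"]/following-sibling::td[1]/div')
print(target[0].text.strip())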

In both cases, the output is

1MOI-00237
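
For the regex route the question asked about, a minimal sketch might look like the following. It assumes the markup always matches the snippet exactly (the description in a `div`, then the ID in the `div` of the next `td`) and will silently miss rows if the layout changes, which is why the comments recommend a parser. The names `CODES`, `ROW_PATTERN` and `extract_transaction_ids` are illustrative, and the code list is shortened:

```python
import re

# Shortened list of codes from the question; extend as needed.
CODES = r"ISSUEPAO|ISSUEPO|ISSUEMO|RCPTMO|RCPTMISC|ADJQTY|ADJCST"

# Group 1: the description; group 2: the text of the div in the next td.
ROW_PATTERN = re.compile(
    r"<div[^>]*>\s*(" + CODES + r")\s*</div>\s*</td>\s*"
    r"<td[^>]*>\s*<div[^>]*>\s*([^<]+?)\s*</div>"
)


def extract_transaction_ids(html_doc):
    """Return (description, transaction_id) tuples found by the regex."""
    return [(m.group(1), m.group(2)) for m in ROW_PATTERN.finditer(html_doc)]


# The first two cells of the snippet from the question.
snippet = """
<tr class="style_12" valign="top" align="left">
    <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
        <div class="style_51" style=" text-align:left;">ISSUEMO</div>
    </td>
    <td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
        <div class="style_51" style=" text-align:left;">1MOI-00237</div>
    </td>
</tr>
"""

for desc, trans_id in extract_transaction_ids(snippet):
    print(desc, trans_id)  # prints: ISSUEMO 1MOI-00237
```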
