Context: I have a large HTML document containing business data that I would like to extract. I've opted to use a regex, but I'm open to BeautifulSoup if anyone wants to provide BeautifulSoup logic for the problem. Below is a snippet of the document, which consists of a series of repeated HTML blocks following the pattern shown. The regex targets I want to extract are in bold. Also below is a snippet of the Python script I started in order to extract the transaction descriptions; the description is the first field in the snippet (ISSUEMO).
The first function scans the document for transaction descriptions and prints the index location of each match, for example:
String match "ISSUEMO" at 15102:15109
In the second function, I would like to extract and print the transaction IDs. An ID always follows the transaction description found by the first function (here, 1MOI-00237).
HTML Snippet
<tr class="style_12" valign="top" align="left">
<td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
<div class="style_51" style=" text-align:left;">ISSUEMO</div>
</td>
<td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
<div class="style_51" style=" text-align:left;">1MOI-00237</div>
...
...
...
<td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
<div class="style_97" style=" text-align:right;">12.86</div>
</td>
<td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
<div class="style_98" style=" text-align:right;">-64.30</div>
</td>
</tr>
Python
import re

# html_doc is assumed to already hold the HTML document as a string.

def find_transaction_desc():
    # Print every transaction description found, with its start:end index.
    regex_pattern = re.compile(r'ADJQTY|ADJCST|ISSUEPAO|TRNFLOC|RCPTMISC|ISSUEPO|TRNFPAO|RESVN|ISSUEMO|RCPTMO|ADJSCRAP|TRNFRCPT|TRNFINSP|PO|RETVEND|TRNFMRB|PHYSCNT|REQ|SO|MO|APLYPOINFO|GENPO|STDCSTTVAR')
    for match in regex_pattern.finditer(html_doc):
        start = match.start()
        end = match.end()
        print('String match "%s" at %d:%d' % (html_doc[start:end], start, end))

find_transaction_desc()

#def extract_transaction_ids():
#extract_transaction_ids()
Question: I'm not a Python expert. Could someone offer pointers, a new regex pattern, or BeautifulSoup logic for capturing and printing the IDs?
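For illustration, here is one possible regex-only sketch of the second function. It is not a definitive answer. It assumes that every description and ID sits in its own single `<div>`, and that the ID cell is the very next `<td>` after the description cell, as in the snippet above. The names `DESCRIPTIONS`, `ROW_PATTERN`, and `sample` are made up for this example, and `extract_transaction_ids` takes the document as an argument here rather than reading a global. Note that the unanchored pattern in the question would also find `MO` inside `1MOI-00237`; anchoring the description between `<div ...>` and `</div>` avoids that.

```python
import re

# Transaction descriptions copied from the pattern in the question.
DESCRIPTIONS = [
    "ADJQTY", "ADJCST", "ISSUEPAO", "TRNFLOC", "RCPTMISC", "ISSUEPO",
    "TRNFPAO", "RESVN", "ISSUEMO", "RCPTMO", "ADJSCRAP", "TRNFRCPT",
    "TRNFINSP", "PO", "RETVEND", "TRNFMRB", "PHYSCNT", "REQ", "SO", "MO",
    "APLYPOINFO", "GENPO", "STDCSTTVAR",
]

# Try longer names first so that e.g. ISSUEPO is preferred over PO.
_alternation = "|".join(
    sorted((re.escape(d) for d in DESCRIPTIONS), key=len, reverse=True)
)

# Assumed layout: <div>DESC</div></td><td ...><div>ID</div>
ROW_PATTERN = re.compile(
    r"<div[^>]*>\s*(?P<desc>" + _alternation + r")\s*</div>"  # description cell
    r"\s*</td>\s*<td[^>]*>\s*"                                # boundary to next cell
    r"<div[^>]*>\s*(?P<txn_id>[^<]+?)\s*</div>"               # transaction ID cell
)


def extract_transaction_ids(html_doc):
    """Return (description, transaction_id) pairs found in html_doc."""
    return [
        (m.group("desc"), m.group("txn_id"))
        for m in ROW_PATTERN.finditer(html_doc)
    ]


# Small sample based on the snippet in the question.
sample = """<tr class="style_12" valign="top" align="left">
<td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
<div class="style_51" style=" text-align:left;">ISSUEMO</div>
</td>
<td style=" overflow:hidden; border-bottom: 1px solid rgb(222, 219, 239);">
<div class="style_51" style=" text-align:left;">1MOI-00237</div>
</td>
</tr>"""

for desc, txn_id in extract_transaction_ids(sample):
    print(desc, txn_id)  # prints: ISSUEMO 1MOI-00237
```

If the real document nests extra tags between the two cells, the pattern would need loosening, and at that point an HTML parser such as BeautifulSoup is probably the more robust choice.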