0

I am just getting into building regexes and have had good success overall. However, one particular case is puzzling me. I can get my desired match, but the regex is ugly and brittle.

I am matching regexes against multi-line HTML documents. Each document contains blocks that share a pattern, with one part of the pattern varying from block to block, and I need to pull specific information out of each block.

There are multiple blocks of HTML with the information I need, and they look like this:

<td headers="col0" class="OraTableCellNumber" style=";" nowrap="1"  valign="top" ><a href='/Orion/PatchDetails/process_form?patch_num=6880880&aru=13915384&release=80101000&plat_lang=226P&patch_num_id=979662&' title="View Patch Details">6880880</a></td>
<td headers="col0" class="OraTableCellText" style=";"   valign="top" ><b>Universal Installer</b>: Patch<br>OPatch 9i, 10.1</td>
<td headers="col0" class="OraTableCellText" style=";"   valign="top" >10.1.0.0.0</td>
<td headers="col0" class="OraTableCellText" style=";" nowrap="1"  valign="top" >08-JUL-2011</td>
<td headers="col0" class="OraTableCellText" style=";"   valign="top" >25M</td>
<td headers="col0" class="OraTableCellText" style=";text-align: center;"   valign="middle" width="15"><a href='javascript:showDetails("/Orion/Readme/process_form?aru=13915384&no_header=1")'><img src="/olaf/images/forms/readme.gif" valign=bottom border=0 title="View Readme" alt="View Readme"></a></td>
<td headers="col0" class="OraTableCellText" style=";text-align: center;"   valign="middle" width="15"><a href="https://updates.oracle.com/Orion/Download/process_form/p6880880_101000_Linux-x86-64.zip?aru=13915384&file_id=42098007&patch_file=p6880880_101000_Linux-x86-64.zip&"><img src="/olaf/images/forms/download.gif" valign=bottom border=0 title="Download Now" alt="Download Now"></a></td></tr>
<tr class="OraBGAccentLight" height="28" onMouseOver="javascript:setRowClass(this, 'highlight', 1);" onMouseOut="javascript:setRowClass(this, 'highlight', 0);">

I am currently working in Python and my regex is:

re.compile(r"/Orion/PatchDetails/process_form.+?release=80102000.*\n.*\n.*\n.*\n.*\n.*\n.*zip[^\"]*", re.MULTILINE)

My desired output is:

20180516140046EDT - DEBUG - ['/Orion/PatchDetails/process_form?patch_num=6880880&aru=13116068&release=80102000&plat_lang=226P&patch_num_id=979663&\' title="View Patch Details">6880880</a></td>\n<td headers="col0" class="OraTableCellText" style=";"   valign="top" ><b>Universal Installer</b>: Patch<br>OPatch 10.2</td>\n<td headers="col0" class="OraTableCellText" style=";"   valign="top" >10.2.0.0.0</td>\n<td headers="col0" class="OraTableCellText" style=";" nowrap="1"  valign="top" >18-NOV-2010</td>\n<td headers="col0" class="OraTableCellText" style=";"   valign="top" >26M</td>\n<td headers="col0" class="OraTableCellText" style=";text-align: center;"   valign="middle" width="15"><a href=\'javascript:showDetails("/Orion/Readme/process_form?aru=13116068&no_header=1")\'><img src="/olaf/images/forms/readme.gif" valign=bottom border=0 title="View Readme" alt="View Readme"></a></td>\n<td headers="col0" class="OraTableCellText" style=";text-align: center;"   valign="middle" width="15"><a href="https://updates.oracle.com/Orion/Download/process_form/p6880880_102000_Linux-x86-64.zip?aru=13116068&file_id=34545782&patch_file=p6880880_102000_Linux-x86-64.zip&']

I am pulling a list of releases and then using each one as a search criterion to pull download URLs. I would normally be open to different solutions, but I would like to keep the scope of this question to regex, since that is the tag I used. If this is a gross misuse of regex, let me know.

Can anyone help me not just optimize this, but also explain the logic behind the suggested regex?

TL;DR: I need to match a leading pattern that contains a variable (80102000 in this example), then everything after it, including any \n, until my second pattern matches.

Pattern 1: /Orion/PatchDetails/process_form.+?release=80102000, then the text in between, then pattern 2: .*zip[^\"]*

Thank you in advance!

DataDecay

4 Answers

0

The popular opinion is that parsing HTML with regular expressions is not a great idea; see https://stackoverflow.com/a/1732454/9778302
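For comparison, here is a minimal sketch of the parser-based route this answer points toward, using only the standard library's html.parser. The class name ZipLinkParser and the shortened sample markup are made up for illustration and are not part of the question; the real pages may need more careful filtering.

```python
from html.parser import HTMLParser


class ZipLinkParser(HTMLParser):
    """Hypothetical helper: collect href values of <a> tags whose path ends in .zip."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore the query string when checking the file extension.
        if href.split("?", 1)[0].endswith(".zip"):
            self.links.append(href)


# Shortened, illustrative markup modeled on the block in the question.
sample = (
    "<td><a href='/Orion/PatchDetails/process_form?patch_num=6880880'>6880880</a></td>\n"
    '<td><a href="https://updates.oracle.com/Orion/Download/process_form/'
    'p6880880_101000_Linux-x86-64.zip?aru=13915384">download</a></td>'
)

parser = ZipLinkParser()
parser.feed(sample)
print(parser.links)
```

This finds every download link but does not by itself tie a link to a release; that pairing would still need extra logic, such as tracking the most recent PatchDetails link seen.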

Armando Garza
0
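filter(None, map(lambda line: re.search(expr, line), iterable_containing_lines))

will probably do what you want. map on its own returns a map object (which is iterable) with one result per line: a match object where the regex succeeds and None where it does not. Wrapping it in filter(None, ...) keeps only the successful matches. Note that this searches one line at a time, so it cannot find a match that spans several lines.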
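A small sketch of what the line-by-line approach produces, assuming a simplified expression and three made-up lines modeled on the question's HTML (expr and lines are illustrative names, not from the original code):

```python
import re

# Simplified, illustrative pattern: capture the release number on a single line.
expr = re.compile(r"release=(\d+)")

lines = [
    "<a href='/Orion/PatchDetails/process_form?patch_num=6880880&release=80101000&'>",
    '<td class="OraTableCellText">10.1.0.0.0</td>',
    "<a href='/Orion/PatchDetails/process_form?patch_num=6880880&release=80102000&'>",
]

# map yields one result per input line: a match object, or None when nothing matched.
results = list(map(expr.search, lines))

# Keep only the successful matches and pull out the captured group.
hits = [m.group(1) for m in results if m is not None]
print(hits)
```

Because each line is searched on its own, this works for per-line data such as the release number, but not for a pattern that starts on one line and ends several lines later.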

AmphotericLewisAcid
0
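import re

# Verbose pattern: whitespace and "#" comments inside it are ignored.
regex = r"""
  Orion/PatchDetails/process_form.+?release=\d+   # block start, any release number
  (.+?)        # text in between; use this as your match
  zip[^"]*     # through the rest of the download URL, up to its closing quote
  """

# re.DOTALL lets "." match "\n"; re.MULTILINE is not needed without ^ or $.
pattern = re.compile(regex, re.DOTALL | re.VERBOSE)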

Add re.DOTALL so that . also matches \n. For your regex, this lets a single match span multiple lines.

https://regex101.com/r/jBwq20/1
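To illustrate the effect of the flag, here is a minimal sketch with a made-up three-line string (the example.invalid URL and the shortened markup are placeholders, not taken from the real page):

```python
import re

# Illustrative text: the release marker and the .zip URL sit on different lines.
text = (
    "process_form?release=80102000&\n"
    "<td>10.2.0.0.0</td>\n"
    '<a href="https://example.invalid/p6880880_102000_Linux-x86-64.zip?aru=1">'
)

pattern = r"release=80102000.*?zip"

# Without DOTALL, "." stops at the first newline, so the search fails.
without_flag = re.search(pattern, text)

# With DOTALL, "." also matches "\n", so the match can cross line boundaries.
with_flag = re.search(pattern, text, re.DOTALL)

print(without_flag)
print(with_flag.group(0))
```

The lazy quantifier .*? matters once . can cross newlines: a greedy .* would run to the last "zip" in the whole document instead of the first one after the release marker.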

c2huc2hu
0

I improved this to handle a varying number of newlines, and this version is stable and working in my code:

regex = re.compile(r'/Orion/PatchDetails/process_form.+?release=' + re.escape(patch_info['Release']) + r'.*?"(https?://.*?)"', re.DOTALL)
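A sketch of how this kind of pattern might be used per release. The function name download_url and the shortened two-block sample are hypothetical; only the shape of the pattern comes from the answer above.

```python
import re


def download_url(html, release):
    """Return the first quoted http(s) URL that follows the given release, or None."""
    pattern = re.compile(
        r"/Orion/PatchDetails/process_form.+?release=" + re.escape(release)
        + r'.*?"(https?://.*?)"',
        re.DOTALL,  # let "." cross newlines between the two anchors
    )
    match = pattern.search(html)
    return match.group(1) if match else None


# Shortened, illustrative markup with two blocks, modeled on the question.
html = (
    "<td><a href='/Orion/PatchDetails/process_form?patch_num=6880880&release=80101000&'>6880880</a></td>\n"
    "<td>10.1.0.0.0</td>\n"
    '<td><a href="https://updates.oracle.com/Orion/Download/process_form/'
    'p6880880_101000_Linux-x86-64.zip?aru=13915384&">download</a></td></tr>\n'
    "<td><a href='/Orion/PatchDetails/process_form?patch_num=6880880&release=80102000&'>6880880</a></td>\n"
    "<td>10.2.0.0.0</td>\n"
    '<td><a href="https://updates.oracle.com/Orion/Download/process_form/'
    'p6880880_102000_Linux-x86-64.zip?aru=13116068&">download</a></td></tr>\n'
)

for release in ("80101000", "80102000", "99999999"):
    print(release, download_url(html, release))
```

One caveat with this approach: the pattern takes the first quoted URL after the release marker, so if a block had no download link it would silently pick up the link from the next block.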
DataDecay