
I have a file with several instances of rows with this structure:

<tr>
                  <td style="width:25%;">
                     <span class="results_title_text">DUNS:</span> <span class="results_body_text"> 012361296</span>
                  </td>
                  <td style="width:25%;">
                  </td>
                  <!-- label as CAGE when US Territory is listed as Country -->
                  <td style="width:27%;">
                     <span class="results_title_text">CAGE Code:</span> <span class="results_body_text">HELLO</span>
                  </td>
                  <td style="width:15%" rowspan="2">
                     <input type="button" value="View Details" title="View Details for Rascal X-Press, Inc." class="center" style="height:25px; width:90px; vertical-align:middle; margin:7px 3px 7px 3px;" onClick="viewEntry('4420848', '1472652382619')" />
                  </td>
</tr>

I want to select only those <span class="results_body_text"> elements that are preceded by <span class="results_title_text">DUNS:</span>, so in this case I would only return the span that contains 012361296 and not the one that contains HELLO.

How can I do this using a regular expression or anything else? I have tried the "starts with" regex format, but I am failing to see what string I would be parsing in that case. I eventually want to pass the regex to Python's re.compile() function.

Tendekai Muchenje
  • further reading on parsing html with regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 and https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – danyamachine Aug 31 '16 at 22:31
  • here's an article on using a proper html parser in python: http://docs.python-guide.org/en/latest/scenarios/scrape/ – danyamachine Aug 31 '16 at 22:34
  • Have you considered using BeautifulSoup to scrape the HTML tag contents? – GodSaveTheDucks Aug 31 '16 at 22:48
  • @user2825425 I am using BeautifulSoup. That html is part of my `soup` object. So my script has this line: `duns_list = soup.findAll(re.compile("this is where the regex would go"))` Do you know any other way I could accomplish this? – Tendekai Muchenje Sep 01 '16 at 18:20
  • @danyamachine if I use lxml as the parser and say `duns = tree.xpath('//span[@class="results_body_text"]/text()')`, or `duns_list = soup.findAll("span", {"class": "results_body_text"})` in BeautifulSoup, it returns the spans from both `td`s, since both have `class="results_body_text"`. So it would bring back both `012361296` and `HELLO`. I only want the first result. Both results have the same class; the only difference is the text in the first `span`, i.e. `DUNS:` vs. `CAGE Code:`. I only want the results whose first `span`'s text is `DUNS:`, so I need another qualifier. – Tendekai Muchenje Sep 01 '16 at 18:34
  • @tendekai if you know that your result will always be the first match for that selector, then you can refine your selector to only grab the first match. here's an SO question on selecting one of several matches: https://stackoverflow.com/questions/4117953/get-second-element-text-with-xpath – danyamachine Sep 01 '16 at 18:41
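
The extra qualifier asked for in the comments can be expressed in BeautifulSoup by first finding the label spans whose text is DUNS: and then stepping to the sibling value span. What follows is only a minimal sketch: it assumes bs4 is installed and that each value span is a sibling of its label span inside the same td, as in the sample row. The variable names and the trimmed sample HTML are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample row from the question (button cell omitted).
html = """
<tr>
  <td style="width:25%;">
    <span class="results_title_text">DUNS:</span> <span class="results_body_text"> 012361296</span>
  </td>
  <td style="width:25%;">
  </td>
  <td style="width:27%;">
    <span class="results_title_text">CAGE Code:</span> <span class="results_body_text">HELLO</span>
  </td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")

duns_list = []
# Find only the label spans whose text is "DUNS:" (surrounding whitespace allowed).
labels = soup.find_all(
    "span", class_="results_title_text", string=re.compile(r"^\s*DUNS:\s*$")
)
for label in labels:
    # The value is the next sibling span with the body-text class.
    value = label.find_next_sibling("span", class_="results_body_text")
    if value is not None:
        duns_list.append(value.get_text(strip=True))

print(duns_list)  # ['012361296']
```

Here the regex is only used to test the label text, and the parser handles the tag structure.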

2 Answers


Use a positive lookbehind. Because lookaheads and lookbehinds are not included in the resulting match, they are handy for matching text at a specific location.

(?<=<span class="results_title_text">\s*DUNS:\s*</span>\s*)<span class="results_body_text">\s*[^<]*\s*</span>

Python's built-in re module only accepts fixed-width lookbehinds, so the \s* inside the lookbehind makes re.compile() raise an error there (the third-party regex module accepts variable-width lookbehinds). If the lookbehind throws an error, you can just do it without one:

<span class="results_title_text">\s*DUNS:\s*</span>\s*<span class="results_body_text">\s*[^<]*\s*</span>

and then extract exactly what you want by passing the result(s) to another regex, which is a subset of the one above:

<span class="results_body_text">\s*[^<]*\s*</span>

Also, I placed \s* at points where one can put an arbitrary amount of whitespace, and [^<]* matches the span's text up to the next tag.
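
A minimal Python sketch of the no-lookbehind variant, using only the standard-library re module. Two things here are my substitutions rather than part of the answer as written: whitespace is matched with \s* (\w matches word characters, not whitespace), and a capturing group takes the place of the second regex pass. The sample string is trimmed from the question, and the variable names are illustrative.

```python
import re

# Trimmed sample from the question: one DUNS cell and one CAGE Code cell.
html = '''
<td style="width:25%;">
   <span class="results_title_text">DUNS:</span> <span class="results_body_text"> 012361296</span>
</td>
<td style="width:27%;">
   <span class="results_title_text">CAGE Code:</span> <span class="results_body_text">HELLO</span>
</td>
'''

# Match the DUNS label and the value span together, capturing only the value.
# [^<]*? stops at the next tag; the surrounding \s* trims whitespace.
duns_pattern = re.compile(
    r'<span class="results_title_text">\s*DUNS:\s*</span>\s*'
    r'<span class="results_body_text">\s*([^<]*?)\s*</span>'
)

# With one capturing group, findall returns just the captured values.
duns_list = duns_pattern.findall(html)
print(duns_list)  # ['012361296']
```

This relies on the attribute order and quoting being exactly as in the sample; any variation in the markup would need a looser pattern or an HTML parser.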

The SE I loved is dead

Using pyparsing to process HTML allows you to gloss over things like unexpected whitespace, extra/missing attributes, tags in upper or lower case. Assuming you have read your HTML source into a variable html, this pyparsing code will extract the target value:

from pyparsing import makeHTMLTags, SkipTo
span,end_span = makeHTMLTags("span")

patt = span + 'DUNS:' + end_span + span + SkipTo(end_span)("results_body") + end_span

print(patt.searchString(html)[0].results_body)

prints:

012361296
PaulMcG