Using regex to select html content

Question

I have a file with several instances of rows with this structure:

<tr>
                  <td style="width:25%;">
                     <span class="results_title_text">DUNS:</span> <span class="results_body_text"> 012361296</span>
                  </td>
                  <td style="width:25%;">
                  </td>
                  <!-- label as CAGE when US Territory is listed as Country -->
                  <td style="width:27%;">
                     <span class="results_title_text">CAGE Code:</span> <span class="results_body_text">HELLO</span>
                  </td>
                  <td style="width:15%" rowspan="2">
                     <input type="button" value="View Details" title="View Details for Rascal X-Press, Inc." class="center" style="height:25px; width:90px; vertical-align:middle; margin:7px 3px 7px 3px;" onClick="viewEntry('4420848', '1472652382619')" />
                  </td>
</tr>

I want to select only those <span class="results_body_text"> elements that are preceded by <span class="results_title_text">DUNS:</span>, so in this case I would return only the span that contains 012361296 and not the one that contains HELLO.

How can I do this using a regular expression or anything else? I have tried the "starts with" regex format, but I fail to see what string I would be matching against in that case. I eventually want to pass the regex to re.compile() in Python.

further reading on parsing html with regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 and https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la — danyamachine, Aug 31 '16 at 22:31
here's an article on using a proper html parser in python: http://docs.python-guide.org/en/latest/scenarios/scrape/ — danyamachine, Aug 31 '16 at 22:34
Have you considered using BeautifulSoup to scrape the HTML tag contents? — GodSaveTheDucks, Aug 31 '16 at 22:48
@user2825425 I am using BeautifulSoup. That html is part of my `soup` object. So my script has this line: `duns_list = soup.findAll(re.compile("this is where the regex would go"))` Do you know any other way I could accomplish this? — Tendekai Muchenje, Sep 01 '16 at 18:20
@danyamachine if I use lxml as the parser and say `duns = tree.xpath('//span[@class="results_body_text"]/text()')`, or `duns_list = soup.findAll("span", {"class": "results_body_text"})` in BeautifulSoup, it returns the spans from both `td` cells, since both have `class="results_body_text"`. So it would bring back both `012361296` and `HELLO`. I only want the first result. Both results have the same classes; the only difference is the text in the first `span` of each cell, i.e. `DUNS:` vs. `CAGE Code:`. I only want the results whose first `span`'s text is `DUNS:`, so I need another qualifier. — Tendekai Muchenje, Sep 01 '16 at 18:34
@tendekai if you know that your result will always be the first match for that selector, then you can refine your selector to only grab the first match. here's an SO question on selecting one of several matches: https://stackoverflow.com/questions/4117953/get-second-element-text-with-xpath — danyamachine, Sep 01 '16 at 18:41
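Since the HTML is already in a `soup` object, one possible way to add the extra qualifier is to match the label span first and then step to its sibling. The sketch below assumes Beautiful Soup 4.4 or later (for the `string` argument) and uses an abbreviated copy of the row from the question; the names `html` and `duns_list` are only illustrative.

```python
import re
from bs4 import BeautifulSoup

# Abbreviated copy of the row from the question (assumed sample data).
html = """
<tr>
  <td style="width:25%;">
    <span class="results_title_text">DUNS:</span> <span class="results_body_text"> 012361296</span>
  </td>
  <td style="width:27%;">
    <span class="results_title_text">CAGE Code:</span> <span class="results_body_text">HELLO</span>
  </td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")

duns_list = []
# Find every label span whose text is "DUNS:" (ignoring surrounding whitespace).
for label in soup.find_all("span", class_="results_title_text",
                           string=re.compile(r"^\s*DUNS:\s*$")):
    # The value sits in the next sibling span that carries the body class.
    body = label.find_next_sibling("span", class_="results_body_text")
    if body is not None:
        duns_list.append(body.get_text(strip=True))

print(duns_list)
```

Here the regex is applied only to the label text, not to the markup, so attribute order and whitespace between the tags should not matter.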

The SE I loved is dead · Answer 1 · 2016-09-01T19:54:29.770

0

Use a positive lookbehind. Because lookaheads and lookbehinds are not included in the resulting match, they come in very handy for matching text at specific locations.

(?<=<span class="results_title_text">\s*DUNS:\s*</span>\s*)<span class="results_body_text">\s*[\u0000-\uFFFF]*?\s*</span>

If the lookbehind throws an error (Python's re module, for instance, only accepts fixed-width lookbehinds), you can do it without one:

<span class="results_title_text">\s*DUNS:\s*</span>\s*<span class="results_body_text">\s*[\u0000-\uFFFF]*?\s*</span>

and then extract exactly what you want by passing the result(s) to another regex, which is basically a subset of the above regex:

<span class="results_body_text">\s*[\u0000-\uFFFF]*?\s*</span>

I placed \s* at the points where an arbitrary amount of whitespace may appear, and made [\u0000-\uFFFF]* lazy (*?) so that the match stops at the first closing </span> rather than the last one in the file.
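In Python, the two-step approach can probably be collapsed into a single pattern with a capturing group, which avoids the lookbehind entirely. This is a minimal sketch, not a robust HTML parser: it assumes the attribute order and quoting are exactly as in the question, and the names `html`, `DUNS_RE` and `duns_list` are only illustrative.

```python
import re

# Abbreviated copy of the row from the question (assumed sample data).
html = '''
<tr>
  <td style="width:25%;">
     <span class="results_title_text">DUNS:</span> <span class="results_body_text"> 012361296</span>
  </td>
  <td style="width:27%;">
     <span class="results_title_text">CAGE Code:</span> <span class="results_body_text">HELLO</span>
  </td>
</tr>
'''

# No lookbehind: match the DUNS label span, then capture the body span's text.
DUNS_RE = re.compile(
    r'<span class="results_title_text">\s*DUNS:\s*</span>\s*'
    r'<span class="results_body_text">\s*(.*?)\s*</span>',
    re.DOTALL,  # let . cross line breaks inside the span body
)

# With exactly one group, findall returns the captured strings directly.
duns_list = DUNS_RE.findall(html)
print(duns_list)
```

Because only the group is returned, no second regex is needed to strip the surrounding tags.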

edited Sep 01 '16 at 19:54

answered Aug 31 '16 at 23:13


I don't want just the `012361296` result. Like I said, there are multiple rows with the same structure. But the text content of the span itself can be different from `012361296` – Tendekai Muchenje Sep 01 '16 at 18:16

@TendekaiMuchenje [Fixed](http://stackoverflow.com/revisions/39260413/2) – The SE I loved is dead Sep 01 '16 at 18:34
Note that you can replace `[\u0000-\uFFFF]*` with `[\x00-\xFF]*` in environments where Unicode isn't supported. – The SE I loved is dead Sep 01 '16 at 18:35
I tried your solution here: [link](https://repl.it/DIrJ) but it's throwing back an error `error: look-behind requires fixed-width pattern` – Tendekai Muchenje Sep 01 '16 at 18:55
@TendekaiMuchenje [What about two regexes, then?](http://stackoverflow.com/revisions/39260413/3) – The SE I loved is dead Sep 01 '16 at 19:54

score 0 · Answer 2 · answered Sep 01 '16 at 07:35

Using pyparsing to process HTML allows you to gloss over things like unexpected whitespace, extra/missing attributes, tags in upper or lower case. Assuming you have read your HTML source into a variable html, this pyparsing code will extract the target value:

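from pyparsing import makeHTMLTags, SkipTo

# makeHTMLTags returns expressions for the opening and the closing tag
span, end_span = makeHTMLTags("span")

# a span containing 'DUNS:' followed by a span whose body is captured as results_body
patt = span + 'DUNS:' + end_span + span + SkipTo(end_span)("results_body") + end_span

# searchString scans html for matches; take the first one
print(patt.searchString(html)[0].results_body)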

prints:

012361296
