
I have some scraped content that I got with `with urllib.request.urlopen(url) as response:`, and I'm trying to run a regex on it to extract some information in a `<td>...</td>`. But I can't get the regex to match past the end of the line; I think the document has newlines that are getting in the way. I've tried adding `\s` or `\r`, but it isn't working for me.

I'm trying to retrieve

The content was pretty nice and would participate again&nbsp;

using the regex:

(?<=showPollResponses\()(.*)(?=)

and here is a sample of the document:

</thead>
<tr>
<td class="oddpoll" style="width:20%"><b><a href="#" onclick="showPollResponses(123456, 99, '1A2B3C4D5E6F7G8H9I0J1K2L3M4N5O6P', 123456, 123456, 99);return false;">The stuf (i</a></b>
<br>
</td><td class="oddpoll" style="width:35%">The content was pretty nice and would participate again&nbsp;</td><td class="oddpoll" style="width:45%"><b>123 Total</b>
<br>
</td>
</tr>
<tr>
<td class="oddpoll">&nbsp;</td>

I've tried using (?<=showPollResponses\()(.*)(?=width:45%) but it's not returning anything. I was going to take that chunk of html and regex it further to extract the final text.

Here's my regex101.com

Is there not a simpler way to do this? In PHP I've used tools that scrape data with CSS selectors, so I could easily retrieve this that way. Or, in the urllib context, is regex the only way? Thanks for any help provided.

Kenny
  • Is there a reason you are not using [xml.etree.ElementTree](https://docs.python.org/3.4/library/xml.etree.elementtree.html) for html parsing? – cowbert Jul 31 '17 at 20:34
  • I'm new to Python and have never heard of it. I have a script right now that's doing Pandas parsing, but I needed to extract a quick url, and now need to extract some other text. – Kenny Jul 31 '17 at 20:35
  • Have you tried setting the `re.MULTILINE` flag in either `re.compile()` or the `re.search()` function? See https://stackoverflow.com/questions/587345/regular-expression-matching-a-multiline-block-of-text (so this question is probably a dupe of that). – cowbert Jul 31 '17 at 20:37
  • I've set that setting in regex101 but still does not return past the new line. – Kenny Jul 31 '17 at 20:39
  • It's probably an error with me. Here's what I'm doing: `with urllib.request.urlopen(url) as response:` then `soup = BeautifulSoup(response, "html.parser")` then `print(soup.select_one("a[onclick*=showPollResponses]").find_next("td").get_text())` – Kenny Jul 31 '17 at 20:56
  • In this page, there are multiple of these occurrences. Could be anywhere from 0 to a few or 5 or more. Could the existence of multiples of this selector be why it's not working? Is there a way to select them all, then access a specific one via similar to an array? [0] or [2]? – Kenny Jul 31 '17 at 21:00

2 Answers


Parsing HTML with regular expressions is a controversial thing to do and is only sometimes justified; see "RegEx match open tags except XHTML self-contained tags".

The better way would be to use a specialized tool: an HTML parser like BeautifulSoup. The idea is to locate the `a` element by a partial match on its `onclick` attribute and then get the next `td` element after that `a`:

from bs4 import BeautifulSoup

data = """
<table>
    </thead>
        <tr>
            <td class="oddpoll" style="width:20%"><b><a href="#" onclick="showPollResponses(123456, 99, '1A2B3C4D5E6F7G8H9I0J1K2L3M4N5O6P', 123456, 123456, 99);return false;">The stuf (i</a></b>
            <br>
            </td><td class="oddpoll" style="width:35%">The content was pretty nice and would participate again&nbsp;</td><td class="oddpoll" style="width:45%"><b>123 Total</b>
            <br>
            </td>
        </tr>
        <tr>
    </thead>
</table>"""

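# html.parser is Python's built-in parser, so no extra dependency is needed
soup = BeautifulSoup(data, "html.parser")

# Find the first <a> whose onclick contains "showPollResponses",
# then take the text of the next <td> that follows it in the document
print(soup.select_one("a[onclick*=showPollResponses]").find_next("td").get_text())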

Prints:

The content was pretty nice and would participate again 
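
The comments ask how to collect several of these cells and index them like an array. A minimal sketch, assuming every poll row follows the same layout as the sample: `select()` returns a list of all matching elements (empty when nothing matches), where `select_one()` returns only the first match or `None`. The two-row markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup with two poll rows, modeled on the sample in the question.
data = """
<table>
<tr>
<td><b><a href="#" onclick="showPollResponses(1, 99);return false;">Q1</a></b><br></td>
<td>First answer&nbsp;</td><td><b>10 Total</b></td>
</tr>
<tr>
<td><b><a href="#" onclick="showPollResponses(2, 99);return false;">Q2</a></b><br></td>
<td>Second answer&nbsp;</td><td><b>20 Total</b></td>
</tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")

# select() returns a list of every matching <a>; the list is empty if none match.
links = soup.select('a[onclick*="showPollResponses"]')

# For each link, take the first <td> that starts after it in document order.
# strip=True trims surrounding whitespace, including the trailing &nbsp;.
answers = [link.find_next("td").get_text(strip=True) for link in links]

print(answers)
if answers:
    print(answers[0])  # index the list like any other Python list
```

Because an empty list is a valid result, this form also avoids the `AttributeError` on `None` when a page has no matching link.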
alecxe
  • Thanks, I'll definitely check into using this vs regex for getting the content I need. Real quick, the print line of code is giving the following error: `AttributeError: 'NoneType' object has no attribute 'find_next'`. Any idea why? – Kenny Jul 31 '17 at 20:45
  • @Kenny sure, this is because it cannot locate the `a` element for some reason. Could you point me to the website you are downloading the source from? Thanks. – alecxe Jul 31 '17 at 20:47
  • Unfortunately, I can't. It's internal company info, I'm just trying to manipulate it to get things I need to put together a better report. – Kenny Jul 31 '17 at 20:50
  • @Kenny gotcha, no problem, are you sure you have this `a` element in the html source you are downloading? – alecxe Jul 31 '17 at 20:51
  • Yes, there's definitely an anchor tag there. – Kenny Jul 31 '17 at 20:52
  • @Kenny okay, do you see this tag and that `onclick` value if you would do `print(soup.find_all("a"))`? – alecxe Jul 31 '17 at 21:04
  • I can retrieve it now. I was using the `response` from `urllib.request.urlopen(url)`. I guess beautiful soup only works with `import responses`? Is there a way to retrieve multiple occurrences of this into a variable that's an array? – Kenny Jul 31 '17 at 21:32
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/150662/discussion-between-alecxe-and-kenny). – alecxe Jul 31 '17 at 21:33

Your problem is with `(.*)`. By default, `.` matches any character except a newline, so the match cannot continue past the end of the line. The way to fix this is to use `([\s\S]*)`, which matches any character including newlines. So, without modifying your regex too much: `(?<=showPollResponses\()([\S\s]*)(?=width:45%)`.
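
A small sketch of the difference, using a shortened, made-up stand-in for the sample document. It also shows `re.DOTALL`, the flag that makes `.` match newlines; `re.MULTILINE`, suggested in the comments, only changes how `^` and `$` behave and does not help here.

```python
import re

# Shortened, made-up stand-in for the sample document in the question.
html = (
    '<td><a onclick="showPollResponses(123456, 99);return false;">The stuf (i</a>\n'
    '<br>\n'
    '</td><td style="width:35%">The content was pretty nice</td>'
    '<td style="width:45%"><b>123 Total</b>\n'
)

start = r"(?<=showPollResponses\()"
end = r"(?=width:45%)"

# By default "." stops at a newline, so the match cannot reach "width:45%".
no_flag = re.search(start + r"(.*)" + end, html)

# [\s\S] matches any character, newlines included.
char_class = re.search(start + r"([\s\S]*)" + end, html)

# re.DOTALL makes "." itself match newlines; the captured text is the same.
dotall = re.search(start + r"(.*)" + end, html, re.DOTALL)

print(no_flag)                                 # None
print(char_class.group(1) == dotall.group(1))  # True
```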

Edit: Since your regex is matching past the first `width:45%`, I would make an educated guess that it occurs again later in your document. Since `([\s\S]*)` is greedy, it will match as much as it can, up to the last occurrence. To remedy this, we can add `?` to make the quantifier lazy so that it stops at the first occurrence. So now: `(?<=showPollResponses\()([\S\s]*?)(?=width:45%)`.
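
To illustrate the greedy and lazy behaviour with `re.findall`, here is a sketch on a made-up document containing two poll rows. The second pattern, which pulls the answer text out of each chunk, is an assumption about the cell layout (`width:35%`) taken from the sample, not something the original regex did.

```python
import re

# Made-up document with two poll rows, each spread over several lines.
html = (
    '<a onclick="showPollResponses(1, 99);return false;">Q1</a>\n'
    '</td><td style="width:35%">First answer</td><td style="width:45%">10 Total\n'
    '<a onclick="showPollResponses(2, 99);return false;">Q2</a>\n'
    '</td><td style="width:35%">Second answer</td><td style="width:45%">20 Total\n'
)

# Greedy: one match that runs from the first call to the LAST "width:45%".
greedy = re.findall(r"(?<=showPollResponses\()([\s\S]*)(?=width:45%)", html)

# Lazy: each match stops at the first "width:45%" that follows it.
lazy = re.findall(r"(?<=showPollResponses\()([\s\S]*?)(?=width:45%)", html)

print(len(greedy))  # 1
print(len(lazy))    # 2

# Second pass over each chunk to extract the answer cell's text.
answers = [re.search(r'width:35%">([^<]*)</td>', chunk).group(1) for chunk in lazy]
print(answers)      # ['First answer', 'Second answer']
```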

TheDetective
  • Thank you very much for your answer, this works and takes it through multiple lines, but it's not stopping for me after `(?=width:45%)`. I have a larger document, and there are about 5 matches I would like it to return. I'm currently using `scrape = re.findall(r'(?<=showPollResponses\()(.*)(?=\))', response.read().decode('utf-8'))` which return an array of 5 matches. – Kenny Jul 31 '17 at 20:49
  • Thank you, it's working properly at regex101.com. I'm now trying with it in my code using `question = re.findall(r'(?<=showPollResponses\()([\S\s]*?)(?=width:45%)', response.read().decode('utf-8'))` then `print(question)` but it seems to be empty. So I'll keep scrutinizing my code to figure out what may be wrong. – Kenny Jul 31 '17 at 21:16
  • I'm not sure if this is the problem, but try removing the `r` in your findall statement. This indicates a raw string and might not be what you need. – TheDetective Aug 01 '17 at 00:05
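
One possible cause of the empty result, offered as an assumption because the full script is not shown: both `findall` calls in the comments use `response.read()`, and a response stream can only be read once. A second `read()` on the same object returns empty bytes, so any regex run on it finds nothing. The `r` prefix is unlikely to matter for this pattern. The sketch below uses `io.BytesIO` as a stand-in for the object returned by `urllib.request.urlopen(url)`.

```python
import io
import re

# io.BytesIO stands in for the urlopen() response; both are file-like
# streams whose read() consumes the data.
response = io.BytesIO(b'<a onclick="showPollResponses(1, 99);return false;">Q1</a>')

first = response.read().decode("utf-8")
second = response.read().decode("utf-8")  # stream is already exhausted

pattern = r"(?<=showPollResponses\()(.*?)(?=\))"
print(re.findall(pattern, first))   # ['1, 99']
print(re.findall(pattern, second))  # []
```

Reading the response once into a variable and running every regex (or BeautifulSoup) against that string avoids the problem.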