I am trying to extract two numbers (NUM1 and NUM2) from webpages that all share the same format:

<div style="margin-left:0.5em;">  
  <div style="margin-bottom:0.5em;">
    NUM1 and NUM2 are always followed by the same text across webpages
  </div>
</div>

I am thinking that regex might be the way to go for these particular fields. Here's my attempt (borrowed from various sources):

def nums(self):
    nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages')
    nums_match = nums_regex.search(self)
    nums_text = nums_match.group(0)
    digits = [int(s) for s in re.findall(r'\d+', nums_text)]
    return digits

Outside of a function, this code works when I pass in the actual text (e.g., nums_regex.search(text)). However, I am modifying another person's code, and I have never really worked with classes or functions before. Here's an example of their code:

@property
def title(self):
    tag = self.soup.find('span', class_='summary')
    title = unicode(tag.string)
    return title.strip()

As you might have guessed, my code isn't working. I get the error:

nums_match = nums_regex.search(self)
TypeError: expected string or buffer

It looks like I'm not feeding in the original text correctly, but how do I fix it?

logi-kal
Matt

1 Answer


You can pass the same regular expression to BeautifulSoup to find elements by their text, and then use it again to extract the desired numbers:

import re

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())

Note that, since you are matching plain text rather than anything related to the HTML structure, I think it's perfectly fine to apply your regular expression to the complete document instead.


Complete working code samples are below.

With BeautifulSoup regex/"by text" search:

import re

from bs4 import BeautifulSoup

data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())

Regex-only search:

import re

data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
print(pattern.findall(data))  # prints [('10', '20')]
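
As for the TypeError in the question: inside a method, self is the object itself, not the page text, so nums_regex.search(self) is handed something that is not a string. A minimal sketch of how this could be wired into the class, assuming (as the title property suggests) that the class keeps its parsed document in self.soup. The class name Page and its constructor are hypothetical stand-ins for the real class:

```python
import re

PATTERN = re.compile(
    r"(\d+) and (\d+) are always followed by the same text across webpages"
)


class Page(object):  # hypothetical name; stands in for the existing class
    def __init__(self, soup):
        # Assumption: the existing class stores a BeautifulSoup object here,
        # as the `title` property in the question suggests.
        self.soup = soup

    @property
    def nums(self):
        # Search the document's text, not `self` (which is the Page object).
        match = PATTERN.search(self.soup.get_text())
        if match is None:
            return None  # pattern not present on this page
        # groups() holds the two captured digit strings; convert them to ints.
        return [int(group) for group in match.groups()]
```

With the sample HTML above, Page(BeautifulSoup(data, "html.parser")).nums should give [10, 20]. Returning None when nothing matches is a design choice for this sketch; raising an exception would work just as well.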
alecxe
  • The BeautifulSoup code works great by itself. I added self. to soup.findall to integrate it with the other code, but that's just led to a "()" output even though there should be numbers there. – Matt Feb 11 '16 at 22:29
  • @Matt well, it works for the input you've provided. Can you share the complete HTML you are parsing and the code you currently have? Thanks. – alecxe Feb 11 '16 at 23:40
  • Works! I'm not sure what I was doing wrong yesterday, but your BeautifulSoup code works when I add the self. to soup.findall today. Thanks! – Matt Feb 12 '16 at 14:56