Extracting a line of text using BeautifulSoup

Question

I am trying to extract two numbers (NUM1 and NUM2) from webpages that all share the same format:

<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    NUM1 and NUM2 are always followed by the same text across webpages
  </div>
</div>

I am thinking that regex might be the way to go for these particular fields. Here's my attempt (borrowed from various sources):

def nums(self):
    nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages')
    nums_match = nums_regex.search(self)
    nums_text = nums_match.group(0)
    digits = [int(s) for s in re.findall(r'\d+', nums_text)]
    return digits

On its own, outside of a function, this code works when I pass it the actual text to search (e.g., nums_regex.search(text)). However, I am modifying another person's code, and I have never really worked with classes or functions before. Here's an example of their code:

@property
def title(self):
    tag = self.soup.find('span', class_='summary')
    title = unicode(tag.string)
    return title.strip()

As you might have guessed, my code isn't working. I get the error:

nums_match = nums_regex.search(self)
TypeError: expected string or buffer

It looks like I'm not feeding in the original text correctly, but how do I fix it?

[I've heard this one before...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — apex-meme-lord, Feb 11 '16 at 21:47

alecxe · Accepted Answer · 2016-02-11T21:50:35.863

You can use the same regular expression pattern twice: first to locate the matching element by its text with BeautifulSoup, and then to extract the desired numbers from it:

import re

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())

Note that, since you are trying to match a part of text and not anything HTML structure related, I think it's pretty much okay to just apply your regular expression to the complete document instead.

Complete working code samples are below.

With BeautifulSoup regex/"by text" search:

import re

from bs4 import BeautifulSoup

data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())

Regex-only search:

import re

data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
print(pattern.findall(data))  # prints [('10', '20')]
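
To tie this back to the class in the question: the TypeError most likely comes from passing self (the object itself) to search(), which expects a string. Below is a minimal sketch of a property that searches a string held by the instance instead. The Page class and its html attribute are hypothetical stand-ins, not the code being modified.

```python
import re

# Same pattern as above; the two groups capture NUM1 and NUM2.
PATTERN = re.compile(
    r"(\d+) and (\d+) are always followed by the same text across webpages"
)


class Page(object):
    """Hypothetical stand-in for the class being modified."""

    def __init__(self, html):
        # Assumption: the raw page source is kept as a string attribute.
        self.html = html

    @property
    def nums(self):
        # Search the string held by the instance, not the instance itself.
        match = PATTERN.search(self.html)
        if match is None:
            return None  # the sentence does not appear on this page
        return [int(group) for group in match.groups()]


data = """<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
  </div>
</div>
"""

print(Page(data).nums)  # prints [10, 20]
```

If the class only keeps a parsed document, as self.soup in the title property suggests, self.soup.get_text() should give a string that can be searched the same way.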

The BeautifulSoup code works great by itself. I added self. to soup.findall to integrate it with the other code, but that's just led to a "()" output even though there should be numbers there. — Matt, Feb 11 '16 at 22:29
@Matt well, it works for the input you've provided. Can you share the complete HTML you are parsing and the code you currently have? Thanks. — alecxe, Feb 11 '16 at 23:40
Works! I'm not sure what I was doing wrong yesterday, but your BeautifulSoup code works when I add the self. to soup.findall today. Thanks! — Matt, Feb 12 '16 at 14:56
