In the book Web Scraping with Python by Ryan Mitchell, the author uses re.compile(). Can anyone explain what re.compile() is used for in this case, and what the pattern passed to it means?

The code is written in Python 3:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Relative URLs of every page seen so far
pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only <a> tags whose href starts with /wiki/
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                # Recurse into the newly found page
                getLinks(newPage)

getLinks("")
Taku
johnson lai

3 Answers

re.compile() creates a compiled regex object. BeautifulSoup's findAll method checks whether you passed a compiled regex or a plain string. For a plain string it can use a simple string comparison and skip regex matching altogether (regexes are fairly CPU-intensive operations).

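In this case it is used on the href attribute to find <a> tags whose href starts with /wiki/. The ^ anchors the pattern to the start of the value; BeautifulSoup applies a compiled pattern with its search() method, so without the ^ it would match /wiki/ anywhere inside the href. If you passed a plain string instead, it would have to match the entire href value.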
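To make the difference concrete, here is a minimal sketch that uses only the re module to mimic the two kinds of comparison. The sample hrefs and the helper names are made up for illustration, and the helpers only approximate BeautifulSoup's documented behaviour (a compiled pattern is applied with search(), a plain string is compared for equality); they are not its internals.

```python
import re

# Hypothetical href values of the kind found on a Wikipedia page.
hrefs = [
    "/wiki/Python",
    "/w/index.php?title=/wiki/Python",
    "//example.org/wiki/X",
    "#top",
]

wiki_link = re.compile("^(/wiki/)")

def matches_regex(href):
    # A compiled pattern is applied with search(); the ^ pins it to the start.
    return wiki_link.search(href) is not None

def matches_string(href, wanted="/wiki/"):
    # A plain string filter has to equal the whole attribute value.
    return href == wanted

regex_hits = [h for h in hrefs if matches_regex(h)]
string_hits = [h for h in hrefs if matches_string(h)]

print(regex_hits)   # ['/wiki/Python']
print(string_hits)  # []
```

Only the href that begins with /wiki/ passes the regex filter, and none of them equals the bare string "/wiki/".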

Another example of its use would be on the first argument, the tag name. Take the regex '^t[dh]$', which you might use to look for td or th tags. If you passed it as a plain string rather than a compiled regex, findAll would literally look for <^t[dh]$> tags.
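The tag-name case can be sketched the same way with plain re. The list of tag names is invented for the example, and the equality test stands in for what happens when the pattern is passed as an ordinary string.

```python
import re

# Hypothetical tag names that might appear in an HTML table.
tag_names = ["td", "th", "tr", "table", "tbody"]

cell_tag = re.compile("^t[dh]$")

# Compiled pattern: matches exactly "td" or "th".
regex_hits = [name for name in tag_names if cell_tag.search(name)]

# Plain string: the pattern text is treated as a literal tag name.
literal_hits = [name for name in tag_names if name == "^t[dh]$"]

print(regex_hits)    # ['td', 'th']
print(literal_hits)  # []
```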

See Docs for findAll method

Why would you use re.compile in general?

As other answers say, this "compiles" a regex. Until you call something like re.match, your regex is just a string. re has to convert it into a pattern object before it can use it, and it will do so if you pass a string, but this takes some CPU time.

If you are going to use the regex more than once, in a loop for example, converting it each time would use more CPU than doing it just once, so compiling it before the loop and reusing the result will be faster.

In reality, re does this behind the scenes for you and caches the converted objects, but the cache lookup itself adds a small amount of work, so it could still take longer than compiling the pattern yourself.
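As a rough sketch of the two styles, with made-up sample strings: both give the same result, and the only difference is whether the pattern is compiled once up front or looked up by re on every call.

```python
import re

# Hypothetical strings to filter.
lines = ["/wiki/Python", "/static/logo.png", "/wiki/Regex", "#cite_note-1"]

# Compile once, before the loop, and reuse the pattern object.
wiki_link = re.compile("^(/wiki/)")
precompiled = [s for s in lines if wiki_link.search(s)]

# Pass the pattern as a string each time; re compiles it on first use
# and finds it in its internal cache on later calls.
uncompiled = [s for s in lines if re.search("^(/wiki/)", s)]

print(precompiled)                # ['/wiki/Python', '/wiki/Regex']
print(precompiled == uncompiled)  # True
```

How much time the precompiled version saves depends on the pattern and the loop, so it is worth measuring (for example with timeit) before relying on it.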

Theo
  • What would happen if I pass in something like `BMW` what would be the output? – johnson lai Feb 26 '17 at 04:19
  • @johnsonlai How do you mean pass it? pass re.compile this string? or as HTML to BeautifulSoup? The regex is looking for a tags with the text "/wiki/" inside. So it would match this tag if that's what you mean. – Theo Mar 04 '17 at 11:54
  • @johnsonlai I would be interested in what changed, and why you unaccepted this answer? I would like to improve if you feel any part of this is wrong. – Theo Mar 06 '17 at 19:32

re.compile() compiles a regex string into a regex object.

For example,

line = re.compile("line")
print(line)

result: re.compile('line')
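The returned object is more than a printable wrapper. A small sketch, using an arbitrary sample sentence, of what it carries and what it can do:

```python
import re

line = re.compile("line")

# The original pattern text is kept on the object.
print(line.pattern)  # line

# The object has its own search() method, which returns a match object.
m = line.search("first line of text")
print(m.group(), m.span())  # line (6, 10)

# search() returns None when the pattern is not found.
print(line.search("nothing here"))  # None
```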

Here are a few good references on regex: 1. Python 3 re library 2. Python re.compile

Community
superoo7

That statement finds anchors whose href matches the regex compiled by re.compile(). (If you need to know more about regex, go here.)

Rafael Barros