Using Python re.compile with Beautiful Soup to match a string

Question

I want to find a URL inside returned HTTP headers. According to the Beautiful Soup documentation, soup.find_all(re.compile("yourRegex")) collects the regex matches in a list. However, I must be missing something in my regex: it finds a match in the regex search of the text editor I am using, but it doesn't match inside the following code:

from bs4 import BeautifulSoup
import requests
import re
import csv
import json
import time
import fileinput
import urllib2

data = urllib2.urlopen("http://stackoverflow.com/questions/16627227/http-error-403-in-python-3-web-scraping").read()
soup = BeautifulSoup(data)
stringSoup = str(soup)

#Trying to use compile 
print soup.find_all(re.compile("[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?"))

I have tried wrapping the regex in () and prefixing it with r. What am I missing?

I've also been using http://www.pythonregex.com/, putting [a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])? in the regex field and a URL in the other field, but there's no match there either. Thanks!

re.match will not give any result anyway, since it only matches from the start of the string. You can try re.findall instead. – vks, Oct 23 '14 at 18:37
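The difference the comment points at can be seen with a small sketch (Python 3; the sample string and the shortened pattern are made up for illustration, they are not the question's exact regex). re.match only tries at position 0, re.search returns the first match anywhere, and re.findall returns every match:

```python
import re

# Hypothetical sample text and a deliberately simplified pattern,
# used only to show how the three functions behave.
text = 'see <a href="http://schema.org/QAPage">the page</a>'
pattern = re.compile(r"[a-zA-Z0-9.-]+\.(?:com|org|net)")

# match() is anchored at the start of the string, so it finds nothing here.
at_start = pattern.match(text)

# search() scans forward and returns the first match object (or None).
first = pattern.search(text)

# findall() returns a list with every non-overlapping match.
every = pattern.findall(text)

print(at_start)
print(first.group(0))
print(every)
```

For pulling URLs out of a whole page, findall (or finditer, if match positions are needed) is usually the one to reach for.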

score 2 · Accepted Answer · answered Oct 23 '14 at 18:42

Try this; it works for me:

import re

x = """<!DOCTYPE html>

<html itemscope itemtype="http://schema.org/QAPage">

<head>
"""

# find every URL-like substring in x
print(re.findall(r"[a-zA-Z0-9\-\.]+\.(?:com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)+(?:[\w\-\.,@?^=%&amp;:\/~\+#]*[\w\-\@?^=%&amp;\/~\+#])?", x))

Output: ['schema.org/QAPage']
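One detail worth spelling out is why the groups in the pattern changed from (...) to (?:...): when a pattern contains capturing groups, re.findall returns the groups instead of the whole match. A minimal sketch in Python 3, using a shortened pattern and a made-up string (both are illustrative assumptions, not the exact regex above):

```python
import re

# Hypothetical sample; the patterns are shortened versions for illustration.
html = '<html itemscope itemtype="http://schema.org/QAPage">'

capturing = r"[a-zA-Z0-9.-]+\.(com|org|net)(/\w+)?"
non_capturing = r"[a-zA-Z0-9.-]+\.(?:com|org|net)(?:/\w+)?"

# With capturing groups, findall yields one tuple of groups per match.
groups_only = re.findall(capturing, html)

# With non-capturing groups, findall yields the full matched text.
full_matches = re.findall(non_capturing, html)

print(groups_only)
print(full_matches)
```

If the capturing groups have to stay, re.finditer with m.group(0) is another way to get the full match.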

vks

@maudulus: `oh la la la`, you don't need to escape special characters in a character class, except `-` unless it is at the beginning or the end of the class. Writing something like `[&amp;]` makes no sense: a character class is an unordered collection of characters, so it is the same as `[;p&ma]`. – Casimir et Hippolyte Oct 23 '14 at 19:00
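The point about character classes can be checked directly. The sketch below (Python 3, with made-up test strings) shows that a class containing the entity text &amp;, as in the question's regex, is just the set of its five characters, and that punctuation such as . + ? behaves the same inside a class with or without a backslash:

```python
import re

# A character class is an unordered set of single characters, so these
# two classes accept exactly the same characters: & a m p ;
entity_class = re.compile(r"[&amp;]")
shuffled_class = re.compile(r"[;p&ma]")

sample = "a&b;m?p"  # hypothetical test string
same_matches = entity_class.findall(sample) == shuffled_class.findall(sample)

# Inside a class, . + ? need no backslash, and a hyphen is literal
# when it comes first or last.
escaped = re.compile(r"[\w\-\.\+\?]+")
unescaped = re.compile(r"[\w.+?-]+")

probe = "a-b.c+d?e f"  # hypothetical test string
same_tokens = escaped.findall(probe) == unescaped.findall(probe)

print(entity_class.findall(sample))
print(same_matches, same_tokens)
```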

Hackaholic · Answer 2 · 2014-10-23T21:49:49.440

0

Your regex is fine, but you have misunderstood what find_all does: it only searches tags.
Example:
find_all(re.compile("^b")) returns every tag whose name starts with b,
so the output will be tags such as body and b.
If you pass re.compile to find_all, it looks for the pattern only in tag names, not in the whole HTML document.
To search the document text, you need to use the method explained by vks.
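To make the distinction concrete, here is a small sketch (Python 3 with bs4 installed; the HTML snippet and the shortened URL pattern are made up for illustration). A regex passed as the first argument to find_all is tested against tag names, a keyword argument such as href= is tested against that attribute's value, and re.findall on the raw markup searches everything:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document used only for this demonstration.
html = """<html><head><title>Demo</title></head>
<body><p>Docs live at <a href="http://schema.org/QAPage">schema.org</a>.</p>
<b>bold</b></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# A regex as the first argument is matched against tag NAMES.
b_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Simplified URL-like pattern (an assumption, not the question's full regex).
url_like = re.compile(r"[a-zA-Z0-9.-]+\.(?:com|org|net)(?:/\w+)?")

# No tag is named like a URL, so this finds nothing.
no_tags = soup.find_all(url_like)

# Keyword arguments match attribute values instead.
links = [a["href"] for a in soup.find_all("a", href=url_like)]

# Searching the raw markup finds matches anywhere in the text.
anywhere = url_like.findall(html)

print(b_tags)
print(no_tags)
print(links)
print(anywhere)
```

Which of the last two fits depends on the goal: the href= filter keeps the result tied to real links, while findall on the raw string also picks up URLs that appear in plain text.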

edited Oct 23 '14 at 21:49

answered Oct 23 '14 at 21:42

