I want to find a URL inside the returned HTTP headers. According to the Beautiful Soup documentation, soup.find_all(re.compile("yourRegex")) collects the regex matches in a list. However, I must be missing something in my regex: it finds a match in my text editor's regex search, but it doesn't match inside the following code:
from bs4 import BeautifulSoup
import requests
import re
import csv
import json
import time
import fileinput
import urllib2
data = urllib2.urlopen("http://stackoverflow.com/questions/16627227/http-error-403-in-python-3-web-scraping").read()
soup = BeautifulSoup(data)
stringSoup = str(soup)
#Trying to use compile
print soup.find_all(re.compile("[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"))
I have tried putting () around the regex, as well as prefixing the string with r, but neither made a difference. What am I missing?
I've also been testing with http://www.pythonregex.com/, putting [a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
in the regex field and a URL in the test-string field, but there's no match there either.
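For reference, here is a minimal sketch of the behaviour I'm after. It assumes the pattern is applied directly to the text with re.finditer instead of being passed to find_all (as far as I understand, find_all treats a compiled pattern as a filter on tag names rather than on the page text). The sample string and the helper name extract_urls are made up for illustration, and only the standard library is used.

```python
import re

# Same pattern as above, written as a raw string so the backslashes
# reach the regex engine unchanged.
URL_PATTERN = re.compile(
    r"[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)+"
    r"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"
)

# Hypothetical sample text standing in for str(soup).
SAMPLE = (
    'See <a href="http://stackoverflow.com/questions/16627227">this</a> '
    "and www.python.org/doc/ for details."
)


def extract_urls(text):
    """Return the full text of every match of URL_PATTERN in `text`.

    finditer plus group(0) is used because the pattern contains capture
    groups, so re.findall would return tuples of the groups rather than
    the whole match.
    """
    return [match.group(0) for match in URL_PATTERN.finditer(text)]


print(extract_urls(SAMPLE))
```

Note that the pattern has no part for the scheme, so the "http://" prefix is not included in what it matches.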
Thanks!