Parsing a site using Regex in Python

Question

I am trying to use regex to parse a site for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah

(there are many of these, and I want all of them in some tokenized form). The problem is that the "a  href" I want actually has TWO spaces, not just one (there are also some "a href" links with one space that I do NOT want to retrieve). That has made lxml quite a pain to use, and I do not want to use BeautifulSoup (for other reasons). Does anyone know how I might go about doing this?

Thanks!

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — JBernardo, Feb 15 '13 at 02:45
No this is different. There is difficulty in teasing out the two spaces from a href, rather than just one space. I'm also fine with it being extremely brittle, as long as it does generally what I want it to, i.e. extract out the a href where there are two spaces in between. — user1922956, Feb 15 '13 at 03:13

score 0 · Answer 1 · edited May 23 '17 at 12:08

Depending on how robust you need this to be, you can first fetch the whole tag and store it, then replace "  " (two spaces) with " " (one space) for as long as the string still contains "  ". This effectively removes any runs of multiple spaces from the string.
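
The two-pass idea described above could be sketched roughly as follows. This is only an illustration: the helper name `collapse_spaces`, the sample string, and the pattern used to fetch the tag are assumptions, not part of the original answer. Note that the loop also collapses repeated spaces inside attribute values.

```python
import re

def collapse_spaces(text):
    """Replace two spaces with one until no double space remains."""
    while "  " in text:
        text = text.replace("  ", " ")
    return text

# Hypothetical sample input modeled on the question.
html = 'blah <a  href="THIS IS WHAT I WANT"   title="NOT THIS">text</a> blah'

# First pass: fetch only the opening tags with exactly two spaces before href.
tags = re.findall(r'<a {2}href="[^"]*"[^>]*>', html)

# Second pass: normalize the whitespace inside each stored tag.
normalized = [collapse_spaces(tag) for tag in tags]
print(normalized)
```

Selecting the two-space tags before collapsing matters here: once the spaces are normalized, the one-space and two-space variants can no longer be told apart.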

Note that using regex to parse HTML is not recommended =)

answered Feb 15 '13 at 03:54

Eric

score 0 · Answer 2 · answered Feb 15 '13 at 04:31

Don't be intimidated by the answer that gets linked every time someone asks the same question as you. It seems to be treated as a page of catechism, cited semi-automatically by plenty of people. But programming is like everyday life: there is the catechism, and there is what we actually do in practice.
Personally, while I don't think HTML can be parsed in full with regex, I do think a limited analysis of certain parts of HTML can be done with regex. That's a pragmatic point of view.
And I do carry out such analyses of web pages with regex. Problems come up sometimes, but a developer can manage them. Regexes are fast. I once measured Beautiful Soup as 10 times slower than a regex, and lxml as around 50 times slower.
I'm fairly experienced at fetching web data with regexes; if you would like some hints, I can give some. My email is on my page.

A reasonable viewpoint, but you're not answering the question. — alexis, Feb 15 '13 at 22:25

Robert Harris · Accepted Answer · 2013-02-15T04:07:59.963

I believe this answers your question. It is just a couple of regular expressions that get every href that comes exactly two spaces after the 'a' of an opening tag.

import re

# Read the entire file into a single string.
with open("index.html", "r") as fh:
    rawString = fh.read()

# Match every opening 'a' tag with exactly two spaces before href.
temp = re.findall("<a  href=\".*?\"", rawString)
if temp:
    for i in range(len(temp)):  # process each match
        temp[i] = re.search("\".*?\"", temp[i]).group(0)  # remove '<a  href='
    print(temp)
else:
    print("Not found")

For your example the output is:

['"THIS IS WHAT I WANT"']
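
As a possible variation, the two steps can be folded into one pattern with a capture group, which also drops the surrounding quotes from each result. This is a minimal sketch, not the answer's own code: the names `TWO_SPACE_HREF` and `two_space_hrefs` and the sample string are made up for illustration, and it assumes the attribute value is always wrapped in double quotes.

```python
import re

# Exactly two spaces between "<a" and "href"; the group captures the
# text between the double quotes of the href value.
TWO_SPACE_HREF = re.compile(r'<a {2}href="([^"]*)"')

def two_space_hrefs(html):
    """Return the href values of anchors written with two spaces."""
    return TWO_SPACE_HREF.findall(html)

# Hypothetical sample: one two-space anchor and one one-space anchor.
sample = (
    'blah <a  href="THIS IS WHAT I WANT" title="NOT THIS">x</a> '
    'blah <a href="one space, skipped">y</a>'
)
print(two_space_hrefs(sample))  # prints ['THIS IS WHAT I WANT']
```

Because the pattern contains a single group, `findall` returns only the captured values, so no second `re.search` per match is needed.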
