Python regex HTML

Question

I am going crazy over this; I hope someone can help me.

I am trying to match this URL with a regex: https://www.reddit.com/r/spacex/?count=50&after=t3_xxxxxxx where each x is a digit or a letter.

The url is from an HTML file:

https://www.reddit.com/r/spacex/?count=25&after=t3_319905

I tried this:

re.search(r'(<a href=")(https://www.reddit.com/r/spacex/?count=25.+?)(")', subreddit).group(2)

but I keep getting AttributeError: 'NoneType' object has no attribute 'group'.

I would recommend looking into a scraper like [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/). — TigerhawkT3, Apr 08 '15 at 00:20
Yes, I know, but for this I want to use regex. I am trying to learn why my regex is not working. — BubbleTea, Apr 08 '15 at 00:22
you have plenty of characters in there that have special meanings in regular expressions ... they need escaping — Julien Spronck, Apr 08 '15 at 00:24
First extract (with Beautiful Soup, as recommended) the URLs you are interested in by using an XPath query to filter URLs that begin with `https://www.reddit.com/r/spacex/?count=25`, and then extract the part of the URL you want with a regex (or a URL parser). — Casimir et Hippolyte, Apr 08 '15 at 00:24
@user2369869 Hi there, I'm /u/EchoLogic, one of the mods of /r/SpaceX. What are you trying to accomplish? I may already have done whatever you're trying to do. — marked-down, Apr 08 '15 at 00:35
@EchoLogic It has nothing specifically to do with SpaceX; I just frequent the subreddit, so I used it as an example :p Basically I am trying to get the URL of each page of a subreddit. — BubbleTea, Apr 08 '15 at 01:46
@user2369869 No worries! Take a look at [PRAW](https://praw.readthedocs.org/en/v2.1.21/), you'll find it a lot easier to grab the page urls of a subreddit by using Reddit's API directly than scraping it with regex like this. You'll find the task much nicer to complete that way :) — marked-down, Apr 08 '15 at 02:10

score 1 · Answer 1 · edited May 23 '17 at 12:06

Use an HTML parser such as BeautifulSoup. It lets you pass a compiled regular expression to match an attribute value:

soup.find_all('a', href=re.compile(r"after=t3_\w+"))

Working example:

import re
from bs4 import BeautifulSoup
import requests

url = "https://www.reddit.com/r/spacex/?count=25&after=t3_319905"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
response = requests.get(url, headers=headers)

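# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.content, "html.parser")

# Raw string keeps \w intact; print() works on Python 2 and 3
print(soup.find_all('a', href=re.compile(r"after=t3_\w+")))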

Also see the must-provide link for regex+HTML questions:

RegEx match open tags except XHTML self-contained tags
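
If installing a third-party library is not an option, the same "parse first, then apply the regex to the attribute" idea can be sketched with the standard library alone. This swaps BeautifulSoup for `html.parser.HTMLParser`; the `LinkCollector` class and the `sample_html` string are hypothetical names made up for illustration, and the real reddit markup may look different.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect <a href> values that match a regex."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # The regex is applied to the attribute value only, not to raw HTML
        if href and self.pattern.search(href):
            self.links.append(href)


# Made-up stand-in for the downloaded page
sample_html = '''
<a href="https://www.reddit.com/r/spacex/">home</a>
<a href="https://www.reddit.com/r/spacex/?count=25&amp;after=t3_319905">next</a>
'''

collector = LinkCollector(re.compile(r"after=t3_\w+"))
collector.feed(sample_html)

# The parser has already decoded &amp; to & in the attribute value
next_url = collector.links[0]

# A URL parser pulls out the "after" token without another regex
after = parse_qs(urlsplit(next_url).query)["after"][0]
print(next_url)
print(after)
```

The regex here only has to describe the part of the URL you care about, so none of the quoting or tag structure needs to be matched by hand.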

Avinash Raj · Answer 2 · 2015-04-08T00:27:51.423

0

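? is a special character in regex: it makes the preceding token optional. In your pattern, /? therefore means "an optional slash", so the literal ? in the URL is never matched, re.search returns None, and calling .group() on None raises the error. You need to escape the ? in order to match a literal ? character. You should escape the dots too, but not the one in .+?.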

re.search(r'(<a href=")(https://www\.reddit\.com/r/spacex/\?count=25.+?)(")', subreddit).group(2)
                                                          ^
                                                          |

Extra capturing groups are unnecessary here. Just a single capturing group would be enough.

re.search(r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"', subreddit).group(1)
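
To see the difference for yourself, here is a minimal sketch that runs both patterns against a made-up anchor tag. The `subreddit` string below is only an assumed stand-in for the page source (the real markup may differ). It also shows `re.escape`, which does the escaping for you, and a `None` check before calling `.group()`.

```python
import re

# Assumed stand-in for the downloaded HTML; the real page may look different
subreddit = ('<a href="https://www.reddit.com/r/spacex/?count=25&after=t3_319905" '
             'rel="nofollow next">next</a>')

# Unescaped: "/?" means "optional slash", so the literal "?" is never matched
unescaped = re.search(
    r'<a href="(https://www.reddit.com/r/spacex/?count=25.+?)"', subreddit)

# Escaped: "\?" and "\." match the literal characters
escaped = re.search(
    r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"', subreddit)

# re.escape() escapes the special characters in a literal prefix for you
prefix = re.escape("https://www.reddit.com/r/spacex/?count=25")
auto = re.search(r'<a href="(' + prefix + r'.+?)"', subreddit)

# re.search returns None when nothing matches, so check before using .group()
next_url = escaped.group(1) if escaped else None

print(unescaped)
print(next_url)
```

Checking the result of `re.search` before calling `.group()` turns the confusing AttributeError into something you can handle explicitly.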

edited Apr 08 '15 at 00:27

answered Apr 08 '15 at 00:26


what is the difference between .+? and .+ or similarly .*? and .* – BubbleTea Apr 08 '15 at 00:35
`.*` matches any character (_except line breaks_) zero or more times. `.+` matches any character one or more times. Also see [this](http://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions) question to learn about lazy and greedy. – Avinash Raj Apr 08 '15 at 00:38
yeah but what makes it different when you add the ? after * or + – BubbleTea Apr 08 '15 at 01:38
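
On that last question: a ? placed after * or + makes the quantifier lazy, meaning it matches as few characters as possible instead of as many as possible. A small sketch on a made-up string (the `text` value is purely illustrative) shows the effect:

```python
import re

# Made-up input with two links on one line
text = '<a href="first">one</a> <a href="second">two</a>'

# Greedy: .+ grabs as much as it can, then backs off to the LAST closing quote
greedy = re.search(r'href="(.+)"', text).group(1)

# Lazy: .+? grabs as little as it can, stopping at the FIRST closing quote
lazy = re.search(r'href="(.+?)"', text).group(1)

print(greedy)
print(lazy)
```

That is why the patterns in the answer use .+? before the closing quote: with a plain .+ the captured URL could run on past the end of the href attribute.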
