I am going crazy over this; I hope someone can help me.

I am trying to match this URL with a regex: https://www.reddit.com/r/spacex/?count=50&after=t3_xxxxxxx where each x is a number or a letter.

The URL comes from an HTML file:

https://www.reddit.com/r/spacex/?count=25&after=t3_319905

I tried this:

re.search(r'(<a href=")(https://www.reddit.com/r/spacex/?count=25.+?)(")', subreddit).group(2)

but I keep getting: 'NoneType' object has no attribute 'group'.

BubbleTea
  • I would recommend looking into a scraper like [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/). – TigerhawkT3 Apr 08 '15 at 00:20
  • Yes, I know, but for this I want to use regex. I am trying to learn why my regex is not working. – BubbleTea Apr 08 '15 at 00:22
  • you have plenty of characters in there that have special meanings in regular expressions ... they need escaping – Julien Spronck Apr 08 '15 at 00:24
  • First extract (with Beautiful Soup, as recommended) the URLs you are interested in, using an XPath query to keep those that begin with `https://www.reddit.com/r/spacex/?count=25`, and then extract the part of the URL you want with a regex (or a URL parser). – Casimir et Hippolyte Apr 08 '15 at 00:24
  • @user2369869 Hi there, I'm /u/EchoLogic, one of the mods of /r/SpaceX. What are you trying to accomplish? I may already have done whatever you're trying to do. – marked-down Apr 08 '15 at 00:35
  • @EchoLogic It has nothing specifically to do with SpaceX; I just frequent the subreddit, so I used it as an example :p Basically I am trying to get the URL of each page of a subreddit. – BubbleTea Apr 08 '15 at 01:46
  • @user2369869 No worries! Take a look at [PRAW](https://praw.readthedocs.org/en/v2.1.21/), you'll find it a lot easier to grab the page urls of a subreddit by using Reddit's API directly than scraping it with regex like this. You'll find the task much nicer to complete that way :) – marked-down Apr 08 '15 at 02:10
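The URL-parser route mentioned in the comments can be sketched with the standard library alone. This is only an illustration: it assumes the full URL has already been extracted from the page, and the variable names are made up for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Sample URL taken from the question.
url = "https://www.reddit.com/r/spacex/?count=25&after=t3_319905"

# parse_qs maps each query parameter to a list of its values.
query = parse_qs(urlsplit(url).query)

# Fall back to None when a parameter is missing.
after = query.get("after", [None])[0]
count = query.get("count", [None])[0]

print(after, count)
```

No escaping is needed here, because the query string is parsed rather than pattern-matched.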

2 Answers

Use an HTML parser, like BeautifulSoup. It lets you pass a compiled regular expression to match an attribute value:

soup.find_all('a', href=re.compile(r"after=t3_\w+"))

Working example:

import re

import requests
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/spacex/?count=25&after=t3_319905"
# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
response = requests.get(url, headers=headers)

# Name the parser explicitly instead of letting bs4 pick one.
soup = BeautifulSoup(response.content, "html.parser")

# Raw string so that \w reaches the regex engine unchanged.
print(soup.find_all('a', href=re.compile(r"after=t3_\w+")))
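For trying the same idea without a network request or a third-party package, here is a rough sketch that swaps in the standard library's `html.parser` instead of BeautifulSoup. The `LinkCollector` class and the sample markup are hypothetical, made up for this illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern.search(href):
            self.links.append(href)


# Hypothetical stand-in for the downloaded page source.
page = ('<a href="/r/spacex/about">about</a>'
        '<a href="https://www.reddit.com/r/spacex/?count=25&amp;after=t3_319905">next</a>')

collector = LinkCollector(r"after=t3_\w+")
collector.feed(page)
print(collector.links)
```

The parser unescapes `&amp;` in attribute values, so the collected link contains a plain `&`.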

Also see the must-provide link for regex+HTML questions:

alecxe

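`?` is a special character in regex: it makes the preceding token optional. In your pattern, `/?` therefore means "an optional slash", after which the regex expects `count` immediately, but the text has a literal `?` there, so `re.search` returns `None`. Escape the `?` to match a literal question mark. The dots should be escaped too, except the one in `.+?`.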

re.search(r'(<a href=")(https://www\.reddit\.com/r/spacex/\?count=25.+?)(")', subreddit).group(2)
                                                          ^
                                                          |

Extra capturing groups are unnecessary here. Just a single capturing group would be enough.

re.search(r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"', subreddit).group(1)
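To see the effect of the escaping side by side, here is a small sketch. The `subreddit` string is a hypothetical one-line stand-in for the page source, and the `None` check is an addition for safety rather than part of the pattern fix.

```python
import re

# Hypothetical stand-in for the downloaded page source.
subreddit = '<a href="https://www.reddit.com/r/spacex/?count=25&after=t3_319905" rel="nofollow next">next</a>'

unescaped = r'<a href="(https://www.reddit.com/r/spacex/?count=25.+?)"'
escaped = r'<a href="(https://www\.reddit\.com/r/spacex/\?count=25.+?)"'

# "/?" is read as "optional slash", so "count" is expected right after it,
# but the text has a literal "?" at that position: no match.
print(re.search(unescaped, subreddit))  # None

match = re.search(escaped, subreddit)
# Check for None before calling .group() to avoid the AttributeError.
next_url = match.group(1) if match else None
print(next_url)
```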
Avinash Raj
  • What is the difference between `.+?` and `.+`, or similarly between `.*?` and `.*`? – BubbleTea Apr 08 '15 at 00:35
  • `.*` matches any character (_except line breaks_) zero or more times. `.+` matches any character one or more times. Also see [this](http://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions) question to learn about lazy and greedy. – Avinash Raj Apr 08 '15 at 00:38
  • Yeah, but what makes it different when you add the `?` after `*` or `+`? – BubbleTea Apr 08 '15 at 01:38
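As a quick illustration of the last question: a `?` placed after `+` or `*` makes the quantifier lazy, so it stops at the first point where the rest of the pattern can match, while the plain (greedy) form takes as much as it can. The sample text below is made up for the demonstration.

```python
import re

# Hypothetical text with two quoted values on one line.
text = '<a href="first" title="second">'

# Greedy: .+ runs to the last closing quote on the line.
greedy = re.search(r'"(.+)"', text).group(1)

# Lazy: .+? stops at the first closing quote.
lazy = re.search(r'"(.+?)"', text).group(1)

print(greedy)  # first" title="second
print(lazy)    # first
```

This is why the answer's pattern uses `.+?` before the closing `"`: the greedy form could run past the end of the `href` value into later attributes.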