How to extract substring from string in Python?

Question

How would I extract http://www.google.com from the following string?

<div class="asdf"><a href="http://www.google.com">

Say I had a huge string containing many links and wanted to extract every URL inside the quotation marks of an a href attribute. How would I do that?

Which regex is it that matches any number of characters? Let's say I want to find all of them, it would be str.findall('a href="http://RegExHere"'). I want to match all finds for 'a href="http://...." ' — Matt, Nov 07 '15 at 03:32

score 2 · Answer 1 · edited May 23 '17 at 12:13


You need an HTML Parser. Example using BeautifulSoup:

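from bs4 import BeautifulSoup

# data holds the HTML string; naming the parser explicitly avoids a warning
soup = BeautifulSoup(data, "html.parser")
# a elements with an href that are direct children of div.asdf
for link in soup.select("div.asdf > a[href]"):
    print(link["href"])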

This matches every a element that has an href attribute and is a direct child of a div element with the class "asdf".

You can also just find all the a elements in the input document:

for link in soup.find_all("a", href=True):
    print(link["href"])

Or:

for link in soup.select("a[href]"):
    print(link["href"])
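To make the difference between these selectors concrete, here is a small self-contained sketch. The sample HTML in data is made up for illustration (the real input was never posted), and it assumes the bs4 package is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample input; the real page was not posted.
data = """
<div class="asdf"><a href="http://www.google.com">Google</a></div>
<div class="other"><a href="http://www.example.com">Example</a></div>
<div class="asdf"><span><a href="http://www.python.org">Nested</a></span></div>
<a name="anchor-without-href">No link</a>
"""

soup = BeautifulSoup(data, "html.parser")

# Only a elements with an href that are direct children of div.asdf.
direct = [a["href"] for a in soup.select("div.asdf > a[href]")]

# Every a element that carries an href attribute, anywhere in the document.
all_links = [a["href"] for a in soup.find_all("a", href=True)]

print(direct)
print(all_links)
```

The first list contains only the Google URL, because the python.org link sits inside a span and is therefore not a direct child of the div. The second list contains all three URLs and skips the a element that has no href.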


answered Nov 07 '15 at 03:23

alecxe


But what if there are multiple divs? It would be one huge string with a ton of divs that contain an a[href]. – Matt Nov 07 '15 at 03:24
@Matt I've updated the answer and added some more generic information. Though it would be good to see your real input and the desired output. – alecxe Nov 07 '15 at 03:27
Gotcha, thanks! I'm currently using Scrapy's xpaths. So I think it might be response.xpath("//div.asdf/a/@href").extract() then?? Sorry, I'm not sure if you're familiar with xpaths. – Matt Nov 07 '15 at 03:34
@Matt in case of Scrapy, you can do it with `response.xpath("//div[contains(@class, 'asdf')]/a/@href").extract()` or a CSS selector: `response.css("div.asdf > a::attr(href)").extract()`. – alecxe Nov 07 '15 at 03:35
The only problem is, there are multiple listings by day in the entire response, so I only want to parse through the first day. Hence why I split the body and only deal with the first portion (the one that contains today's date). Once I get today's portion, that is when I want to parse for links. Parsing through ALL of the response's links takes forever. – Matt Nov 07 '15 at 03:35
Alex, see my most recent comment. That is what we currently do, and it takes forever to parse through the ENTIRE page's divs. I only want the divs for a certain section, but I don't know how to do that without turning it into a string after splitting. – Matt Nov 07 '15 at 03:36
@Matt as I've mentioned previously, it's difficult to tell how to help without seeing your actual html and the desired result. – alecxe Nov 07 '15 at 14:51
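For the case raised in the comments, where only the first portion of a large page matters, one option is to cut the string first and then run an HTML parser over just that fragment. Below is a minimal sketch using the standard library's html.parser. The marker string, the sample page, and the names HrefCollector and links_before_marker are all hypothetical, since the actual HTML was never posted:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every a tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def links_before_marker(html, marker):
    """Return the hrefs found before the first occurrence of marker."""
    # Keep only the text in front of the marker (hypothetical section boundary).
    first_part = html.split(marker, 1)[0]
    parser = HrefCollector()
    parser.feed(first_part)
    parser.close()
    return parser.hrefs


# Hypothetical page with two day sections separated by a marker comment.
page = (
    '<div class="asdf"><a href="http://www.google.com">today</a></div>'
    "<!-- next day -->"
    '<div class="asdf"><a href="http://www.example.com">yesterday</a></div>'
)

print(links_before_marker(page, "<!-- next day -->"))
```

Only the fragment in front of the marker is parsed, so the cost scales with the size of that section and not with the whole page. The same fragment could equally be passed to BeautifulSoup.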
