How would I extract http://www.google.com from the following string?

<div class="asdf"><a href="http://www.google.com">

Say I have a huge string containing many links, and I want to extract every URL inside the quotation marks of an a element's href attribute. How would I do that?

Remi Guan
Matt

1 Answer

You need an HTML parser. An example using BeautifulSoup:

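from bs4 import BeautifulSoup

# data holds the HTML string to search (sample value taken from the question)
data = '<div class="asdf"><a href="http://www.google.com"></a></div>'

# Name the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(data, "html.parser")
for link in soup.select("div.asdf > a[href]"):
    print(link["href"])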

This matches every a element that has an href attribute and is a direct child of a div element with the class "asdf".
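
To make the "direct child" part concrete, here is a small sketch comparing the `>` child combinator with the plain descendant combinator (a space). The markup and the second URL are made up for illustration; only the selector syntax comes from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link directly inside div.asdf, one nested deeper
html = """
<div class="asdf">
  <a href="http://www.google.com">direct child</a>
  <p><a href="http://www.example.com">nested in a paragraph</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.asdf > a[href]" only matches <a> elements that are direct children
direct = [a["href"] for a in soup.select("div.asdf > a[href]")]

# "div.asdf a[href]" (space instead of ">") matches any descendant <a>
nested = [a["href"] for a in soup.select("div.asdf a[href]")]

print(direct)
print(nested)
```

If your links are wrapped in extra elements inside the div, the descendant form is probably the one you want.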

You can also find every a element that has an href attribute, anywhere in the input document:

for link in soup.find_all("a", href=True):
    print(link["href"])

Or:

for link in soup.select("a[href]"):
    print(link["href"])
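
If installing BeautifulSoup is not an option, roughly the same thing can be sketched with the standard library's `html.parser` module. This is a swapped-in alternative, not part of the original answer, and `HrefCollector` is a name invented here for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Sample string from the question
data = '<div class="asdf"><a href="http://www.google.com">'

collector = HrefCollector()
collector.feed(data)
collector.close()
print(collector.hrefs)
```

Unlike the CSS selectors above, this sketch does not check which element contains the link; it simply gathers every href it encounters.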
Community
alecxe
  • But what if there are multiple divs? It would be one huge string with a ton of divs that contain an a[href]. – Matt Nov 07 '15 at 03:24
  • @Matt I've updated the answer and added some more generic information. Though it would be good to see your real input and the desired output. – alecxe Nov 07 '15 at 03:27
  • Gotcha, thanks! I'm currently using Scrapy's XPath selectors, so I think it might be response.xpath("//div.asdf/a/@href").extract() then? Sorry, I'm not sure if you're familiar with XPath. – Matt Nov 07 '15 at 03:34
  • @Matt in case of Scrapy, you can do it with `response.xpath("//div[contains(@class, 'asdf')]/a/@href").extract()` or a CSS selector: `response.css("div.asdf > a::attr(href)").extract()`. – alecxe Nov 07 '15 at 03:35
  • The only problem is, there are multiple listings by day in the entire response, so I only want to parse through the first day. Hence why I split the body and only deal with the first portion (the one that contains today's date). Once I get today's portion, that is when I want to parse for links. Parsing through ALL of the response's links takes forever. – Matt Nov 07 '15 at 03:35
  • Alex, see my most recent comment. That is what we currently do, and it takes forever to parse through the ENTIRE page's divs. I only want the divs for a certain section, but I don't know how to do that without turning it into a string after splitting. – Matt Nov 07 '15 at 03:36
  • @Matt as I've mentioned previously, it's difficult to tell how to help without seeing your actual html and the desired result. – alecxe Nov 07 '15 at 14:51
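
Without the real HTML this can only be a guess, but one way to limit the search to a single section without converting it back to a string is to select that section's element first and then run the link selector on it. The sketch below assumes a hypothetical layout in which each day is wrapped in a `div` with class `day`; that class name, the ids, and the URLs other than the one from the question are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page layout: one <div class="day"> per day, newest first
html = """
<div class="day" id="today">
  <div class="asdf"><a href="http://www.google.com">one</a></div>
  <div class="asdf"><a href="http://www.example.com">two</a></div>
</div>
<div class="day" id="yesterday">
  <div class="asdf"><a href="http://www.example.org">old</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None), here the first day's block
first_day = soup.select_one("div.day")

# Calling select on that element searches only its subtree
links = []
if first_day is not None:
    links = [a["href"] for a in first_day.select("div.asdf > a[href]")]

print(links)
```

Under the same assumption, the Scrapy equivalent would be to chain selectors, something like `response.css("div.day")[0].css("div.asdf > a::attr(href)").extract()`. Note that the whole document is still parsed once in either case; scoping only narrows which elements are searched for links.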