
I have working code that scrapes the 'a href' tags for the URLs, and I can get the date info from the nested 'p' tags.

<div class='blah'>
    <a href='target_url'></a>
    <p class='date'>Today's date</p>
</div>

Right now my code looks like...

for p in table.find_all('p', {'class':'categoryArticle__meta'}):
    date = p.get_text()
for a in table.find_all('a', href=True)[::2][:-5]:
    headline = a['href']

I'm skipping every other href but I need every date.

How would I go about joining the search parameters so the returned info comes back paired, i.e. ('target_url', 'Today's date')?

Derek_P

3 Answers


If you scrape the divs with categoryArticle__content you can pull the links and the associated dates:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://oilprice.com/Latest-Energy-News/World-News").content, "html.parser")
main_div = soup.select_one("div.tableGrid__column.tableGrid__column--articleContent.category")

divs = main_div.select('div.categoryArticle__content')

print([(d.select_one("p.categoryArticle__meta").text, d.a["href"]) for d in divs])

The text also includes more than just the date so you will want to split on a pipe char:

[(d.select_one("p.categoryArticle__meta").text.split("|")[0].strip(), d.a["href"]) for d in divs]

Which gives you:

[(u'May 11, 2016 at 17:21', 'http://oilprice.com/Latest-Energy-News/World-News/Oil-Hits-6-Month-High-on-Crude-Inventory-Draw.html'),
 (u'May 11, 2016 at 16:56', 'http://oilprice.com/Latest-Energy-News/World-News/Nigerian-President-Lashes-Out-At-UK-Over-Stolen-Assets.html'),
 (u'May 11, 2016 at 15:41', 'http://oilprice.com/Latest-Energy-News/World-News/Germany-Ups-Gazprom-Imports-by-19-percent-in-Q1.html'),
 (u'May 11, 2016 at 15:39', 'http://oilprice.com/Latest-Energy-News/World-News/Solar-Hits-Millionth-Installation-In-The-US-Faster-Growth-Ahead.html'),
 (u'May 11, 2016 at 14:14', 'http://oilprice.com/Latest-Energy-News/World-News/OPEC-Production-Up-140000-Bpd-in-April.html'),
 (u'May 11, 2016 at 14:03', 'http://oilprice.com/Latest-Energy-News/World-News/Tullow-Ghana-Oil-Production-Down-by-More-Than-50.html'),
 (u'May 11, 2016 at 13:47', 'http://oilprice.com/Latest-Energy-News/World-News/Tesla-To-Complete-Model-3-Design-By-End-June.html'),
 (u'May 11, 2016 at 12:30', 'http://oilprice.com/Latest-Energy-News/World-News/Iraqi-Kurds-Boost-Oil-Exports-to-Turkey.html'),
 (u'May 11, 2016 at 11:57', 'http://oilprice.com/Latest-Energy-News/World-News/Security-Services-Raid-Headquarters-of-Ukraines-Largest-Gas-Company.html'),
 (u'May 11, 2016 at 10:59', 'http://oilprice.com/Latest-Energy-News/World-News/Oil-Up-3-AS-EIA-Reports-34M-Barrel-Crude-Inventory-Drop.html')]

It is always better to associate values through the parent tag when possible; pulling all the anchors and slicing is not a very robust approach.
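That parent-first pairing can be seen on a small standalone snippet (made-up URLs and dates mimicking the page structure, so it runs without hitting the network):

```python
from bs4 import BeautifulSoup

# Toy markup mirroring the live page's structure (hypothetical values)
html = """
<div class="categoryArticle__content">
  <a href="http://example.com/story-1">Story 1</a>
  <p class="categoryArticle__meta">May 11, 2016 at 17:21 | Author</p>
</div>
<div class="categoryArticle__content">
  <a href="http://example.com/story-2">Story 2</a>
  <p class="categoryArticle__meta">May 11, 2016 at 16:56 | Author</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each date stays attached to its own link because we read both
# from the same parent div
pairs = [(d.select_one("p.categoryArticle__meta").text.split("|")[0].strip(), d.a["href"])
         for d in soup.select("div.categoryArticle__content")]
print(pairs)
```

Because every (date, href) pair comes from one `div`, a missing anchor or an extra paragraph elsewhere on the page cannot shift the pairing.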

select and select_one use CSS selectors; the equivalent code using find and find_all would be:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://oilprice.com/Latest-Energy-News/World-News").content, "html.parser")
main_div = soup.find("div", class_="tableGrid__column tableGrid__column--articleContent category")
divs = main_div.find_all("div","categoryArticle__content")

print([(d.find("p", {"class": "categoryArticle__meta"}).text.split("|")[0].strip(), d.a["href"]) for d in divs])

class_=... lets you search by CSS class (class is a reserved word in Python, hence the trailing underscore).
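A quick toy illustration of the class_ keyword (made-up markup, not the live page):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='date'>May 11</p><p class='other'>x</p>", "html.parser")

# class_ filters on the CSS class attribute; the underscore avoids
# clashing with Python's reserved word "class"
print(soup.find_all("p", class_="date")[0].text)
# May 11
```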

Also, in this case categoryArticle__content only appears inside the main div, so you could search for those divs directly instead of first selecting the main div.

soup = BeautifulSoup(requests.get("http://oilprice.com/Latest-Energy-News/World-News").content, "html.parser")

divs = soup.find_all("div","categoryArticle__content")
print([(d.find("p", {"class": "categoryArticle__meta"}).text.split("|")[0].strip(), d.a["href"]) for d in divs])
Padraic Cunningham
  • are select and select_one request functions?? – Derek_P May 11 '16 at 23:28
  • @citramaillo, they are bs4 methods that use css selectors, I will add how to do it using just findall etc.. – Padraic Cunningham May 11 '16 at 23:29
  • Another option is to pair the values using a dictionary if one of the values is guaranteed to be unique. `namedtuple` is a bit better to use than a regular tuple in most cases because it is easier to access without introducing bugs. Building a simple storage class is also a good solution if you need to extend functionality; just store a small piece of the beautifulSoup object, and include methods to retrieve pertinent values. I've been working on something that may be relevant on my [github](https://github.com/Aarowaim/RingAnime) – Aaron3468 May 12 '16 at 00:28
  • 1
    @Aaron3468, it is safer to avoid a dict for that very reason, the dates are definitely not unique and accessing by url would be a arduous. Most likey the OP will be using or storing each date/href pair so a list of tuples is probably as good as any approach in this specific case – Padraic Cunningham May 12 '16 at 00:31

You can make the dates and urls into lists, and then zip them up, like so:

dates = []
urls = []
for p in table.find_all('p', {'class': 'categoryArticle__meta'}):
    date = p.get_text()
    dates.append(date)  # lists use append, not add
for a in table.find_all('a', href=True)[::2][:-5]:
    headline = a['href']  # should this be called headline?
    urls.append(headline)
easy_access = list(zip(dates, urls))


Zip turns the two lists into a list of tuples, so the output might look something like this:

easy_access = [('1/2/12', 'http://somewhere.com'), 
               ('2/2/12', 'http://somewhereelse.com'), 
               ('3/2/12', 'http://nowhere.com'), 
               ('4/2/12', 'http://here.com')]
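The zip step can be checked in isolation with sample values (note that in Python 3 zip returns an iterator, so wrap it in list() if you need a list):

```python
# Sample data standing in for the scraped values
dates = ['1/2/12', '2/2/12']
urls = ['http://somewhere.com', 'http://somewhereelse.com']

# zip pairs elements positionally; list() realizes the iterator (Python 3)
pairs = list(zip(dates, urls))
print(pairs)
# [('1/2/12', 'http://somewhere.com'), ('2/2/12', 'http://somewhereelse.com')]
```

One caveat with this approach: if the page ever has a date without a matching link (or vice versa), zip silently truncates to the shorter list and the remaining pairs shift out of alignment.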
Nevermore

I think if you're trying to use one expression to match both the value in href and the date, it will be pretty difficult (i.e. I have no idea how to do that). However, if you use XPath to navigate to the parts you want and store them, you can easily pick out what you need. Based on the code you provided, I'd recommend something like this:

from lxml import html
import requests

webpage = requests.get('http://www.myexampleurl.com')
tree = html.fromstring(webpage.content)

currentHRef = tree.xpath('//div//a')[0].get("href")
currentDate = tree.xpath('//div//p/text()')[0]

dateTarget = (currentHRef, currentDate)
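To get every pair rather than just the first, you would iterate over the parent divs and read the nested a/p relative to each one; a minimal sketch on toy markup (hypothetical URLs and dates, not the real page):

```python
from lxml import html

# Toy markup mirroring the structure in the question (hypothetical values)
page = """
<div class='blah'>
  <a href='http://example.com/a'>A</a>
  <p class='date'>May 11, 2016</p>
</div>
<div class='blah'>
  <a href='http://example.com/b'>B</a>
  <p class='date'>May 12, 2016</p>
</div>
"""
tree = html.fromstring(page)

# The leading ".//" makes each sub-query relative to its parent div,
# so every href stays paired with its own date
pairs = [(d.xpath('.//a')[0].get('href'), d.xpath('.//p/text()')[0])
         for d in tree.xpath("//div[@class='blah']")]
print(pairs)
```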
Jayson