
I have working code that scrapes the 'a href' tags for the URLs, and I can get the date info from the nested 'p' tags.

<div class='blah'>
    <a href='target_url'></a>
    <p class='date'>Today's date</p>
</div>

Right now my code looks like...

for p in table.find_all('p', {'class':'categoryArticle__meta'}):
    date = p.get_text()
for a in table.find_all('a', href=True)[::2][:-5]:
    headline = a['href']

I'm skipping every other href but I need every date.

How would I go about joining the search parameters so the returned info comes back paired, i.e. ('target_url', 'Today's date')?

Derek_P

3 Answers


If you scrape the divs with categoryArticle__content you can pull the links and the associated dates:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://oilprice.com/Latest-Energy-News/World-News").content, "html.parser")
main_div = soup.select_one("div.tableGrid__column.tableGrid__column--articleContent.category")

divs = main_div.select('div.categoryArticle__content')

print([(d.select_one("p.categoryArticle__meta").text, d.a["href"]) for d in divs])

The text also includes more than just the date so you will want to split on a pipe char:

[(d.select_one("p.categoryArticle__meta").text.split("|")[0].strip(), d.a["href"]) for d in divs]

Which gives you:

[(u'May 11, 2016 at 17:21', 'http://oilprice.com/Latest-Energy-News/World-News/Oil-Hits-6-Month-High-on-Crude-Inventory-Draw.html'),
 (u'May 11, 2016 at 16:56', 'http://oilprice.com/Latest-Energy-News/World-News/Nigerian-President-Lashes-Out-At-UK-Over-Stolen-Assets.html'),
 (u'May 11, 2016 at 15:41', 'http://oilprice.com/Latest-Energy-News/World-News/Germany-Ups-Gazprom-Imports-by-19-percent-in-Q1.html'),
 (u'May 11, 2016 at 15:39', 'http://oilprice.com/Latest-Energy-News/World-News/Solar-Hits-Millionth-Installation-In-The-US-Faster-Growth-Ahead.html'),
 (u'May 11, 2016 at 14:14', 'http://oilprice.com/Latest-Energy-News/World-News/OPEC-Production-Up-140000-Bpd-in-April.html'),
 (u'May 11, 2016 at 14:03', 'http://oilprice.com/Latest-Energy-News/World-News/Tullow-Ghana-Oil-Production-Down-by-More-Than-50.html'),
 (u'May 11, 2016 at 13:47', 'http://oilprice.com/Latest-Energy-News/World-News/Tesla-To-Complete-Model-3-Design-By-End-June.html'),
 (u'May 11, 2016 at 12:30', 'http://oilprice.com/Latest-Energy-News/World-News/Iraqi-Kurds-Boost-Oil-Exports-to-Turkey.html'),
 (u'May 11, 2016 at 11:57', 'http://oilprice.com/Latest-Energy-News/World-News/Security-Services-Raid-Headquarters-of-Ukraines-Largest-Gas-Company.html'),
 (u'May 11, 2016 at 10:59', 'http://oilprice.com/Latest-Energy-News/World-News/Oil-Up-3-AS-EIA-Reports-34M-Barrel-Crude-Inventory-Drop.html')]

It is always better to associate values through the parent tag when possible; pulling all the anchors and slicing is not a very robust approach.
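That parent-first pairing can be seen on a small standalone snippet (made-up URLs and dates mimicking the page structure, so it runs without hitting the network):

```python
from bs4 import BeautifulSoup

# Toy markup mirroring the live page's structure (hypothetical values)
html = """
<div class="categoryArticle__content">
  <a href="http://example.com/story-1">Story 1</a>
  <p class="categoryArticle__meta">May 11, 2016 at 17:21 | Author</p>
</div>
<div class="categoryArticle__content">
  <a href="http://example.com/story-2">Story 2</a>
  <p class="categoryArticle__meta">May 11, 2016 at 16:56 | Author</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each date stays attached to its own link because we read both
# from the same parent div
pairs = [(d.select_one("p.categoryArticle__meta").text.split("|")[0].strip(), d.a["href"])
         for d in soup.select("div.categoryArticle__content")]
print(pairs)
```

Because every (date, href) pair comes from one `div`, a missing anchor or an extra paragraph elsewhere on the page cannot shift the pairing.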

select and select_one use CSS selectors; the equivalent code using find and find_all would be:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://oilprice.com/Latest-Energy-News/World-News").content, "html.parser")
main_div = soup.find("div", class_="tableGrid__column tableGrid__column--articleContent category")
divs = main_div.find_all("div","categoryArticle__content")

print([(d.find("p", {"class": "categoryArticle__meta"}).text.split("|")[0].strip(), d.a["href"]) for d in divs])

class_=... lets you search by CSS class (class is a reserved word in Python, hence the trailing underscore).
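A quick toy illustration of the class_ keyword (made-up markup, not the live page):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='date'>May 11</p><p class='other'>x</p>", "html.parser")

# class_ filters on the CSS class attribute; the underscore avoids
# clashing with Python's reserved word "class"
print(soup.find_all("p", class_="date")[0].text)
# May 11
```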

Also, in this case categoryArticle__content only appears inside the main div, so you could search for those divs directly instead of first selecting the main div.

soup = BeautifulSoup(requests.get("http://oilprice.com/Latest-Energy-News/World-News").content, "html.parser")

divs = soup.find_all("div","categoryArticle__content")
print([(d.find("p", {"class": "categoryArticle__meta"}).text.split("|")[0].strip(), d.a["href"]) for d in divs])
Padraic Cunningham
  • are select and select_one request functions?? – Derek_P May 11 '16 at 23:28
  • @citramaillo, they are bs4 methods that use css selectors, I will add how to do it using just findall etc.. – Padraic Cunningham May 11 '16 at 23:29
  • Another option is to pair the values using a dictionary if one of the values is guaranteed to be unique. `namedtuple` is a bit better to use than a regular tuple in most cases because it is easier to access without introducing bugs. Building a simple storage class is also a good solution if you need to extend functionality; just store a small piece of the beautifulSoup object, and include methods to retrieve pertinent values. I've been working on something that may be relevant on my [github](https://github.com/Aarowaim/RingAnime) – Aaron3468 May 12 '16 at 00:28
  • 1
    @Aaron3468, it is safer to avoid a dict for that very reason, the dates are definitely not unique and accessing by url would be a arduous. Most likey the OP will be using or storing each date/href pair so a list of tuples is probably as good as any approach in this specific case – Padraic Cunningham May 12 '16 at 00:31

You can make the dates and urls into lists, and then zip them up, like so:

dates = []
urls = []
for p in table.find_all('p', {'class': 'categoryArticle__meta'}):
    date = p.get_text()
    dates.append(date)  # lists use append, not add
for a in table.find_all('a', href=True)[::2][:-5]:
    headline = a['href']  # should this be called headline?
    urls.append(headline)
easy_access = list(zip(dates, urls))


Zip turns the two lists into a list of tuples, so the output might look something like this:

easy_access = [('1/2/12', 'http://somewhere.com'), 
               ('2/2/12', 'http://somewhereelse.com'), 
               ('3/2/12', 'http://nowhere.com'), 
               ('4/2/12', 'http://here.com')]
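The zip step can be checked in isolation with sample values (note that in Python 3 zip returns an iterator, so wrap it in list() if you need a list):

```python
# Sample data standing in for the scraped values
dates = ['1/2/12', '2/2/12']
urls = ['http://somewhere.com', 'http://somewhereelse.com']

# zip pairs elements positionally; list() realizes the iterator (Python 3)
pairs = list(zip(dates, urls))
print(pairs)
# [('1/2/12', 'http://somewhere.com'), ('2/2/12', 'http://somewhereelse.com')]
```

One caveat with this approach: if the page ever has a date without a matching link (or vice versa), zip silently truncates to the shorter list and the remaining pairs shift out of alignment.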
Nevermore

I think if you're trying to use one expression to match both the value in href and the date, it will be pretty difficult (i.e. I have no idea how to do that). However, if you use XPath to navigate to the parts you want and store them, you can easily pick out what you need. Based on the code you provided, I'd recommend something like this:

from lxml import html
import requests

webpage = requests.get('http://www.myexampleurl.com')
tree = html.fromstring(webpage.content)

currentHRef = tree.xpath('//div//a')[0].get("href")
currentDate = tree.xpath('//div//p/text()')[0]

dateTarget = (currentHRef, currentDate)
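To get every pair rather than just the first, you would iterate over the parent divs and read the nested a/p relative to each one; a minimal sketch on toy markup (hypothetical URLs and dates, not the real page):

```python
from lxml import html

# Toy markup mirroring the structure in the question (hypothetical values)
page = """
<div class='blah'>
  <a href='http://example.com/a'>A</a>
  <p class='date'>May 11, 2016</p>
</div>
<div class='blah'>
  <a href='http://example.com/b'>B</a>
  <p class='date'>May 12, 2016</p>
</div>
"""
tree = html.fromstring(page)

# The leading ".//" makes each sub-query relative to its parent div,
# so every href stays paired with its own date
pairs = [(d.xpath('.//a')[0].get('href'), d.xpath('.//p/text()')[0])
         for d in tree.xpath("//div[@class='blah']")]
print(pairs)
```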
Jayson