
Use this link in raw_input: http://edition.cnn.com/

import urllib
import re


CNN_Technology = (raw_input('Paste your link here: '))

urls = ["http://edition.cnn.com/"]
pattern = 'Entertainment</a><a class="nav-menu-links__link" href="//(.+?)data-analytics-header="main-menu_tech'
result = re.compile(pattern)

for url in urls:
    htmlsource = urllib.urlopen(url)
    htmltext = htmlsource.read()
    cnntech = re.findall(result, htmltext)
    print ""
    print "CNN Link:"
    print cnntech
    print ""

I want the newly found link, money.cnn.com/technology/, to take the place of "cnntech" below, and then I want to scan that page as well.

urls = ["cnntech"] 
pattern = 'Entertainment</a><a class="nav-menu-links__link" href="//(.+?)data-analytics-header="main-menu_tech'
result = re.compile(pattern)

for url in urls:
    htmlsource = urllib.urlopen(url)
    htmltext = htmlsource.read()
    cnntech2 = re.findall(result, htmltext)
    print "CNN Link:"
    print cnntech2
  • Trying to extract pieces of HTML with a regular expression is ... how to put it? [A controversial topic](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). But trying to extract a precise link with a regular expression that itself consists of HTML tags is sheer madness. You *definitely* need to learn how to use an HTML parsing library, for example [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#). – Ettore Rizza Feb 18 '17 at 00:52

1 Answer


Well, let's imagine for a moment that regexes are a great, great way to parse HTML; for now, we're living in science fiction.

The output of your first script looks like this: ['money.cnn.com/technology/" ']

This is a list containing a broken link: the http:// protocol is missing and there is a stray quotation mark (plus a space) at the end. urllib can't do anything with that.

The first thing to do is to fix your regex in order to get the most correct URL possible:

pattern = 'Entertainment</a><a class="nav-menu-links__link" href="//(.+?)" data-analytics-header="main-menu_tech'

Now, add the prefix "http://" to all the urls in your cnntech list:

urls = []
for links in cnntech:
    urls.append("http://" + links)
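The same thing can also be written as a one-line list comprehension, if you prefer:

urls = ["http://" + link for link in cnntech]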

Finally, you can try the second part of the script:

pattern = YOUR_SECOND_REGEX  # I do not understand what you want to extract
result = re.compile(pattern)

for url in urls:
    html = urllib.urlopen(url).read()
    cnntech2 = re.findall(result, html)
    print "CNN Link:", cnntech2, "\n"

Now, back to the real world with the same example, but this time using an HTML parser like PyQuery.

import requests #better than urllib
from pyquery import PyQuery as pq

urls = ["http://edition.cnn.com/"]

for url in urls:
    response = requests.get(url)
    doc = pq(response.content)
    cnntech = doc('.m-footer__subtitles--money .m-footer__list-item:nth-child(3) .m-footer__link').attr('href')
    print("CNN Link: ", cnntech)

Output:

CNN Link:  http://money.cnn.com/technology/

The strange string '.m-footer__subtitles--money .m-footer__list-item:nth-child(3) .m-footer__link' is a CSS selector. It looks even more frightening than a regular expression at first glance, and yet it is much simpler. You can find it easily with a tool like the SelectorGadget extension for Google Chrome.
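
And since BeautifulSoup came up in the comments: it accepts the same kind of CSS selector through select_one (full selector support, including :nth-child, depends on your bs4 version), so a roughly equivalent sketch would be:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://edition.cnn.com/")
soup = BeautifulSoup(response.content, "html.parser")
# same CSS selector as above
link = soup.select_one('.m-footer__subtitles--money .m-footer__list-item:nth-child(3) .m-footer__link')
print("CNN Link: ", link.get('href') if link else None)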

  • Thanks for helping. I will take a look at BeautifulSoup after I have tried your answer. – Pythor Feb 18 '17 at 22:54
  • @Pythor You're welcome. I've updated the answer with an example script using PyQuery, very similar in its principles to BeautifulSoup. – Ettore Rizza Feb 19 '17 at 00:17
  • Thanks man, I appreciate your help. I will get working on BeautifulSoup right away. – Pythor Feb 20 '17 at 14:58