
I am trying to create a program to pull all the links from a webpage and put them into a list.

import urllib.request as ur

#user defined functions
def findLinks(website):
    links = []
    line = website.readline()
    while 'href=' not in line: 
        line = website.readline() 
    while '</a>' not in line :
        links.append(line)
        line = website.readline()



#connect to a URL
website = ur.urlopen("https://www.cs.ualberta.ca/")
findLinks(website)

When I run this program it pauses for a while and then returns a TypeError: string does not support the buffer interface.

Anyone with any pointers?

Daniel Roseman
  • Which version of python? – Logan Jan 12 '16 at 16:36
  • There are many tools to make this much easier; you are assuming that the HTML contains line breaks, and that no link spans a line break. You should Google "finding links Python" - that should bring you back to some useful Q&A here. – PyNEwbie Jan 12 '16 at 16:41
  • 3
    Possible duplicate of [how can I get href links from html code](http://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-code) – PyNEwbie Jan 12 '16 at 16:42

2 Answers


Python 3 cannot mix bytes and strings: urlopen returns bytes, so to make it "work" I had to change "href=" to b"href=" and "</a>" to b"</a>".
Even then, the links themselves were not extracted. Using re, I was able to do this:

import re

def findthem(website):
    links = []
    line = website.readline()
    while len(line) != 0:
        # urlopen returns bytes, so decode each line before matching the str pattern
        req = re.findall('href="(.*?)"', line.decode())
        for l in req:
            links.append(l)

        line = website.readline()

    return links
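For reference, this regex approach can be exercised without a network call by wrapping sample HTML in io.BytesIO, which offers the same bytes-returning readline() as urlopen's response (the function is repeated in compact form so the snippet runs on its own, and the sample page is made up):

```python
import io
import re

def findthem(website):
    # Collect every href value, decoding each bytes line before matching
    links = []
    line = website.readline()
    while len(line) != 0:
        links.extend(re.findall('href="(.*?)"', line.decode()))
        line = website.readline()
    return links

# Simulate a response: readline() yields bytes, then b"" at EOF
page = io.BytesIO(b'<a href="/about">About</a>\n<a href="/contact">Contact</a>\n')
print(findthem(page))  # ['/about', '/contact']
```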
Rolbrok
  • Oh nice post, I was looking at an easy way, but I don't really know any other solutions except by reading other stackoverflow posts. Thank you. – Rolbrok Jan 12 '16 at 16:59
  • Yeah, that's one to bookmark. People on here get really upset whenever you suggest using regex to parse HTML. – OneCricketeer Jan 12 '16 at 17:27
  • Thank you, that fixed the problem! For future reference, why was it that the other method wouldn't work? – spaceinvaders101 Jan 12 '16 at 17:32
  • The code returned a list of lines containing links, not the links themselves: the script read lines until it found an `href=`, then appended every following line that did not contain an `</a>`. When you do something like that, you should take into consideration that not every HTML page is written with indentation, newlines, etc. This is why using HTML/XML parsers is recommended: they are much more efficient. – Rolbrok Jan 12 '16 at 18:04
  • one last question... for a link like `<a href="...">Tillie</a>`, how would I go about extracting specifically the part that says 'Tillie' before the `</a>`? – spaceinvaders101 Jan 12 '16 at 19:18
  • This is answered [here](http://stackoverflow.com/a/7911577/5775381) (again, using regex, sorry). Inside the `re.compile`, changing `r'href="(.*?)"'` to `r'<a href=".*?">(.*?)</a>'` should do the trick. – Rolbrok Jan 13 '16 at 14:59

A better way to get all the links from a URL would be to parse the HTML using a library like BeautifulSoup.

Here's an example that grabs all links from a URL and prints them.

import requests
from bs4 import BeautifulSoup

# Download the page and hand the HTML to the parser
html = requests.get("https://www.cs.ualberta.ca/").text
soup = BeautifulSoup(html, "html.parser")

# Print the href of every anchor tag that has one
for a in soup.find_all("a"):
    link = a.get("href")
    if link:
        print(link)
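Since the original question wanted the links in a list rather than printed, the same parsing step can be packaged as a small helper. The name extract_links is my own, and taking an HTML string as the argument keeps it easy to try without a network call:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    # Return the href of every anchor tag that has one
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

sample = '<a href="/a">A</a><a name="anchor-only">no href</a><a href="/b">B</a>'
print(extract_links(sample))  # ['/a', '/b']
```

Anchors without an href (like the name-only one above) are skipped rather than appearing as None in the list.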
Leo