
I am trying to create a program to pull all the links from a webpage and put them into a list.

import urllib.request as ur

#user defined functions
def findLinks(website):
    links = []
    line = website.readline()
    while 'href=' not in line: 
        line = website.readline() 
    while '</a>' not in line :
        links.append(line)
        line = website.readline()



#connect to a URL
website = ur.urlopen("https://www.cs.ualberta.ca/")
findLinks(website)

When I run this program it pauses for a while and then returns a TypeError: string does not support the buffer interface.

Anyone with any pointers?

Daniel Roseman
  • Which version of python? – Logan Jan 12 '16 at 16:36
  • There are many tools to make this much easier; you are assuming that the HTML contains line breaks, and that no link spans a line break. You should Google "finding links Python" - that should bring you back to some useful Q&A here. – PyNEwbie Jan 12 '16 at 16:41
  • 3
    Possible duplicate of [how can I get href links from html code](http://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-code) – PyNEwbie Jan 12 '16 at 16:42

2 Answers


Python 3 cannot mix bytes and strings: urlopen returns bytes, so to make it "work" I had to change "href=" to b"href=" and "</a>" to b"</a>".
Even then, the links themselves were not extracted. Using re, I was able to do this:

import re

def findthem(website):
    links = []
    line = website.readline()
    while len(line) != 0:
        # urlopen returns bytes, so decode each line before matching the str pattern
        req = re.findall('href="(.*?)"', line.decode())
        for l in req:
            links.append(l)

        line = website.readline()

    return links
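For reference, this regex approach can be exercised without a network call by wrapping sample HTML in io.BytesIO, which offers the same bytes-returning readline() as urlopen's response (the function is repeated in compact form so the snippet runs on its own, and the sample page is made up):

```python
import io
import re

def findthem(website):
    # Collect every href value, decoding each bytes line before matching
    links = []
    line = website.readline()
    while len(line) != 0:
        links.extend(re.findall('href="(.*?)"', line.decode()))
        line = website.readline()
    return links

# Simulate a response: readline() yields bytes, then b"" at EOF
page = io.BytesIO(b'<a href="/about">About</a>\n<a href="/contact">Contact</a>\n')
print(findthem(page))  # ['/about', '/contact']
```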
Rolbrok
  • Oh nice post, I was looking at an easy way, but I don't really know any other solutions except by reading other stackoverflow posts. Thank you. – Rolbrok Jan 12 '16 at 16:59
  • Yeah, that's one to bookmark. People on here get really upset whenever you suggest using regex to parse HTML. – OneCricketeer Jan 12 '16 at 17:27
  • Thank you, that fixed the problem! For future reference, why was it that the other method wouldn't work? – spaceinvaders101 Jan 12 '16 at 17:32
  • The code returned a list of lines containing links, not the links themselves: the script read lines until it found an `href=`, then appended every following line that did not contain an `</a>`. When you do something like that, you should take into consideration that not every HTML page is written with indentation, newlines, etc. This is why using HTML/XML parsers is recommended: they are much more efficient. – Rolbrok Jan 12 '16 at 18:04
  • one last question... for a link like `<a href="...">Tillie</a>`, how would I go about extracting specifically the part that says 'Tillie' before the `</a>`? – spaceinvaders101 Jan 12 '16 at 19:18
  • This is answered [here](http://stackoverflow.com/a/7911577/5775381) (again, using regex, sorry). Inside the `re.compile`, changing `r'href="(.*?)"'` to `r'<a href=".*?">(.*?)</a>'` should do the trick. – Rolbrok Jan 13 '16 at 14:59

A better way to get all the links from a URL would be to parse the HTML using a library like BeautifulSoup.

Here's an example that grabs all links from a URL and prints them.

import requests
from bs4 import BeautifulSoup

# Download the page and hand the HTML to the parser
html = requests.get("https://www.cs.ualberta.ca/").text
soup = BeautifulSoup(html, "html.parser")

# Print the href of every anchor tag that has one
for a in soup.find_all("a"):
    link = a.get("href")
    if link:
        print(link)
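Since the original question wanted the links in a list rather than printed, the same parsing step can be packaged as a small helper. The name extract_links is my own, and taking an HTML string as the argument keeps it easy to try without a network call:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    # Return the href of every anchor tag that has one
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

sample = '<a href="/a">A</a><a name="anchor-only">no href</a><a href="/b">B</a>'
print(extract_links(sample))  # ['/a', '/b']
```

Anchors without an href (like the name-only one above) are skipped rather than appearing as None in the list.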
Leo