Returning multiple "href"

Question

I cannot get my program to work, and I have been trying for a long time. Here it is. It is pretty simple, but I cannot get it right: it is supposed to return every href containing "html". It is really frustrating. This is for command-line Python 2.x.

#!/usr/bin/env python

import sys
import re

#Make this program work both on python 2.x and Python 3.x
if (sys.version_info[0] == 3): raw_input = input

import urllib2
url = urllib2.urlopen('http://makeitwork.com/')
data = url.read()
urlsearch = re.findall(r'href=[\'"]?([^\'"]+)' , data)

for x in urlsearch:
    line = x.split()
    print(" %s" %line[0])

Questions seeking debugging help (**"why isn't this code working?"**) must include the desired behavior, *a specific problem or error* and *the shortest code necessary* to reproduce it **in the question itself**. Questions without **a clear problem statement** are not useful to other readers. See: [How to create a Minimal, Complete, and Verifiable Example](http://stackoverflow.com/help/mcve). — MattDMo, Nov 23 '15 at 02:05

score 3 · Answer 1 · answered Nov 23 '15 at 02:11

Try BeautifulSoup. Never use regex to parse HTML:

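import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen('http://makeitwork.com/')
data = url.read()

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, 'html.parser')
# The tag name must be a string, and the loop variable must match the one used below
for link in soup.find_all('a'):
    print(link.get('href'))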
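
As a rough sketch of the same parser-based idea, this example also applies the filter the question asks for (keep only links containing "html"). It swaps BeautifulSoup for the standard library's HTMLParser so that it runs without extra packages. The class name HrefCollector, the helper html_links, and the sample markup are all made up for illustration. In a real script, the page text from url.read() would take the place of the sample string (decoded to text first on Python 3).

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value is not None:
                self.hrefs.append(value)


def html_links(markup):
    """Return the href values in markup that contain the text 'html'."""
    collector = HrefCollector()
    collector.feed(markup)
    return [href for href in collector.hrefs if 'html' in href]


# Sample markup, invented for this example.
sample = ('<a href="/one.html">One</a> <a href="two.php">Two</a> '
          "<a href='three.html'>Three</a> <a name=\"top\">no href</a>")

for href in html_links(sample):
    print(href)
```

With BeautifulSoup the filter would be the same one-line check, `'html' in href`, applied to each non-empty value returned by link.get('href').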

answered Nov 23 '15 at 02:11

Remi Guan

score 0 · Answer 2 · edited Nov 23 '15 at 03:52

Try using this regex:

r'a\shref="/?(.*)">'

Basically, it captures everything after the opening <a href=" of the tag (minus an optional leading slash) and before the closing ">.
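
A quick check on a made-up line of HTML suggests one caveat: .* is greedy, so when two links share a line it captures everything from the first href to the last ">. The sketch below compares it with a narrower pattern, href="([^"]*)". That pattern is offered here as an assumption, not as part of the original answer, and it still only handles double-quoted attributes.

```python
import re

# Sample line, invented for this example: two links on one line.
line = '<a href="/one.html">One</a> <a href="two.html">Two</a>'

# Pattern from the answer: the greedy .* runs on to the last "> in the line.
greedy = re.findall(r'a\shref="/?(.*)">', line)

# Narrower variant (an assumption): stop at the attribute's closing quote.
narrow = re.findall(r'href="([^"]*)"', line)

print(greedy)  # a single match spanning both links
print(narrow)  # one match per link

# Keep only the links that contain "html", as the question asks.
html_only = [href for href in narrow if 'html' in href]
print(html_only)
```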

edited Nov 23 '15 at 03:52

Pang

answered Nov 23 '15 at 03:39

Scott
