0

I cannot get my program to work, and I have been trying for a long time. It is pretty simple, but I cannot get it right: it is supposed to return anything containing "html". It is really frustrating. This is for command-line Python 2.x. Here it is:

#!/usr/bin/env python

import sys
import re

#Make this program work both on python 2.x and Python 3.x
if (sys.version_info[0] == 3): raw_input = input

import urllib2
url = urllib2.urlopen('http://makeitwork.com/')
data = url.read()
urlsearch = re.findall(r'href=[\'"]?([^\'"]+)' , data)

for x in urlsearch:
    line = x.split()
    print(" %s" %line[0])
  • Questions seeking debugging help (**"why isn't this code working?"**) must include the desired behavior, *a specific problem or error* and *the shortest code necessary* to reproduce it **in the question itself**. Questions without **a clear problem statement** are not useful to other readers. See: [How to create a Minimal, Complete, and Verifiable Example](http://stackoverflow.com/help/mcve). – MattDMo Nov 23 '15 at 02:05

2 Answers

3

Try BeautifulSoup. Never use regex to parse HTML code:

import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen('http://makeitwork.com/')
data = url.read()

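# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(data, 'html.parser')
# find_all takes the tag name as a string; each result is a Tag object.
for link in soup.find_all('a'):
    print(link.get('href'))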
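The question asks only for links containing "html", which the loop above does not filter for. Here is a minimal sketch of that filter. It swaps BeautifulSoup for the standard library's HTMLParser so it runs without extra installs on both Python 2 and 3. The names LinkCollector, links_containing and SAMPLE are made up for illustration, and the sample markup stands in for the downloaded page:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def links_containing(html_text, needle='html'):
    """Return the hrefs found in html_text that contain needle."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [link for link in collector.links if needle in link]


# Hypothetical sample markup, not the real page.
SAMPLE = '''
<a href="index.html">Home</a>
<a href='/about'>About</a>
<a class="nav" href="docs/guide.html" id="g">Guide</a>
'''

for link in links_containing(SAMPLE):
    print(link)
```

With the BeautifulSoup version, the same filter would likely be a single `if` on the value returned by `link.get('href')`.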
Remi Guan
0

Try using this RegEx

r'a\shref="/?(.*?)">'

Basically, this captures everything after the opening <a href=" and before the closing ">.
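A quick sketch of how this kind of pattern behaves, assuming a made-up sample line rather than the real page. A greedy (.*) in the capture group runs to the last "> on the line, while the lazy (.*?) stops at the first one. Either way, the pattern only matches when href is the first attribute, is double-quoted, and is immediately followed by >:

```python
import re

# Hypothetical sample line with two links; not taken from the real page.
sample = '<a href="/docs/page.html">Docs</a> <a href="/about">About</a>'

# Greedy capture: swallows everything up to the LAST '">' on the line.
greedy = re.findall(r'a\shref="/?(.*)">', sample)

# Lazy capture: stops at the FIRST '">' after each href.
lazy = re.findall(r'a\shref="/?(.*?)">', sample)

# Keep only the links that contain "html", as the question asks.
html_links = [link for link in lazy if 'html' in link]

print(greedy)      # one long, over-captured string
print(lazy)        # ['docs/page.html', 'about']
print(html_links)  # ['docs/page.html']
```

Note that the optional /? strips a leading slash from each captured link.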

Pang
Scott