
I'm trying to grab proxies from a site with Python by scanning through the page with urllib and finding the proxies with a regex.

A proxy on the page looks something like this:

<a href="/ip/190.207.169.184/free_Venezuela_proxy_servers_VE_Venezuela">190.207.169.184</a></td><td>8080</td><td>

My code looks like this:

for site in sites:
    content = urllib.urlopen(site).read()
    e = re.findall("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\<\/\a\>\<\/td\>\<td\>\d+", content)
    #\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+

for proxy in e:
    s.append(proxy)
    amount += 1

Regex:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\<\/\a\>\<\/td\>\<td\>\d+

I know the code works, but the regex is wrong.

Any idea on how I could fix this?

EDIT: http://www.regexr.com/ seems to think my regex is fine?
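
For reference, the escaping is the likely problem: in Python, the `\a` in the pattern is interpreted as the ASCII bell character (`\x07`) rather than a literal "a", so the pattern can never match the closing `</a>`. regexr.com runs JavaScript regexes, where `\a` is treated as a plain "a", which is probably why it appears to work there. A minimal sketch of a corrected pattern, written as a raw string with capture groups for the IP and port (reusing the sample row from above):

import re

# sample row copied from the question
line = '<a href="/ip/190.207.169.184/free_Venezuela_proxy_servers_VE_Venezuela">190.207.169.184</a></td><td>8080</td><td>'

# raw string; <, >, and / need no escaping, and "a" must not be escaped at all
pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</a></td><td>(\d+)')

print pattern.findall(line)
# [('190.207.169.184', '8080')]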

    look into `lxml` or `beautifulsoup`. Using regex for html is a hack. – Martin Konecny Oct 04 '14 at 18:08
    don't escape `<,>,a,/` http://regex101.com/r/xB5sT0/2 – Avinash Raj Oct 04 '14 at 18:10
  • Refer this http://stackoverflow.com/questions/26183643/find-specific-text-in-beautifulsoup/26183877#26183877 – Hussain Shabbir Oct 04 '14 at 18:11
  • Also, you need to use a raw string if you don't want to escape every \ in your regex: prefix the string with `r`, e.g. `r"\d{1,3}"` – Alex Riley Oct 04 '14 at 18:13
  • The site even has an "export as JSON" and "export as text" feature. Maybe you're riding the wrong horse? – Tomalak Oct 04 '14 at 18:20
  • @Tomalak They only export about 10 proxies at a time – Cephon Oct 04 '14 at 18:26
  • I see. It's not the solution you asked for, but maybe you should let [kimonolabs](http://www.kimonolabs.com) do the crawling for you. They make this task beautifully easy and you can get your data programmatically in form of RSS or JSON. I've created a sample API here: https://www.kimonolabs.com/apis/4w7y5cxi, if everything works out it should even follow the pagination and crawl everything in the background. Give it a shot. – Tomalak Oct 04 '14 at 18:30
  • ...turns out automatic pagination does not work because that proxy list page is missing a generic "next page" link. The rest however works, you can also page via kimono's URL parameters. Well, it was just a thought, by all means, write your own scraper if you want. – Tomalak Oct 04 '14 at 18:49
  • HTML is best parsed via DOM. Regex parsing of html is painful and error prone. For fun, refer to [this stack-o answer](http://stackoverflow.com/a/1732454/564406). – David Oct 06 '14 at 14:13
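
One of the comments above suggests lxml as an alternative; purely as a sketch (the tr id="data" rows and the column positions are assumptions about the page's table markup, matching the structure used in the answer below), the same extraction with lxml and XPath could look like this:

import urllib2
from lxml import html

data = urllib2.urlopen('http://letushide.com/protocol/http/3/list_of_free_HTTP_proxy_servers').read()
tree = html.fromstring(data)

# each proxy row is assumed to be a <tr id="data"> with the IP in the
# second cell and the port in the third
for row in tree.xpath('//tr[@id="data"]'):
    ip = row.xpath('string(td[2])').strip()
    port = row.xpath('string(td[3])').strip()
    print ip, port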

1 Answer


One option would be to use an HTML parser to find IP addresses and ports.

Example (using BeautifulSoup HTML parser):

import re
import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen('http://letushide.com/protocol/http/3/list_of_free_HTTP_proxy_servers')

IP_RE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
PORT_RE = re.compile(r'\d+')

soup = BeautifulSoup(data)
# find every link whose text looks like an IP address, then read the port
# from the following td cell in the same table row
for ip in soup.find_all('a', text=IP_RE):
    port = ip.parent.find_next_sibling('td', text=PORT_RE)
    print ip.text, port.text

Prints:

80.193.214.231 3128
186.88.37.204 8080
180.254.72.33 80
201.209.27.119 8080
...

The idea here is to find all a tags with text matching the \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} regular expression. For each such link, find the parent td's next td sibling with text matching \d+.


Alternatively, since you know the table structure and which columns hold the IPs and ports, you can just get the cell values from each row by index; there is no need for regular expressions here:

import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen('http://letushide.com/protocol/http/3/list_of_free_HTTP_proxy_servers')

soup = BeautifulSoup(data)
# every proxy row on the page carries id="data"; cells 1 and 2 hold the IP and port
for row in soup.find_all('tr', id='data'):
    print [cell.text for cell in row('td')[1:3]]

Prints:

[u'80.193.214.231', u'3128']
[u'186.88.37.204', u'8080']
[u'180.254.72.33', u'80']
[u'201.209.27.119', u'8080']
[u'190.204.96.72', u'8080']
[u'190.207.169.184', u'8080']
[u'79.172.242.188', u'8080']
[u'1.168.171.100', u'8088']
[u'27.105.26.162', u'9064']
[u'190.199.92.174', u'8080']
...
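
If the end goal is still a flat list of ip:port strings, as in the question's s list, the two cells can be joined directly. A small sketch building on the snippet above (the s name and the ip:port format are carried over from the question, not from the original answer):

import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen('http://letushide.com/protocol/http/3/list_of_free_HTTP_proxy_servers')
soup = BeautifulSoup(data)

s = []  # collected proxies, mirroring the list used in the question
for row in soup.find_all('tr', id='data'):
    ip, port = [cell.text for cell in row('td')[1:3]]
    s.append('%s:%s' % (ip, port))

print s
# [u'80.193.214.231:3128', u'186.88.37.204:8080', ...]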
alecxe