
The script downloads images from wallbase.cc's random and toplist pages. It looks for a seven digit string that identifies each image, then inserts that id into a URL and downloads the file. The only problem I seem to have is isolating the seven digit string.

What I want to be able to do is:

Search for <div id="thumbxxxxxxx" and then assign xxxxxxx to a variable.
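Concretely, the markup looks something like this (I'm guessing at the surrounding attributes), and I want to go from it to the download URL my script already builds:

```python
# Hypothetical sample of the markup described above; the real
# page may differ in its other attributes.
the_page = '<div id="thumb1750539" class="thumbnail">'

# The goal: end up with the seven digit id in a variable...
image_id = '1750539'

# ...and plug it into the download URL used later in the script.
url = 'http://wallpapers.wallbase.cc/rozne/wallpaper-' + image_id + '.jpg'
```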

Here's what I have so far.

import urllib
import os
import sys
import re


#Written in Python 2.7 with LightTable


def get_id():
    response = urllib.urlopen('http://wallbase.cc/' + urlid)
    the_page = response.read()
    # Stuck here: I want to find every data-id="xxxxxxx" in
    # the_page and assign the seven digit id to a variable


def toplist():
    #We need to define how to find the images to download
    #The idea is to go to http://wallbase.cc/x and to take all of strings containing <a href="http://wallbase.cc/wallpaper/xxxxxxx" </a>
    #And to request the image file from that URL.
    #Then the file will be put in a user defined directory

    image_id = raw_input("Enter the seven digit identifier for the image to be downloaded to "+ directory+ "...\n>>> ")

    f = open(directory+image_id+ '.jpg','wb')
    f.write(urllib.urlopen('http://wallpapers.wallbase.cc/rozne/wallpaper-'+image_id+'.jpg').read())
    f.close()


directory = raw_input("Enter the directory in which the images will be downloaded.\n>>> ")

initial_prompt = input("What do you want to download from?\n\t1: Toplist\n\t2: Random\n>>> ")
if initial_prompt == 1:
    urlid = 'toplist'
    toplist()

elif initial_prompt == 2:
    urlid = 'random'
    random()

Any/all help is very much appreciated :)

ChrisGPT was on strike

2 Answers


You probably want to use a web scraping library like BeautifulSoup; see e.g. this SO question on web scraping in Python.

import re
import urllib2
from BeautifulSoup import BeautifulSoup

# download and parse HTML
url = 'http://wallbase.cc/toplist'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

# find the links we want
links = soup('a', href=re.compile('^http://wallbase.cc/wallpaper/\d+$'))
for l in links:
    href = l.get('href')
    print href                # u'http://wallbase.cc/wallpaper/1750539'
    print href.split('/')[-1] # u'1750539'
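To connect this back to the question's toplist() function, that trailing id can be dropped straight into the image URL. A minimal sketch, assuming the same URL scheme as the question's code (id hard-coded here since wallbase.cc may not be reachable):

```python
# An href as extracted in the answer above, hard-coded for the sketch
href = u'http://wallbase.cc/wallpaper/1750539'

# The id is the last path segment
image_id = href.split('/')[-1]

# Build the download URL the same way the question's toplist() does
image_url = 'http://wallpapers.wallbase.cc/rozne/wallpaper-' + image_id + '.jpg'
```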
sjy

If you want to only use the default library, you could use regular expressions.

pattern = re.compile(r'<div id="thumb(.{7})"')

...

for data_id in re.findall(pattern, the_page):
    pass # do something with data_id
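For example, run against a snippet shaped the way the question describes (sample markup assumed), findall returns the captured seven-character groups:

```python
import re

# Capture the seven characters after "thumb" in each div's id attribute
pattern = re.compile(r'<div id="thumb(.{7})"')

# Sample markup assumed to match the shape described in the question
the_page = ('<div id="thumb1750539" class="thumb">'
            '<div id="thumb2468101" class="thumb">')

ids = re.findall(pattern, the_page)
```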
charmoniumQ
    I couldn't resist linking to this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – GavinH Feb 03 '14 at 01:13