First post on Stack Overflow. I'll try my best to format correctly.
I'm working on a small Python script (a language I have little experience with) to scrape images off image subreddits. Currently I can pull down an HTML page, for example from r/pics, but I'm having trouble parsing it for image URLs, specifically ones from imgur. What I'd like to do is filter out URLs of the form
http://i.imgur.com/*******.png
into a tuple, but I'm unsure how to do this.
My current attempt looks like this:
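from subprocess import call
picture_url_list = []
# Download the subreddit page into a local file named "redithtml".
return_code = call("wget -O redithtml www.reddit.com/r/pics/", shell=True)
inputfile = open("redithtml")
find_text = "http://i.imgur.com/"
for line in inputfile:
    while True:
        # find() returns the first match, so the rest of the line can still be searched.
        this_url = line.find(find_text)
        if this_url == -1:
            break
        # 30 = 19-character prefix + 7-character image id + ".png"
        line_partition = line.partition(line[this_url:this_url + 30])
        picture_url_list.append(line_partition[1])
        # Continue with whatever follows the URL just collected.
        line = line_partition[2]
        if len(line) == 0:
            break
inputfile.close()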
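To make the goal concrete, here is a minimal sketch of the same scanning idea applied to a plain string instead of a downloaded file. The helper name `find_imgur_urls`, its parameters, and the sample HTML are all made up for illustration, and it assumes every image id is exactly 7 characters long, as in the pattern above.

```python
def find_imgur_urls(text, prefix="http://i.imgur.com/", id_length=7, suffix=".png"):
    """Collect fixed-length imgur URLs from text (hypothetical helper)."""
    url_length = len(prefix) + id_length + len(suffix)
    urls = []
    start = text.find(prefix)
    while start != -1:
        candidate = text[start:start + url_length]
        # Keep only candidates that are full length and end in the expected suffix.
        if len(candidate) == url_length and candidate.endswith(suffix):
            urls.append(candidate)
        # Resume the search just past the prefix of this match.
        start = text.find(prefix, start + len(prefix))
    return tuple(urls)


# Made-up sample input, not real reddit markup.
sample = (
    '<a href="http://i.imgur.com/abc1234.png">one</a> '
    '<a href="http://i.imgur.com/def5678.jpg">two</a> '
    '<img src="http://i.imgur.com/zzz9999.png">'
)
print(find_imgur_urls(sample))
```

Passing a start offset to `find` avoids re-slicing the line on every match, and returning a tuple matches what I want to hand to the next stage.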
I've been looking here for help, but the only examples I found use 're' or 'fnmatch' to parse through tuples, not text streams.
So, just to clarify: I am trying to scrape images off reddit by finding i.imgur.com URLs and placing them in a tuple, to be downloaded in the next segment of code (not shown).
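As far as I can tell, `re.findall` also accepts an ordinary string, so something like the sketch below might be closer to what I need. The pattern is my own assumption (a 7-character alphanumeric id followed by `.png`), and `extract_imgur_urls` and the sample HTML are hypothetical names and data, not part of my script.

```python
import re

# Assumed pattern: the i.imgur.com prefix, a 7-character alphanumeric id, then ".png".
IMGUR_PNG = re.compile(r"http://i\.imgur\.com/[A-Za-z0-9]{7}\.png")


def extract_imgur_urls(html):
    """Return every matching URL in html, in order of appearance, as a tuple."""
    return tuple(IMGUR_PNG.findall(html))


# Made-up sample input, not real reddit markup.
sample_html = (
    '<a href="http://i.imgur.com/abc1234.png">one</a>\n'
    '<a href="http://example.com/other.png">two</a>\n'
    '<img src="http://i.imgur.com/zzz9999.png">\n'
)
print(extract_imgur_urls(sample_html))
```

The whole file could be read with `open("redithtml").read()` and passed in as one string, which would remove the need for the line-by-line loop.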