
First post on Stack Overflow. I'll try my best to format correctly.

I'm working on a little Python script, which I have little experience with, to scrape images from image subreddits. Currently I can pull down an HTML page, for example from r/pics, but I'm having trouble parsing it for image URLs, specifically ones from imgur. What I'd like to do is filter out URLs of the form

http://i.imgur.com/*******.png

into a tuple, but I'm unsure how to do this.

My current attempt looks like this:

    from subprocess import call
    picture_url_list = []
    return_code = call("wget -O redithtml www.reddit.com/r/pics/", shell=True)

    inputfile = open("redithtml")
    find_text = "http://i.imgur.com/"

    for line in inputfile:
        while True:
            # Use find, not rfind: rfind returns the *last* match, so any
            # earlier URLs on the line would land in line_partition[0]
            # below and be silently dropped.
            this_url = line.find(find_text)
            if this_url == -1:
                break
            # Assumes every URL is exactly 31 characters long.
            line_partition = line.partition(line[this_url:this_url + 31])
            picture_url_list.append(line_partition[1])
            line = line_partition[2]
            if len(line) == 0:
                break

I've been looking here for help, but the only examples I've found use 're' or 'fnmatch' to parse tuples, not text streams.

So, just to clarify: I am trying to scrape images off reddit by finding i.imgur URLs and placing them in a tuple, to be downloaded by the next segment of code (not shown).
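For reference, a minimal sketch of the 're' approach applied to the downloaded file; the alphanumeric-id pattern and the set of extensions are assumptions about imgur's URL format:

    import re

    # Assumed pattern: the i.imgur.com prefix, an alphanumeric id, and a
    # common image extension; adjust if imgur serves other formats.
    imgur_pattern = re.compile(r'http://i\.imgur\.com/\w+\.(?:png|jpg|gif)')

    with open("redithtml") as inputfile:
        # findall scans the whole text at once; tuple() freezes the matches.
        picture_urls = tuple(imgur_pattern.findall(inputfile.read()))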

  • I think you can safely assume that there is only 1 "http://i.imgur.com/" per line (just did a quick skim of the source code). If this is the case, you can do: `start = line.find("http://i.imgur.com/"); url = line[start:line.find('"', start)]`. What that does is take line[start of the URL : first quotation mark after it]. – 1478963 May 13 '14 at 17:33
    Actually, it looked like many of the i.imgur.com/ references were in one line. –  May 13 '14 at 17:36
  • I would recommend using for example BeautifulSoup to scrape the webpage for all urls. And then filter those to get the ones you're interested in. [Web Scraping with BeautifulSoup](http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/) – M4rtini May 13 '14 at 17:37
  • I did not catch that. I'm sorry. You can do `temp_line_array = line.split("href=")`. What this does is return a list, and each element will only ever contain one link. Note that not all of them will have an imgur link, but you can quickly check with `if "i.imgur" in element` (see the sketch after these comments). – 1478963 May 13 '14 at 17:40
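
A minimal sketch of the split-on-href approach from the last comment, assuming the page was saved to "redithtml" as in the question:

    # Split on href= so each piece contains at most one link.
    html = open("redithtml").read()
    picture_url_list = []
    for element in html.split("href="):
        # Every element after the first starts with the quoted URL.
        if element.startswith('"') and "i.imgur" in element:
            picture_url_list.append(element.split('"')[1])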

1 Answer


Using BeautifulSoup and requests to download and process the page.

    from bs4 import BeautifulSoup
    import requests

    r = requests.get("http://www.reddit.com/r/pics/")
    data = r.text
    soup = BeautifulSoup(data)

    for link in soup.find_all('a', href=True):
        linkHref = link.get('href')
        if linkHref.startswith('http://i.imgur.com/'):
            print(linkHref)

`soup.find_all('a', href=True)` will get all links with a defined href attribute. In the loop we check whether the link starts with http://i.imgur.com/; if it does, we print it (here you would add the code that does whatever you want with that image).
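If the next step is saving each matched image into a folder, here is a minimal sketch building on the loop above; the folder name and the use of the URL's last path segment as the filename are assumptions:

    import os
    import requests

    def save_image(url, folder="imgur_pics"):
        # Create the target folder once, then store the image bytes under
        # their imgur filename, e.g. abc1234.png.
        os.makedirs(folder, exist_ok=True)
        response = requests.get(url)
        response.raise_for_status()
        name = url.rsplit("/", 1)[-1]
        with open(os.path.join(folder, name), "wb") as f:
            f.write(response.content)

Calling `save_image(linkHref)` inside the `if` branch above would download each match as it is found.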

M4rtini
  • Thanks! This is definitely a step in the right direction. Now I just need to learn how to collate them in folders somewhere else on the file system. You were a great help! –  May 13 '14 at 22:17