
First post on Stack Overflow. I'll try my best to format correctly.

I'm working on a little Python script, which I have little experience with, to scrape images from image subreddits. Currently I can pull down an HTML page, for example from r/pics, but I'm having trouble parsing it for image URLs, specifically ones from imgur. What I'd like to do is filter out URLs of the form

http://i.imgur.com/*******.png

into a tuple, but I'm unsure how to do this.

My current attempt looks like this:

    from subprocess import call
    picture_url_list = []
    return_code = call("wget -O redithtml www.reddit.com/r/pics/", shell=True)

    inputfile = open("redithtml")
    find_text = "http://i.imgur.com/"

    for line in inputfile:
        while True:
            # Use find, not rfind: rfind returns the *last* match, so any
            # earlier URLs on the line would land in line_partition[0]
            # below and be silently dropped.
            this_url = line.find(find_text)
            if this_url == -1:
                break
            # Assumes every URL is exactly 31 characters long.
            line_partition = line.partition(line[this_url:this_url + 31])
            picture_url_list.append(line_partition[1])
            line = line_partition[2]
            if len(line) == 0:
                break

I've been looking here for help, but the only examples I've found use 're' or 'fnmatch' to parse tuples, not text streams.

So, just to clarify: I am trying to scrape images off reddit by finding i.imgur URLs and placing them in a tuple, to be downloaded by the next segment of code (not shown).
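For reference, a minimal sketch of the 're' approach applied to the downloaded file; the alphanumeric-id pattern and the set of extensions are assumptions about imgur's URL format:

    import re

    # Assumed pattern: the i.imgur.com prefix, an alphanumeric id, and a
    # common image extension; adjust if imgur serves other formats.
    imgur_pattern = re.compile(r'http://i\.imgur\.com/\w+\.(?:png|jpg|gif)')

    with open("redithtml") as inputfile:
        # findall scans the whole text at once; tuple() freezes the matches.
        picture_urls = tuple(imgur_pattern.findall(inputfile.read()))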

  • I think you can safely assume that there is only 1 "http://i.imgur.com/" per line (just did a quick skim of the source code). If this is the case, you can do: `start = line.find("http://i.imgur.com/"); url = line[start:line.find('"', start)]`. What that does is take line[start of the URL : first quotation mark after it]. – 1478963 May 13 '14 at 17:33
    Actually, it looked like many of the i.imgur.com/ references were in one line. –  May 13 '14 at 17:36
  • I would recommend using for example BeautifulSoup to scrape the webpage for all urls. And then filter those to get the ones you're interested in. [Web Scraping with BeautifulSoup](http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/) – M4rtini May 13 '14 at 17:37
  • I did not catch that. I'm sorry. You can do `temp_line_array = line.split("href=")`. What this does is return a list, and each element will only ever contain one link. Note that not all of them will have an imgur link, but you can quickly check with `if "i.imgur" in element` (see the sketch after these comments). – 1478963 May 13 '14 at 17:40
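
A minimal sketch of the split-on-href approach from the last comment, assuming the page was saved to "redithtml" as in the question:

    # Split on href= so each piece contains at most one link.
    html = open("redithtml").read()
    picture_url_list = []
    for element in html.split("href="):
        # Every element after the first starts with the quoted URL.
        if element.startswith('"') and "i.imgur" in element:
            picture_url_list.append(element.split('"')[1])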

1 Answer


Using BeautifulSoup and requests to download and process the page.

    from bs4 import BeautifulSoup
    import requests

    r = requests.get("http://www.reddit.com/r/pics/")
    data = r.text
    soup = BeautifulSoup(data)

    for link in soup.find_all('a', href=True):
        linkHref = link.get('href')
        if linkHref.startswith('http://i.imgur.com/'):
            print(linkHref)

`soup.find_all('a', href=True)` will get all links with a defined href attribute. In the loop we check whether the link starts with http://i.imgur.com/; if it does, we print it (here you would add the code that does whatever you want with that image).
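If the next step is saving each matched image into a folder, here is a minimal sketch building on the loop above; the folder name and the use of the URL's last path segment as the filename are assumptions:

    import os
    import requests

    def save_image(url, folder="imgur_pics"):
        # Create the target folder once, then store the image bytes under
        # their imgur filename, e.g. abc1234.png.
        os.makedirs(folder, exist_ok=True)
        response = requests.get(url)
        response.raise_for_status()
        name = url.rsplit("/", 1)[-1]
        with open(os.path.join(folder, name), "wb") as f:
            f.write(response.content)

Calling `save_image(linkHref)` inside the `if` branch above would download each match as it is found.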

M4rtini
  • Thanks! This is definitely a step in the right direction. Now I just need to learn how to collate them in folders somewhere else on the file system. You were a great help! –  May 13 '14 at 22:17