
I started a small project: I am trying to scrape http://pr0gramm.com/ and save the tags shown under a picture in a variable, but I can't get it to work.

I am looking for this element in the page source:

<a class="tag-link" href="/top/Flaschenkind">Flaschenkind</a>

I only need the text "Flaschenkind" to be saved, along with the other tags that follow it on that line.

This is my code so far:

import requests
from bs4 import BeautifulSoup

url = "http://pr0gramm.com/"
r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")

links = soup.find_all("div", {"class" : "item-tags"})

print(links)

Unfortunately, this is all I get as output:

[]

I already tried changing the URL to http://pr0gramm.com/top/, but I get the same output. Could this be because the site is built with JavaScript, so the data is not present in the HTML that requests downloads?

Remi Guan
kratze
  • Your webpage appears to be JavaScript protected. Take a look at http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-pythonpage – Wondercricket Mar 07 '16 at 20:42
  • Thank you very much, I suspected that I was stuck because of JS. I'll take a look at the link you posted. – kratze Mar 07 '16 at 20:52
  • So you want to get `class="tag-link"` for example, why are you searching for `{"class" : "item-tags"}` in your code? – Remi Guan Mar 07 '16 at 20:59
  • Actually, from this output `http://img.pr0gramm.com/2016/03/07/f693234d558334d7.jpg ['Datsun 1600 Wagon', 'Garage 88', 'Kombi', 'nur Oma liegt tiefer', 'rolladen', 'slow']` I want to get only this: `Datsun 1600 Wagon, Garage 88, Kombi, nur Oma liegt tiefer, rolladen, slow` – kratze Mar 07 '16 at 21:07

2 Answers


The problem is that this is a dynamic site: all of the data you see is loaded via additional XHR calls to the website's JSON API. You need to simulate those calls in your code.

Working example using requests:

from urllib.parse import urljoin

import requests

base_image_url = "http://img.pr0gramm.com"
with requests.Session() as session:
    response = session.get("http://pr0gramm.com/api/items/get", params={"flags": 1, "promoted": "1"})

    posts = response.json()["items"]
    for post in posts:
        image_url = urljoin(base_image_url, post["image"])

        # get tags
        response = session.get("http://pr0gramm.com/api/items/info", params={"itemId": post["id"]})
        post_data = response.json()
        tags = [tag["tag"] for tag in post_data["tags"]]

        print(image_url, tags)

This prints each post's image URL along with a list of its tags:

http://img.pr0gramm.com/2016/03/07/f693234d558334d7.jpg ['Datsun 1600 Wagon', 'Garage 88', 'Kombi', 'nur Oma liegt tiefer', 'rolladen', 'slow']
http://img.pr0gramm.com/2016/03/07/185544cda956679e.webm ['Danke Merkel', 'deeskalierte zeitnah', 'demokratie im endstadium', 'Fachkraft', 'Far Cry Primal', 'Invite is raus', 'typ ist nackt', 'VVS', 'webm', 'zeigt seine stange']
http://img.pr0gramm.com/2016/03/07/4a6719b33219fd87.jpg ['bmw', 'der Gerät', 'Drehmoment', 'für mehr Motorräder auf pr0', 'Motorrad']
...
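If only the tag names are needed as one comma-separated line, as requested in the comments on the question, the list can be joined into a string. A minimal sketch, using the first tag list from the output above as sample data (the variable name `tag_line` is only illustrative):

```python
# Sample data copied from the first line of the output above.
tags = ['Datsun 1600 Wagon', 'Garage 88', 'Kombi', 'nur Oma liegt tiefer', 'rolladen', 'slow']

# str.join concatenates the list items with ", " between them.
tag_line = ", ".join(tags)
print(tag_line)
# prints: Datsun 1600 Wagon, Garage 88, Kombi, nur Oma liegt tiefer, rolladen, slow
```

In the loop above, the same call would replace `print(image_url, tags)` with something like `print(", ".join(tags))`.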
alecxe
  • This already comes close to what I am looking for, but I only need the tags of the picture that is being viewed at the moment. – kratze Mar 07 '16 at 21:01
  • @kratze what do you mean by "viewed at the moment"? Viewed there? Neither BeautifulSoup, nor requests or urllib is a browser. There is no such thing as viewed at the moment. Thanks. – alecxe Mar 07 '16 at 21:02
  • Yes, maybe it's clearer when I explain what I need this for. I want to save the tags of the picture that the user is viewing at that moment. These tags should then be stored in a list and compared with another list. If any tag matches my prepared list, a banner should be shown on the website. It's a bit like a custom version of Google's DFP that I am trying to build. – kratze Mar 07 '16 at 21:04
  • Is it not possible to show the tags of just the active picture? And to make the code work dynamically, so that each time another picture is activated it fetches the tags of that picture? – kratze Mar 07 '16 at 21:15
  • @kratze what do you mean by "active" picture? What your script would have as an input? Sorry, but it's unclear what the question is about, I think everything you need is already posted. You just need to understand it, tweak if needed and use. – alecxe Mar 07 '16 at 21:17
  • When you click this link, you see that one picture is enlarged: http://pr0gramm.com/top/1215023. I want it to fetch only the tags of the enlarged picture. But yes, thank you very much, your code already helped me a lot, also to understand what I did wrong. – kratze Mar 07 '16 at 21:19
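For the single-picture case raised in these comments, one possible approach is to take the number at the end of a post URL such as http://pr0gramm.com/top/1215023 and pass it to the same `api/items/info` endpoint used in the answer. The sketch below assumes that this trailing number is the `itemId` the API expects, which the thread does not confirm. The function names and the `watchlist` parameter are hypothetical, and `session` is expected to be a `requests.Session` as in the answer.

```python
from urllib.parse import urlparse

# Endpoint taken from the answer above; everything else here is illustrative.
API_INFO_URL = "http://pr0gramm.com/api/items/info"


def item_id_from_url(page_url):
    """Return the numeric ID at the end of a post URL, or None if there is none."""
    last_segment = urlparse(page_url).path.rstrip("/").split("/")[-1]
    return int(last_segment) if last_segment.isdecimal() else None


def tags_for_item(session, item_id):
    """Fetch the tag names of one item; `session` is expected to be a requests.Session."""
    response = session.get(API_INFO_URL, params={"itemId": item_id})
    response.raise_for_status()
    return [tag["tag"] for tag in response.json()["tags"]]


def matching_tags(tags, watchlist):
    """Return the tags that also appear in `watchlist` (a hypothetical prepared list)."""
    wanted = set(watchlist)
    return [tag for tag in tags if tag in wanted]


# Example usage (needs network access and the requests package):
# import requests
# with requests.Session() as session:
#     item_id = item_id_from_url("http://pr0gramm.com/top/1215023")
#     print(matching_tags(tags_for_item(session, item_id), ["Kombi", "bmw"]))
```

A script like this only knows which picture is "active" if it is given that picture's URL or ID as input. Reacting to clicks inside the browser would have to happen in the page itself.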

First off, the URL you are using is the JavaScript-enabled version of the site. They also offer a static version at www.pr0gramm.com/static/, where you'll find the content formatted more like your example suggests you expect.

Using this static version of the URL, I retrieved the <a> tags with the code below, which is similar to yours except that I removed the class filter. It is written for Python 2.7.

import bs4
import urllib2

def main():

    url = "http://pr0gramm.com/static/"
    try:
        fin = urllib2.urlopen(url)
    except urllib2.URLError:
        # Covers HTTP errors too, since HTTPError subclasses URLError
        print "Url retrieval failed url:",url
        return None

    html = fin.read()

    bs = bs4.BeautifulSoup(html,"html5lib")

    links = bs.find_all("a")
    print links
    return None


if __name__ == "__main__":
    main()
JimmyNJ
  • Thank you for your suggestion. I tried to change this to Python 3.x, but then it doesn't print what I need. – kratze Mar 07 '16 at 21:02
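For the Python 3 attempt mentioned in the last comment, a rough port could look like the sketch below. It assumes the static page marks tags with `<a class="tag-link">` elements like the one quoted in the question, which has not been verified here. The function names are illustrative, and BeautifulSoup's built-in "html.parser" backend is swapped in for html5lib to avoid an extra dependency.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_tag_names(html):
    """Return the link text of every <a class="tag-link"> element in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [link.get_text(strip=True) for link in soup.select("a.tag-link")]


def fetch_tag_names(url="http://pr0gramm.com/static/"):
    """Download the static page (URL from the answer above) and extract tag names."""
    with urlopen(url) as response:
        return extract_tag_names(response.read())


# Parsing the element quoted in the question:
sample = '<a class="tag-link" href="/top/Flaschenkind">Flaschenkind</a>'
print(extract_tag_names(sample))  # prints: ['Flaschenkind']
```

Selecting `a.tag-link` returns only the tag links rather than every `<a>` element, so the output is a plain list of tag names instead of a list of whole elements.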