
I am working on an app that uses BeautifulSoup, Python, requests and Django. I'm starting to grasp how to use BeautifulSoup, but drilling down to specific elements is still confusing at times. I created a function, albeit not the best, that scrapes links from posts and follows those links to each post's detail page. From that page it scrapes the script data that contains the YouTube URL and the image associated with it. This is the code

from my scraper.py

def panties():
    pan_url = 'http://www.panvideos.com'
    html = requests.get(pan_url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    video_row = soup.find_all('div', {'class': 'video'})

    def youtube_link(url):
        youtube_page = requests.get(url, headers=headers)
        soupdata = BeautifulSoup(youtube_page.text, 'html5lib')
        video_row = soupdata.find('div', {'class': 'video-player'})
        entries = [{'text': str(div),
                    } for div in video_row][3]
        return entries

    entries = [{'text': div.h4.text,
                'href': div.a.get('href'),
                'tube': youtube_link(div.a.get('href')),
                } for div in video_row][:3]

    return entries

from my views.py

    pan = panties()
    context = {
        'pan': pan,
    }
    return render(request, 'index.html', context)

and in my template

{% for p in pan %}
   Title: {{p.text}}<br>
   Href: {{p.href}}<br>
   Tube: {{p.tube}}<hr>
{% endfor %}

and here's what it returns

Title: Juanka - Esperando por ti (Official Video)
Href: http://www.videos.com/video/2962/juanka-esperando-por-ti-official-video-/
Tube: {'text': '<script type="text/javascript">jwplayer("video-setup").setup({file:"http://www.youtube.com/watch?v=QL4JFUHd71o",image:"http://i1.ytimg.com/vi/QL4JFUHd71o/maxresdefault.jpg",primary:"html5",stretching:"fill","controlbar":"bottom",width:"100%",aspectratio:"16:9",autostart:"true",logo:{file:"http://www.panvideos.com/uploads/gopcds-png5787dbcd53a72.png",position:"bottom-right",link:"http://www.panvideos.com/"},sharing:{link:"http://www.panvideos.com/video/2962/juanka-esperando-por-ti-official-video-/","sites":["facebook","twitter","linkedin","pinterest","tumblr","googleplus","reddit"]}});</script>'}

The thing is, I only want

http://www.youtube.com/watch?v=QL4JFUHd71o

and

http://i1.ytimg.com/vi/QL4JFUHd71o/maxresdefault.jpg

which are the video and the image, respectively. How can I accomplish this? My code is not set in stone, and I don't mind changing it to make this work. Thanks in advance for any advice.

losee

1 Answer


If I understand correctly, you want to extract two values from your p.tube BeautifulSoup object. I'll call it soup to keep things simple.

First, I would strip the <script> tags by using the soup.text attribute.

Then I would use the regular expression package re (https://docs.python.org/2/library/re.html) to find .setup( and discard everything before it, and slice with -2 to drop the ); at the end

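import re
# re itself is a module and cannot be called; use re.search.
# The dot and the parenthesis are escaped so they match literally.
s = re.search(r"\.setup\(", soup)
# s.end() is a method: keep what follows ".setup(" and drop the trailing ");"
soup = soup[s.end():-2]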
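The trimming step can be tried on its own with a short sketch. It assumes the script text has the same shape as the output in the question (shortened here); script_text and config are hypothetical names used only for illustration.

```python
import re

# Shortened sample of the script text from the question (assumed shape).
script_text = (
    'jwplayer("video-setup").setup({'
    'file:"http://www.youtube.com/watch?v=QL4JFUHd71o",'
    'image:"http://i1.ytimg.com/vi/QL4JFUHd71o/maxresdefault.jpg",'
    'primary:"html5"});'
)

# Find the literal ".setup(" marker.
match = re.search(r"\.setup\(", script_text)

# Keep what follows the marker and drop the trailing ");".
config = script_text[match.end():-2]
print(config)
```

What is left is the {...} argument that was passed to setup, still as a plain string.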

and then, to transform your string into a dictionary, I would normally advise using ast.literal_eval (see "Convert a String representation of a Dictionary to a dictionary?").

Unfortunately (that would be too easy), your string is not formatted in a way that converts directly into a dictionary: keys such as file and image are not quoted, so it is not a valid Python literal.
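A small sketch of why this fails, assuming a shortened version of the string from the question: ast.literal_eval only accepts Python literals, and a bare key such as file is a name, not a literal. The variable names here are hypothetical.

```python
import ast

# Shortened sample of the setup argument (assumed shape, unquoted keys).
config = '{file:"http://www.youtube.com/watch?v=QL4JFUHd71o",primary:"html5"}'

try:
    ast.literal_eval(config)
    parsed_ok = True
except (ValueError, SyntaxError):
    # Unquoted keys are rejected because they are not literals.
    parsed_ok = False

print(parsed_ok)
```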

Therefore, I would strip the surrounding {} and split on commas

soup = soup[1:-1]
l = soup.split(',')

Since the elements you are looking for are the first two, you should find them easily at l[0] and l[1]. Each one still carries its key, for example file:"...", which has to be stripped off.
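The split step can be sketched as a small helper. It assumes the string starts with file and image in that order, as in the question's output, and that neither URL contains a comma; extract_video_and_image is a hypothetical name, not part of any library.

```python
def extract_video_and_image(config):
    """Return (video_url, image_url) from a '{file:"...",image:"...",...}' string."""
    # Drop the surrounding braces, then split into key:value chunks.
    parts = config[1:-1].split(',')
    # partition(':') splits on the first colon only, so "http://" stays intact.
    video = parts[0].partition(':')[2].strip('"')
    image = parts[1].partition(':')[2].strip('"')
    return video, image


# Shortened sample of the setup argument from the question (assumed shape).
config = (
    '{file:"http://www.youtube.com/watch?v=QL4JFUHd71o",'
    'image:"http://i1.ytimg.com/vi/QL4JFUHd71o/maxresdefault.jpg",'
    'primary:"html5"}'
)

video, image = extract_video_and_image(config)
print(video)
print(image)
```

This kind of helper belongs in scraper.py (for instance, called from youtube_link) rather than in the template, so that the template only receives the two plain strings.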

Albyorix
  • Hello, thanks for the response. But how would I make this fit my code? Your solution doesn't work with my code the way you have explained it, because p.tube (or soup, by your definition) is in the template, so I can't do this: import re s = re(".setup(", soup) soup = soup[s.end:-2] – losee Jul 17 '16 at 15:47
  • Could you take what you have and add my code to it? I don't know where to start. When I tried using s = re(".setup(", soup) I got an "re is not callable" error – losee Jul 17 '16 at 16:02
  • see my previous responses above – losee Jul 17 '16 at 16:09
  • I think I see the error, @Alby: you have re instead of re.search or re.compile. I don't think you can use re() by itself. In the documentation, re is always followed by a function – losee Jul 17 '16 at 16:37