I am a newbie following the web scraping examples from Automate the Boring Stuff. I am trying to automate downloading images from PHD Comics with a single Python script that will

  • find the link to the image in the HTML and download it

  • find the link to the previous page in the HTML and go there, repeating step 1 until the very first page.

To download the current page's image, the relevant segment of the HTML printed by soup.prettify() looks like this:

<meta content="Link to Piled Higher and Deeper" name="description">
 <meta content="PHD Comic: Remind me" name="title">
  <link 
href="http://www.phdcomics.com/comics/archive/phd041218s.gif" rel="image_src">
   <div class="jumbotron" style="background-color:#52697d;padding: 0em 0em 0em;  margin-top:0px; margin-bottom: 0px; background-image: url('http://phdcomics.com/images/bkg_bottom_stuff3.png'); background-repeat: repeat-x;">
    <div align="center" class="container-fluid" style="max-width: 1800px;padding-left: 0px; padding-right:0px;">

and then when I write

newurl=soup.find('link', {'rel': "image_src"}).get('href')

it gives me what I need, which is

"http://www.phdcomics.com/comics/archive/phd041218s.gif"

In the next step I want to find the previous page link, which I believe is in the following part of the HTML:

<!-- Comic Table --!>
        <table border="0" cellspacing="0" cellpadding="0">
          <tr> 
            <td align="right" valign="top">
            <a href=http://phdcomics.com/comics/archive.php?comicid=2004><img height=52 width=49 src=http://phdcomics.com/comics/images/prev_button.gif border=0 align=middle><br></a><font 
                face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>previous </b></i></font><br><br><a href=http://phdcomics.com/comics/archive.php?comicid=1995><img src=http://phdcomics.com/comics/images/jump_bck10.gif border=0></a><br><a href=http://phdcomics.com/comics/archive.php?comicid=2000><img src=http://phdcomics.com/comics/images/jump_bck5.gif border=0></a><br><font face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>jump</b></i></font><br><br><a href=http://phdcomics.com/comics/archive.php?comicid=1><img src=http://phdcomics.com/comics/images/first_button.gif border=0 align=middle><br></a><font face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><i><b>first</b></i></font><br><br>               </td>
            <td align="center" valign="top"><font color="black"> 

From this part of the code I want to find

http://phdcomics.com/comics/archive.php?comicid=2004

as my previous link. When I try something like this:

Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
print(Prevlink)

it gives me an error like this-

Prevlink=soup.find('a',{'src': 'http://phdcomics.com/comics/images/prev_button.gif'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'

Even when I try to do this-

Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)

I get a similar error:

Prevlink=soup.find('a',{'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
AttributeError: 'NoneType' object has no attribute 'get'

What should be the right way to get the right 'href'? TIA

1 Answer

The problem is the way comments are written in the HTML of PHD Comics. If you look closely at the output of soup.prettify(), you will find comments like this:

<!-- Comic Table --!>

when it should be,

<!-- Comic Table -->

This malformed terminator causes BeautifulSoup to miss certain tags. There are other ways to parse and remove comments, such as a regex or bs4's Comment class, but they might be difficult to get working in this case. The easiest way is to fix the comment terminators in the raw HTML after downloading it and before parsing it.

from bs4 import BeautifulSoup
import requests

url = "https://phdcomics.com/"
r = requests.get(url)
data = r.text
data = data.replace("--!>", "-->")  # fix malformed comment terminators
soup = BeautifulSoup(data)
Prevlink = soup.find('a', {'href': 'http://phdcomics.com/comics/archive.php?comicid=2004'}).get('href')
print(Prevlink)
# Output:
# http://phdcomics.com/comics/archive.php?comicid=2004
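
The replacement can also be tried offline on a small inline snippet. This is only a minimal sketch: SAMPLE_HTML is a shortened, hand-written stand-in for the real page, fix_comments and find_href are hypothetical helper names, and "html.parser" is an assumed parser choice. It also guards against find() returning None, which is what raised the AttributeError in the question.

```python
from bs4 import BeautifulSoup

# Shortened, hand-written stand-in for the page source (not the real page).
SAMPLE_HTML = """
<!-- Comic Table --!>
<table><tr><td>
<a href="http://phdcomics.com/comics/archive.php?comicid=2004"><img src="http://phdcomics.com/comics/images/prev_button.gif"></a>
</td></tr></table>
"""


def fix_comments(html):
    """Rewrite the malformed '--!>' comment terminator as '-->'."""
    return html.replace("--!>", "-->")


def find_href(html, href):
    """Return the href of the first matching <a>, or None if none matches."""
    soup = BeautifulSoup(fix_comments(html), "html.parser")
    tag = soup.find("a", {"href": href})
    # find() returns None on a miss, so check before calling .get()
    return tag.get("href") if tag is not None else None


if __name__ == "__main__":
    print(find_href(SAMPLE_HTML, "http://phdcomics.com/comics/archive.php?comicid=2004"))
```

Returning None on a miss lets the caller decide what to do, instead of failing with "'NoneType' object has no attribute 'get'".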

Update: To find the requested link automatically, locate the img tag whose src is "http://phdcomics.com/comics/images/prev_button.gif" and extract the href from its parent element.

img_tag = soup.find('img', {'src': 'http://phdcomics.com/comics/images/prev_button.gif'})
print(img_tag.find_parent().get('href'))
# Output:
# http://phdcomics.com/comics/archive.php?comicid=2005
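
Putting the two steps from the question together, a loop that walks backwards through the archive might look like the sketch below. Treat it as a rough outline under assumptions rather than a tested scraper for the live site: parse_page, crawl_back, PREV_BUTTON, and the max_pages limit are hypothetical names introduced here, "html.parser" is an assumed parser choice, and the button is matched by the end of its src rather than the full URL. The fetch argument is any callable that maps a URL to its HTML text, so the loop can be tried without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PREV_BUTTON = "prev_button.gif"  # file name of the "previous" button image


def parse_page(html):
    """Return (image_url, previous_page_url); either may be None."""
    # Repair the malformed comment terminators before parsing
    soup = BeautifulSoup(html.replace("--!>", "-->"), "html.parser")

    link = soup.find("link", {"rel": "image_src"})
    image_url = link.get("href") if link is not None else None

    # Match the button by the end of its src so the host part does not matter
    img = soup.find("img", src=lambda s: s is not None and s.endswith(PREV_BUTTON))
    anchor = img.find_parent("a") if img is not None else None
    prev_url = anchor.get("href") if anchor is not None else None
    return image_url, prev_url


def crawl_back(start_url, fetch, max_pages=10):
    """Follow 'previous' links from start_url and collect image URLs.

    Stops when a page has no previous link, when a URL repeats,
    or after max_pages pages have been visited.
    """
    images, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        image_url, prev_url = parse_page(fetch(url))
        if image_url:
            images.append(image_url)
        # urljoin also copes with relative links, should the site use them
        url = urljoin(url, prev_url) if prev_url else None
    return images
```

With requests, fetch could be something like lambda u: requests.get(u).text, and each collected URL could then be downloaded by writing requests.get(image_url).content to a file. The seen set and max_pages are there so the loop cannot run forever if the first page still links to itself.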
Shivam Singh
  • thanks a lot! that fixes the problem but now the question is how do I get 'http://phdcomics.com/comics/archive.php?comicid=2004' as output? because if I already know this, I wouldn't be looking for that. This is the page ID which will change depending on the page number but I need to somehow connect http://phdcomics.com/comics/images/prev_button.gif this one to the output as I believe this one is the image of the previous page button. again, thanks in advance! – Muhsiul Hassan Apr 26 '18 at 01:18
  • Updated the answer, hope it helps. – Shivam Singh Apr 26 '18 at 04:40