I'm downloading books from a website, and my code runs without errors, but when I try to open the downloaded PDF book on my PC, Adobe Acrobat Reader reports that the file type is not supported.

Error Image

Here is an image of the book format. I'm sure my code needs a correction, because the format of the books on the website is different from normal PDF files.

Book Format Image

Code:

import requests
from bs4 import BeautifulSoup
url = 'https://global.oup.com/education/support-learning-anywhere/key-resources-online/?region=international&utm_campaign=learninganywhere&utm_source=umbraco&utm_medium=display&utm_content=support_learning_key_resources&utm_team=int#Primary'

response = requests.get(url)
soup     = BeautifulSoup(response.content, 'html.parser')
table_data = soup.find_all('td')

books_url_list = []
for link in table_data:
    books_url = link.find('a')['href']
    books_url_list.append(books_url+'.pdf')
    
book = books_url_list[1]
book_response = requests.get(book)

with open('books.pdf', 'wb') as f:
    f.write(book_response.content)


furas
Awan
  • Check [this](https://stackoverflow.com/a/55919705/10824407) answer, it could be helpful. – Olvin Roght Sep 13 '20 at 14:20
  • The book links do not point to a PDF. Each link leads to another webpage, which then displays the book stored on the server, so `book_response` contains the raw HTML of the page showing the book, not the PDF content of the book. – joshmeranda Sep 13 '20 at 14:23
  • If you inspect the website, you can see that there are no PDFs that you can scrape. They are displayed as svgz files. See an example [here](http://p.calameoassets.com/200406174918-271f79d0ce92452de86df83977cbb8e0/p3.svgz). You could try to convert them with svglib. – runDOSrun Sep 13 '20 at 14:25
  • If you open the pdf file in notepad, you will see `access denied` – Mike67 Sep 13 '20 at 14:26
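
The comments above suggest that the saved file holds HTML or an error message rather than PDF data. One quick way to confirm this is to look at the first bytes of the file, since a PDF normally starts with the signature `%PDF-`. The following is only a minimal sketch; the helper names `looks_like_pdf` and `file_looks_like_pdf` are hypothetical and not part of the original code.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(data):
    """Return True if the given bytes start with the PDF signature."""
    return data.startswith(PDF_SIGNATURE)


def file_looks_like_pdf(path):
    """Read only the first bytes of a file and check them for the PDF signature."""
    with open(path, "rb") as f:
        return looks_like_pdf(f.read(len(PDF_SIGNATURE)))
```

For the code in the question, `looks_like_pdf(book_response.content)` could be checked before writing `books.pdf`. If it returns False, the response is something other than a PDF, which would explain the error from Adobe Acrobat Reader.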

1 Answer

Well, I inspected the elements of the website and found no '.pdf' files. You can inspect one book page using the following link: https://en.calameo.com/read/000777721d10096b9e9ca?authid=gWc48kAQQoD0&region=international

After inspecting the element, I found that it is not a PDF. Each page of the book is just an image:

https://p.calameoassets.com/200406174654-2bfa9441783e162c8da42a712feda3e2/p1.svgz

https://p.calameoassets.com/200406174654-2bfa9441783e162c8da42a712feda3e2/p2.svgz

....

https://p.calameoassets.com/200406174654-2bfa9441783e162c8da42a712feda3e2/p98.svgz

And so on.

So, you can write code to download these images.
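
As a rough sketch of that idea, the page URLs listed above follow the pattern `p1.svgz`, `p2.svgz`, and so on up to `p98.svgz`, so they can be generated in a loop and saved one by one. This assumes the server allows direct downloads, which one of the comments on the question suggests may not always be the case. The function names `page_urls` and `download_pages` and the output directory are hypothetical, and the sketch uses the standard library's `urllib.request` instead of `requests` so that it has no extra dependencies.

```python
import os
import urllib.request

# Base URL copied from the page image links listed above.
BASE_URL = "https://p.calameoassets.com/200406174654-2bfa9441783e162c8da42a712feda3e2"


def page_urls(base_url, page_count):
    """Build the URL of every page image, p1.svgz through p<page_count>.svgz."""
    return [f"{base_url}/p{n}.svgz" for n in range(1, page_count + 1)]


def download_pages(base_url, page_count, out_dir):
    """Save each page image into out_dir and return the list of saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in page_urls(base_url, page_count):
        # Name the local file after the last URL segment, e.g. "p1.svgz".
        path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        with urllib.request.urlopen(url) as response, open(path, "wb") as f:
            f.write(response.read())
        saved.append(path)
    return saved
```

Calling `download_pages(BASE_URL, 98, "book_pages")` would try to fetch all 98 pages of this book into a `book_pages` folder. The files are saved exactly as the server sends them; turning them into a single PDF is a separate step, for which the svglib suggestion from the comments may be a starting point.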

Shmn
0x0ffff