Requests making changes while downloading the HTML

Question

I've been trying to download a webpage in HTML, and then go through it looking for a link. However, it does not work, since I've noticed that the link somehow changes when I download the page.

This is the webpage I want to use python to download something from: https://b-ok.asia/book/4201067/7cd79d

When I open the page source of that webpage on my browser I can easily see the bit that I need: <a class="btn btn-primary dlButton addDownloadedBook" href="/dl/4201067/b9ffc6" target="" data-book_id="4201067" rel="nofollow"> (I need that /dl/..../.... bit)

However, when I try to use this code to get it using python, it does not work:

import requests
booklink="https://b-ok.asia/book/4201067/7cd79d"
downpage=requests.get(booklink, allow_redirects=True).text
print(downpage)
z=downpage.find("/dl/")
print(downpage[z:z+18])
dllink="https://b-ok.asia"+downpage[z:z+18]
print(dllink)

Here, downpage[z:z+18], which should have been "/dl/4201067/b9ffc6", instead comes out to be "/dl/4201067/89c216". I have absolutely no idea where this new number came from. When I use this, it brings me back to the original page which had the download link.

Can anyone help me out as to how to go about doing this?

@arundeepchohan I'm probably being really dense here (I'm a beginner) but could you explain how .find() works with tags, and how I can get this to work? — DorianGray, Sep 08 '20 at 09:41

score 1 · Accepted Answer · answered Sep 09 '20 at 21:03


I guess you want to download the book. The website changes the URL to prevent people from linking to it directly, presumably by tying it to cookies or session cookies. If you use a session from requests, it keeps your cookies from one request to the next, and you can download the book. The code below saves the book to book.epub in the directory you run the script from.

import requests
import shutil
from bs4 import BeautifulSoup

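# A Session keeps cookies between requests, so the link found on the
# book page can be used for the download that follows.
sess = requests.Session()
req = sess.get('https://b-ok.asia/book/4201067/7cd79d')
soup = BeautifulSoup(req.content, 'html.parser')
link = soup.find('a', {'class': 'btn btn-primary dlButton addDownloadedBook'})['href']
# Fetch the file with the same session and stream it to disk.
with sess.get(f'https://b-ok.asia{link}', stream=True) as req2:
    with open('./book.epub', 'wb') as file:
        shutil.copyfileobj(req2.raw, file)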

Dan-Dev


This worked like a charm the first time. It got me an epub file of 444kb in the python directory, and it opened just fine. However when I ran the code a second time, with the same link but having changed "'./book.epub', 'wb'" to "'./book2.epub', 'wb'", it did not work. It made another file called book2, yes, but it was a 5 kb file, and gave an error when I tried to open it using my epub reader. – DorianGray Sep 10 '20 at 06:20
By 'with the same link', do you mean you didn't fetch a fresh link and then download the file using the same session? You cannot reuse the link outside the session. – Dan-Dev Sep 10 '20 at 11:59
No, I simply ran the code again as it is, just changing ./book.epub to ./book2.epub. I would have expected it to make another file, similar in every aspect but the name, in the same directory. But instead, the 'book2' file was not valid. Am I missing something obvious? – DorianGray Sep 10 '20 at 12:36
Strange. I have changed the output file name 5 times and run it successfully after each change. I assume ./book2.epub didn't exist before? – Dan-Dev Sep 10 '20 at 13:25
Nope, there was no prior file named book2.epub. Also, for that matter, when I ran it again with book.epub as the file name, it replaced the good 444kb file with the weird 5 kb one (I had copied the good file to another location before trying this). – DorianGray Sep 10 '20 at 13:32
In fact, right now I just copied your code again and ran it (the same code which had worked the first time) and it didn't work (it gave me that 5kb file that my reader can't open). – DorianGray Sep 10 '20 at 13:35
I opened it in notepad, and it gave me a whole bunch of special characters: ‹ í – DorianGray Sep 10 '20 at 13:47
What is your platform? I'll try to reproduce it – Dan-Dev Sep 10 '20 at 13:50
I'm on a Dell G7 running Windows 10 2004 if that's what you wanted? – DorianGray Sep 10 '20 at 14:35
Also, I have python 3.8 installed if that matters here. – DorianGray Sep 10 '20 at 16:58
I managed to reproduce it on the 6th attempt: it returns a webpage with the text "WARNING: There are more than 5 downloads from your IP xxx.xxx.xxx.xxx during last 24 hours" – Dan-Dev Sep 10 '20 at 18:43
Oh yes. Silly of me to forget that. Sorry! Probably the first time I tried your program, it was my fifth download from that site for the day, so it did not work after that. I'll try it again in a while and let you know! – DorianGray Sep 11 '20 at 03:20
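
Following up on the comments above: since the site can answer a download request with an HTML warning page instead of the book, it may help to check what was actually received before treating it as an epub. The snippet below is only a rough sketch under stated assumptions. `looks_like_html` is a hypothetical helper name, not part of requests, and the check is a heuristic. The gzip branch is a guess: the "‹" seen in Notepad would be consistent with the first bytes of a gzip stream (0x1f 0x8b), which could happen if the raw stream is written without decoding, but nothing in the thread confirms that.

```python
import gzip
import zlib


def looks_like_html(body: bytes, content_type: str = "") -> bool:
    """Heuristic: return True if the payload is probably a web page, not a book file."""
    # A Content-Type header of text/html is the clearest signal, when available.
    if content_type.lower().startswith("text/html"):
        return True
    # Assumption: the body may still be gzip-compressed (magic bytes 0x1f 0x8b).
    if body[:2] == b"\x1f\x8b":
        try:
            body = gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            # Not a complete, valid gzip stream; treat it as "not HTML".
            return False
    # Look for an HTML opening near the start of the (possibly decompressed) body.
    head = body[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

One possible use is to read the saved file back (or check `req2.headers.get('Content-Type', '')` before saving) and print the page text instead of keeping a broken book.epub when the function returns True.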

score 0 · Answer 2 · answered Sep 08 '20 at 09:57


To simply get the href attribute, you can use .find() to get the a tag with that class.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://b-ok.asia/book/4201067/7cd79d')
if r.status_code != 200:
    print("Error fetching page")
    raise SystemExit(1)

soup = BeautifulSoup(r.content, 'html.parser')
print(soup)

# Find the download button by its class attribute and print its href.
z = soup.find('a', {'class': 'btn btn-primary dlButton addDownloadedBook'})
print(z['href'])
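
As a side note on the lookup itself: slicing a fixed 18 characters after "/dl/" or matching the full class string both break if the markup changes slightly. Below is a minimal sketch of the same lookup using only the standard library's `html.parser` rather than BeautifulSoup, matching on the single class `dlButton`. The names `DownloadLinkParser` and `find_download_href` are made up for illustration, and the assumption that `dlButton` alone identifies the download link comes only from the tag quoted in the question.

```python
from html.parser import HTMLParser


class DownloadLinkParser(HTMLParser):
    """Collects the href of the first <a> tag whose class list contains 'dlButton'."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Only look at <a> tags, and stop once a link has been found.
        if tag != "a" or self.href is not None:
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "dlButton" in classes:
            self.href = attr.get("href")


def find_download_href(html: str):
    """Return the download href found in the HTML text, or None if there is none."""
    parser = DownloadLinkParser()
    parser.feed(html)
    return parser.href
```

Note that this only changes how the link is extracted. The value still depends on which request fetched the page, so it would need to be used with the same session that downloaded the HTML.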


Arundeep Chohan


Here's the issue: the output I get from this code is /dl/4201067/89c216 (I was getting the same from mine). But if you go to the link and look at the page source, you'll see that it is actually /dl/4201067/b9ffc6. I have no clue why it changes. – DorianGray Sep 08 '20 at 10:03
bs4 doesn't run JavaScript, so you could use something like Selenium to get past that. – Arundeep Chohan Sep 08 '20 at 10:17
https://stackoverflow.com/questions/36129963/use-beautifulsoup-to-obtain-view-element-code-instead-of-view-source-code – Arundeep Chohan Sep 08 '20 at 10:17
