I'm very new to web scraping (I know next to nothing about HTML and this is my first time using BeautifulSoup) and I'm making a program that essentially lets me generate PDFs or EPUBs for novels online. I'm not worried about compatibility with a wide variety of sites, since I'm just making this for me. I wrote the code that, starting from a link to any chapter of a web novel, gets the links for all of that novel's chapters and puts them into a list. However, this takes a long time: somewhere around a second for each link. Given that some novels have upwards of 1-2 thousand chapters, that's about half an hour just to get all the links, and at that point the program hasn't even fetched the body text of each link or compiled the chapters into PDFs. Is there a way I can make this code faster?

import requests
from bs4 import BeautifulSoup
def list_chapters():
    given_chapter = 'https://www.box-novel.com/novel/cannon-fodder-counterattack-system/chapter-4-1/'
    current_chapter = find_first_chapter(given_chapter)
    print("Starting chapter: ", current_chapter)
    link_list = []
    try:
        while True:
            link_list.append(current_chapter)
            r = requests.get(current_chapter)
            soup = BeautifulSoup(r.content, 'html.parser')
            s = soup.find('div', class_='nav-next')
            for link in s.find_all('a'):
                current_chapter = link.get('href')
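    except AttributeError:
        # soup.find() returned None: this page has no nav-next block, so the
        # previous link pointed past the last chapter. Drop that extra entry.
        link_list.pop(-1)
        print(len(link_list), "chapters detected.")
    return link_list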

Please let me know other ways to improve my code as well. Note: I pop the last value in the list because it's easier than detecting when nav-next points to manga-info, which is what nav-next references on the last chapter. Also, ignore the random trash novel link I used; it's the shortest one I could find on the first page.

2 Answers

0

If one request at a time takes too long, we should fire several of them at the same time!

How? Well, there are multiple options, but I'd stick to the aiohttp library, which does what requests does, but asynchronously.

Here's an example of using it, which I totally stole from another question:

import asyncio
import aiohttp
import time

websites = """https://www.youtube.com
http://www.chrome.com
http://www.booking.com
http://www.googleusercontent.com
http://www.google.com.au
http://www.popads.net
http://www.cntv.cn"""


async def get(url, session):
    # Fetch one URL and return its body so asyncio.gather() can collect it.
    try:
        async with session.get(url=url) as response:
            resp = await response.read()
            print("Successfully got url {} with resp of length {}.".format(url, len(resp)))
            return resp
    except Exception as e:
        print("Unable to get url {} due to {}.".format(url, e.__class__))
        return None


async def main(urls):
    async with aiohttp.ClientSession() as session:
        ret = await asyncio.gather(*[get(url, session) for url in urls])
    print("Finalized all. Return is a list of len {} outputs.".format(len(ret)))


urls = websites.split("\n")
start = time.time()
asyncio.run(main(urls))
end = time.time()

print("Took {} seconds to pull {} websites.".format(end - start, len(urls)))
Daniel
  • Wow, thanks for the fast response. I was already planning to use multithreading, but this might be better. Just have to get over being too lazy to implement async. Any suggestions on how I can split up the processes while not losing the order of the list? Especially since I don't actually know what the links are or how many there are? – Renni Stewart Aug 13 '22 at 02:44
  • Each task can take a different amount of time, so the requests themselves can complete in any order, but `asyncio.gather` returns the results of all tasks in the original order, because it waits for them all to complete. Of course, every task passed to `asyncio.gather` must return something. **BUT** this won't work with async, multithreading, or multiprocessing, because you have to load a page fully to get the link to the next page, so an asynchronous/parallel approach simply won't work here. – Daniel Aug 13 '22 at 03:03
  • What I suggest is to get all the Manga chapters (https://www.box-novel.com/novel/cannon-fodder-counterattack-system) and then process them in any order, because it won't matter. But the thing is the site doesn't show all the chapters at once so that you can get all the links at once, it uses lazy loading, which is not logged anywhere in the requests when expanded, and it can't be called without JS support, so requests won't work. I'd use Selenium to open the manga page, expand the list of chapters via JavaScript and pull out all the links, and then pass them to aiohttp. – Daniel Aug 13 '22 at 03:08
  • I'm able to get all the chapters using a method someone else posted. Do you have any recommendations for how I can order them? I'm able to get the chapter number using web scraping, which is enough for most novels, but some go by books, like book 3 chapter 20, and some have parts to each chapter like the example used above, which destroys my ability to just get an int literal and sort a dataframe. – Renni Stewart Aug 20 '22 at 09:52
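To illustrate the point about `asyncio.gather` keeping the original order, here is a minimal sketch that needs no network access. `fake_fetch` is a hypothetical stand-in for a real request; it only sleeps for a random time so that the tasks finish in a scrambled order.

```python
import asyncio
import random


async def fake_fetch(index):
    # Hypothetical stand-in for an HTTP request: sleep for a random, short
    # time so that the tasks complete in an unpredictable order.
    await asyncio.sleep(random.uniform(0.0, 0.05))
    return index


async def fetch_all(count):
    # gather() returns results in the order the awaitables were passed in,
    # not in the order in which they happened to finish.
    return await asyncio.gather(*(fake_fetch(i) for i in range(count)))


results = asyncio.run(fetch_all(10))
print(results)
```

This only helps once the full list of URLs is known up front. It does not remove the limitation described above for a crawl that has to follow one "next" link at a time.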
0

Your task is non-trivial. First, the links to all chapters are loaded via an AJAX POST request on that entry-point page. Once you have sorted that out, you need a robust async solution: something that can handle a list of a billion links and still run on a Raspberry Pi, so you need some concept of a queue. The following takes approximately 10 seconds and returns a dataframe with the title and content of each of the 90 chapters of the novel (which you can then sort by title, if you want):

import asyncio
from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime


pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## run this if you're executing the code in a notebook
import nest_asyncio
nest_asyncio.apply()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
#### setup some sort of mock persistence ###
big_df_list = []

#### async scrape funcs ####
def all_chapters_urls():
    url_list = []
    payload = {
        'action': 'manga_get_reading_nav',
        'manga': '1987979',
        'chapter': 'chapter-29-7',
        'volume_id': '0',
        'type': 'content'
              }
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        r = client.post('https://www.box-novel.com/wp-admin/admin-ajax.php', data = payload)
        soup = BeautifulSoup(r.text, 'html.parser')
        links = soup.select_one('select.c-selectpicker.selectpicker_chapter.selectpicker.single-chapter-select').select('option')
        for l in links:
            url_list.append(l.get('data-redirect'))
    return url_list
            
async def get_chapters(url):
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = BeautifulSoup(r.text, 'html.parser')
            title = soup.select_one('h1#chapter-heading').get_text(strip=True)
            text_content = soup.select_one('div.text-left').get_text(strip=True)
            big_df_list.append((title, text_content))
        except Exception as e:
            print(url, e)

async def scrape_chapters():
    start_time = datetime.now()
    tasks = asyncio.Queue()
    for x in all_chapters_urls():
        tasks.put_nowait(get_chapters(x))

    async def worker():
        while not tasks.empty():
            await tasks.get_nowait()
            
    await asyncio.gather(*[worker() for _ in range(20)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('chapters scraping took', duration)

asyncio.run(scrape_chapters())
df = pd.DataFrame(big_df_list, columns = ['Chapter', 'Content'])
print(df)

This will return in terminal:

chapters scraping took 0:00:10.991827
Chapter Content
0   Cannon Fodder Counterattack System - Chapter 30.1   The power of gossip was never been underestimated. Huang Dezheng’s reputation for kind and charismatic was far-reaching. His neighbours recognized him. The original impression of him was quite good, but he did not expect that he would be well-known not only in public but also in private. Especially messing about with your own students!Seeing his white and tender student being dragged by him, notice the way he couldn’t even walk properly. Hehe! What a scumbag!The gossipy neighbours recalled the scene they saw through their door’s peepholes and were still amazed. There was no way. At that time, the two of them were getting intimate, there was still energy to pay attention to whether the door was open, wasn’t there?Huang Dezheng did not notice this little detail when he left with Su Yibai in anger. The time he realized this, it was already several days later.The campus forum calmness of the past was swept away with an earthquake. The entire page layout was filled by posts with similar titles! Among them, the top one was the most eye-catching and popular!“During the 18th of August, School grass[1] Su and Teacher Huang’s cohabitation dog blood drama, here are the pictures and truth”Huang Dezheng, who was passing by his colleague’s computer, inadvertently caught a glimpse of this thick red line of words, and his heart jerked. He quietly held his breath as he returned to his office. His face paled as he entered into the forum he had previously scorned. With trembling hands, he opened the very hot post.“It is said that the landlord was shocked when he heard this. He was not familiar with the school, but the teacher Huang’s reputation in the school was very good. How could it be that he did not close the door and even did it with a student? What a scum?! But there are pictures of the truth, so it was not nonsense, the pictures are linked below.”“Fu*k! 
It turned out to be true!!!”“The soft and cute school grass together with the male god! Look at the hickey on the neck! Fu*k! It’s too intense! Teacher Huang bao dao wei lao[2]!!”“After examining the pictures, it truly hasn’t been photo-shopped… Fu*k! What a scumbag!!”“It should be true… School grass Su never returned to the dormitory and stayed outside, so it turned out…”“To help the landlord add fire, the photos were taken by a friend who went to the nightclub to play”” It turns out that Su Xuedi[3] is like this in private! Look at the half-covered chest, the creamy thighs! No wonder Teacher Huang This white flower has a half-covered chest and a chest, and the trough is still pink!! No wonder Huang teacher doesn’t love Jiangshan beauties!!”“Wow, there’s a reason the number of people who never go to class is so high. With these two pictures, it seems like our Su Xuedi’s eyes are not very good!”“…”Huang Dezheng looked at the increasingly unsightly text and pictures on the computer screen, his whole body was shaking in anger!Who was it?! Who did he offend for him to be framed so viciously?!He immediately left a message asking the moderator to delete the post, but it didn’t take long for the message that didn’t hide his identity to completely detonate the entire forum!Fu*k the person involved actually appeared!!!The forum was boiling with this additional drama and Huang Dezheng got so angry that his liver began to ache. Not only were the posts not deleted, but his message was even re-posted with screenshots!These students were really shameless![1]School grass: most handsome guy in school. For the opposite gender it would be school flower.[2] Bao dao wei loa: Old but still vigorous. I think that explains it.[3] Xuedi: junior or younger male school mate.(Visited 1 times, 1 visits today)
1   Cannon Fodder Counterattack System - Chapter 29.7   Qin Shiyue rushed back to the house without saying a word, he was tempted to blow up, but he was afraid of hurting the stupid rabbit, so he kept suppressing it.Ye Si Nian also did not say a word, and when he got home, he went into the bathroom without saying anything.The more he thought about the more frustrated he was! Qin Shiyue was tense like a trapped beast as he moved about in the study. The desk was already in chaos, and there were scattered documents on the floor.Just as his anger was reaching the apex, the study door was opened, and the stupid rabbit who had just taken a bath with a towel around his body leisurely walked in.His body was covered with a thin layer of tight and well-proportioned muscles. The skin was fair and smooth, the waist, thin but not weak. At first glance, it was full of explosive power.His eyes glided uncontrollably as he observed the man’s movement. Qin Shiyue was frozen in place, his heart almost stopped beating, and a thought flashed in his mind flashed that allowed him to recover his heartbeat whose speed soared to the limit.Ye Si Nian was getting closer and closer, and Qin Shiyue, who only had a theoretical experience, wanted to step forward into his (Ye Si Nian’s)arms, but Qin Shiyue’s brain was blank, and he didn’t know where to start…Intensely attracted to his lover who was stunned, he pressed his naked and exposed skin on the man’s thin shirt and gently rubbed on them.The man’s reaction was very interesting. Ye Si Nian pursed his lips and pushed the man slightly on his shoulder to make him sit down on the large chair.Smiling as Qin Shiyue raised his head to look up at him, Ye Si Nian’s index finger hooked up his chin and he bent to kiss the tense tightly-close thin lip.Effortlessly prying his lover’s lips open, Ye Si Nian invaded his soft tongue constantly wreaking havoc in Qin Shiyue’s mouth. 
He licked and played with Qin Shiyue’s sensitive mouth before his lover finally reacted.The breathing became more intense, his lover’s strength also increased, Ye Si Nian hummed and pulled away from Qin Shiyue’s mouth and gently licked his lower lip.“I want you, Qin Shiyue.”Looking at his lover’s suddenly large eyes, Ye Si Nian smiled smugly, kissing his earlobe and licking his ears he murmured slowly, “I want you… Qin Shiyue… I want you……”If one could hold back at this time, would he still be a man?!!Qin Shiyue slammed down Ye Si Nian’s thin waist, suppressing his desire. His voice was hoarse with craving, “Stupid rabbit, do you know that you are playing with fire?!”Ye Si Nian raised an eyebrow and replied to the question with action instead.(Visited 1 times, 1 visits today)
2   Cannon Fodder Counterattack System - Chapter 29.8   With his long leg stretched, Ye Si Nian sat on Qin Shiyue’s lap, lowering his head to nibble on his throat, he felt his slight trembling and repressed gasp. He flexibly untied his clothes and put his hands on the well-defined chest.No longer be a man!!Qin Shiyue made a beast-like roar and kissed Ye Si Nian’s fragile neck hard. The hands clinging behind him tore open Ye Si Nian’s towel.=======================The next afternoon Ye Si Nian sat up in bed sourly and examined the various traces all over his body. He was full of regrets.He really underestimated the enemy’s fighting power!The two personalities were frightening! They being virgins who were almost thirty years old was also dreadful! The combination of the two resulted in being tossed from yesterday afternoon to this morning was scary!!!When Qin Shiyue and Pei Yiyuan took turns in battle, who said that having a double personality was amazing? !!Complaining in his heart, Ye Si Nian saw the door being pushed open, and Pei Yiyuan came in with a gentle smile like a spring breeze.“Woken up? Are there any uncomfortable place in your body?” Pei Yiyuan went near the bed and knelt on one knee as he reached out and placed Ye Si Nian into his arms.“No.” Ye Si Nian gave a serious thought about it. He felt that the communication last night was really hearty and he enjoyed himself. It was normal for the muscles to be sore, and it was obvious that he was clean and dry now, so he decided to praise instead, “I felt very good last night!”“It will get better in the future!” The performance of the first time last night was affirmed. Pei YiYuan felt a little proud in his heart. He bowed to kiss Ye Si Nian’s lips. “Yes, Qin Shiyue wanted me to ask how you intend to deal with those two?”Speaking about the incident, the second personality was embarrassed to come out himself to ask. 
Ye Si Nian’s lips twitched and said: “I decided to sell the apartment.”“That’s it?” Pei Yiyuan raised his eyebrows, he also had no good feelings for the two people.“Don’t underestimate the power of gossip…” Ye Si Nian shook his head with a smile and said, “Otherwise, you just wait and see! Without me, they are well able to kill themselves!”“Then I’ll wait and see.” Pei Yiyuan’s arm wrapped around him as he lifted Ye Si Nian up to carry to the bathroom. He did not care and decided to change to a more important topic, “I just went out for a walk and bought your favourite. Porridge…”(Visited 1 times, 1 visits today)
[...]
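Because the workers above append to a shared list as they finish, the rows come back in completion order rather than reading order (Chapter 30.1 is listed before 29.7 in the output). One possible variation, sketched here under assumptions, is to queue each URL together with its position and write the result into a pre-sized list. `fake_get_chapter` and the `example.invalid` URLs are hypothetical placeholders for the real request and parsing step, and the sketch assumes the URL list is already in the order you want.

```python
import asyncio


async def fake_get_chapter(url):
    # Hypothetical stand-in for the real request + parsing step.
    await asyncio.sleep(0.01)
    return (f"title of {url}", f"content of {url}")


async def scrape_in_order(urls, worker_count=5):
    # Queue (position, url) pairs instead of bare coroutines.
    queue = asyncio.Queue()
    for position, url in enumerate(urls):
        queue.put_nowait((position, url))

    # One slot per URL; each worker fills in the slot it was handed.
    results = [None] * len(urls)

    async def worker():
        while True:
            try:
                position, url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            try:
                results[position] = await fake_get_chapter(url)
            except Exception as exc:
                # A failed chapter leaves None in its slot.
                print(url, exc)

    await asyncio.gather(*(worker() for _ in range(worker_count)))
    return results


urls = [f"https://example.invalid/chapter-{n}" for n in range(1, 13)]
chapters = asyncio.run(scrape_in_order(urls))
print(len(chapters), "chapters, first:", chapters[0][0])
```

The number of workers still caps how many requests are in flight at once, as in the original queue, while the index keeps the final list aligned with the input list.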
Barry the Platipus
  • Thank you so much for the detailed response, however I can't get this to work. It says that headers is an unresolved reference. It might be worth mentioning that I'm using Linux (specifically kubuntu 22.04) – Renni Stewart Aug 14 '22 at 01:05
  • did you define the headers as above? – Barry the Platipus Aug 14 '22 at 01:20
  • I just have it set as header=header in the parameters like you, I'm not exactly sure how or when they're used, but python recognizes headers as a call from a non-built-in attribute called wsgiref.headers, but I don't think that's what you intended with the code above. – Renni Stewart Aug 14 '22 at 03:54
  • I uhh... I left out the instantiation... I thought it was some built-in attribute lol. This is embarrassing. – Renni Stewart Aug 15 '22 at 00:36
  • Sorry to ask this, but how can I sort this? Some of the chapters are not intuitively named: in the example used there are parts like 4-6, and some novels use x.y or have side chapters at the end. So although I'm able to get the chapter, it can't always be an int literal, and I don't know how to sort dataframes any other way. Is there a way to get the level of the chapter rather than what is stated as the chapter at a base level? I don't really understand that ajax thing you did. – Renni Stewart Aug 18 '22 at 16:18
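One possible way to approach the sorting question, offered as a sketch rather than a definitive answer: extract every run of digits from the title and compare the runs as a tuple of integers, so that "29.7" is treated as chapter 29, part 7. `chapter_sort_key` is a hypothetical helper, and the "29.10" and "4.6" titles below are made-up examples, not real scraped data. It assumes the novel's name itself contains no digits and will not place unnumbered side chapters meaningfully.

```python
import re


def chapter_sort_key(label):
    # Collect every run of digits and compare them as integers, so that
    # "Chapter 29.10" sorts after "Chapter 29.7" (part 10 after part 7).
    return tuple(int(number) for number in re.findall(r"\d+", label))


titles = [
    "Cannon Fodder Counterattack System - Chapter 30.1",
    "Cannon Fodder Counterattack System - Chapter 29.7",
    "Cannon Fodder Counterattack System - Chapter 29.10",  # made-up example
    "Cannon Fodder Counterattack System - Chapter 4.6",    # made-up example
]
ordered = sorted(titles, key=chapter_sort_key)
for title in ordered:
    print(title)
```

With the list of `(title, content)` tuples from the answer above, the same key could be applied before building the dataframe, for example `sorted(big_df_list, key=lambda row: chapter_sort_key(row[0]))`.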