
I'm scraping news articles from this site https://nypost.com/search/China+COVID-19/page/2/?orderby=relevance. I used a for-loop to get the content of each news article, but I wasn't able to combine the paragraphs for each article. My goal is to store each article as a single string, and to store all of those strings in the myarticle list.

When I print(myarticle[0]), it gives me all the articles. I expect it to give me just one article.

Any help would be appreciated!
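
In other words, the result I'm after looks roughly like this (just an illustration of the shape, not real output):

    myarticle = [
        "text of the first article as one string ...",
        "text of the second article as one string ...",
    ]
    print(myarticle[0])   # only the first article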

            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                containerr = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
                articletext = containerr.find_all('p')
                for paragraph in articletext:
                    #get the text only
                    text = paragraph.get_text()
                    paragraphtext.append(text)
                    
                #combine all paragraphs into an article
                thearticle.append(paragraphtext)
            # join paragraphs to re-create the article 
            myarticle = [''.join(article) for article in thearticle]
    
    print(myarticle[0])

For clarification purposes, the full code is attached below:

import requests
from bs4 import BeautifulSoup as bs
from time import time, sleep
from random import randint
from warnings import warn
from IPython.display import clear_output   # clear_output assumes a Jupyter/IPython environment

def scrape(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    urls = [f"{url}{x}" for x in range(1,2)]
    params = {
       "orderby": "relevance",
    }
    pagelinks = []
    title = []
    thearticle = []
    paragraphtext = []
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params) 
        # controlling the crawl-rate
        start_time = time() 
        #pause the loop
        sleep(randint(8,15))
        #monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait = True)

        #throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))

        #Break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of request was greater than expected.')
            break


        #parse the content
        soup_page = bs(response.text, 'lxml') 
        #select all the articles for a single page
        containers = soup_page.findAll("li", {'class': 'article'})
        
        #scrape the links of the articles
        for i in containers:
            url = i.find('a')
            pagelinks.append(url.get('href'))
        #scrape the titles of the articles
        for i in containers:
            atitle = i.find(class_ = 'entry-heading').find('a')
            thetitle = atitle.get_text()
            title.append(thetitle)
            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                containerr = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
                articletext = containerr.find_all('p')
                for paragraph in articletext:
                    #get the text only
                    text = paragraph.get_text()
                    paragraphtext.append(text)
                    
                #combine all paragraphs into an article
                thearticle.append(paragraphtext)
            # join paragraphs to re-create the article 
            myarticle = [''.join(article) for article in thearticle]
    
    print(myarticle[0])
print(scrape('https://nypost.com/search/China+COVID-19/page/'))
Yue Peng
  • Can you add the rest of the code so we can check (url, etc.)? – johnashu May 19 '20 at 10:36
  • Sure, I updated the description. Please check :) – Yue Peng May 19 '20 at 10:40
  • What are clear_output and warn? Where do they come from? – johnashu May 19 '20 at 10:45
  • I had an error message 'max retries exceeded'; clear_output and warn are there to deal with that. Later on I will scrape many more pages, where clear_output and warn could help, but they are not relevant to this question, so you can ignore them :) – Yue Peng May 19 '20 at 10:53

1 Answer


You keep appending to the same existing lists, so they keep growing; you need to clear them on every loop iteration.

    articletext = containerr.find_all('p')
    for paragraph in articletext:
        #get the text only
        text = paragraph.get_text()
        paragraphtext.append(text)

    #combine all paragraphs into an article
    thearticle.append(paragraphtext)
# join paragraphs to re-create the article 
myarticle = [''.join(article) for article in thearticle]

Should be something like

    articletext = containerr.find_all('p')
    thearticle = [] # clear from the previous loop
    paragraphtext = [] # clear from the previous loop
    for paragraph in articletext:
        #get the text only
        text = paragraph.get_text()
        paragraphtext.append(text)

    thearticle.append(paragraphtext)
    myarticle.append(thearticle)
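
To see why your original version prints every article for myarticle[0]: paragraphtext stays the same list object the whole time, so every entry you append to thearticle is just another reference to that one growing list. A tiny standalone sketch of that behaviour (plain lists, nothing site-specific):

    collected = []
    results = []
    for word in ["first", "second", "third"]:
        collected.append(word)       # never cleared, keeps growing
        results.append(collected)    # stores a reference to the SAME list
    print(results[0])                # ['first', 'second', 'third'] -- everything, not just 'first'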

But you could simplify it more to:

article = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
myarticle.append(article.get_text())
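
If you do need the individual p tags (as in your longer version), the same idea works as long as the lists are created fresh inside the loop. A rough sketch, assuming the requests/bs imports and the pagelinks list from your code, and looping over pagelinks only once (outside the containers loop) so you end up with one string per article:

    myarticle = []
    for pagelink in pagelinks:
        page = requests.get(pagelink)
        soup = bs(page.text, 'lxml')
        container = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
        if container is None:   # skip pages without an article body
            continue
        # fresh list for every article
        paragraphtext = [p.get_text() for p in container.find_all('p')]
        myarticle.append(' '.join(paragraphtext))   # or ''.join(...) to match your original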
Niels van Reijmersdal
  • Hi Niels, do you have any clue why the length of **myarticle** is 100 if I use your approach? It should be 10. – Yue Peng May 20 '20 at 10:04
  • Yes: you loop over the containers (probably 10) and then, inside that, loop over the page links (probably also 10), so it walks through the 10 page links 10 times. Research step-by-step debugging and learn how to analyse what your code does. – Niels van Reijmersdal May 20 '20 at 10:24
  • The simplified code works for me, but the longer version doesn't seem to work. I need to use the longer version because on another website I'm scraping, 'entry-content' contains irrelevant elements, and I need to find 'p' under 'entry-content'. I've tried to debug it but I'm not able to. Could you please check which step is wrong in the code (under "Should be something like")? If I use that code, it returns [None] [None, None] [None, None, None] – Yue Peng May 21 '20 at 22:06
  • The longer version was just to show that you need to clear the lists if you want to reuse them in loops. You probably get None because the p elements it finds have no text; that is hard to say without seeing the "another" website. Again, I would urge you to learn step-by-step debugging, so you can see the site, the code, and the values of the variables next to each other as the code runs. – Niels van Reijmersdal May 21 '20 at 22:13
  • It seems append() does not return its value; I updated the longer code. – Niels van Reijmersdal May 22 '20 at 13:31
  • Thanks for updating. It works for me! Sure I will definitely learn step-by-step debug. :) – Yue Peng May 22 '20 at 14:34