I want to scrape the text of 100 news articles using BeautifulSoup and a for loop, and store the texts in the list myarticle. I expect myarticle to contain only the content of the news articles, which I find all have h attribute. However, the result also contains many irrelevant parts, such as "Thanks for contacting us. We've received your submission." and "This story has been shared 205,105 times. 205,105".

Another issue is that print(myarticle[0]) gives me many news articles, whereas I expect it to give me only one.

I would like to know how to remove the irrelevant parts and keep only the main content, as it reads on the news website, and how to adjust the code so that print(myarticle[0]) gives me only the first news article.

One of the 100 news articles is on this page: https://nypost.com/2020/04/21/missouri-sues-china-over-coronavirus-deceit/

Other news articles I want to scrape are on this site: https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance

Below are the lines of code relevant to my question.

            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                articletext = soup.find_all('p')
                for paragraph in articletext[:-1]:
                    #get the text only
                    text = paragraph.get_text()
                    paragraphtext.append(text)

                #combine all paragraphs into an article
                thearticle.append(paragraphtext)
    # join paragraphs to re-create the article            
    myarticle = [''.join(article) for article in thearticle]
    #show the first string of the list
    print(myarticle[0])
Yue Peng

1 Answer

soup.find_all('p')

This finds every P tag element on the page. P is a very common tag used for most text, which is why you also get non-article text.

I would first find the containing div for just the article and then get the text, something like:

container = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
articletext = container.find_all('p')
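
As a rough offline sketch of the idea above (not tested against the live nypost.com pages): the snippet below parses two small made-up HTML strings instead of fetching real pages, and the helper name `extract_article` is hypothetical. It assumes the article body sits in a div carrying the `entry-content` class, as in the snippet above. It also builds a fresh paragraph list for every page, which appears to be what is needed for `myarticle[0]` to hold a single article rather than all of them.

```python
from bs4 import BeautifulSoup

# Made-up stand-ins for downloaded pages; the real markup may differ.
SAMPLE_PAGES = [
    """<html><body>
    <p>Thanks for contacting us.</p>
    <div class="entry-content entry-content-read-more">
      <p>First article, paragraph one.</p>
      <p>First article, paragraph two.</p>
    </div>
    <p>This story has been shared many times.</p>
    </body></html>""",
    """<html><body>
    <div class="entry-content">
      <p>Second article, only paragraph.</p>
    </div>
    </body></html>""",
]


def extract_article(html):
    """Return the text of the <p> tags inside the article container only."""
    soup = BeautifulSoup(html, "html.parser")
    # A list of classes means "match a div that has any of these classes".
    container = soup.find("div", class_=["entry-content", "entry-content-read-more"])
    if container is None:
        # No article container on this page (assumption: skip it quietly).
        return ""
    # A new list per page, so paragraphs from different pages never mix.
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    return " ".join(paragraphs)


# One string per page, so myarticle[0] is the first article only.
myarticle = [extract_article(html) for html in SAMPLE_PAGES]
print(myarticle[0])
```

With real pages, `html` would be `requests.get(pagelink).text` for each link in `pagelinks`; the paragraphs outside the container are then never collected in the first place.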
Niels van Reijmersdal
  • I tried your approach, but it doesn't seem to work. I used: `containerr = soup.find_all("div", {"class": "entry-content entry-content-read-more"}) articletext = containerr.find_all('p')` I updated the question by adding the website; it would be nice if you could clarify your answer based on that specific website. Thank you! – Yue Peng May 19 '20 at 09:45
  • That is not how you search for multiple classes; read https://stackoverflow.com/questions/18725760/beautifulsoup-findall-given-multiple-classes I have updated my example code. – Niels van Reijmersdal May 19 '20 at 09:57
  • `container = soup.find()` works, instead of `container = soup.find_all()`. Thanks for the information about multiple classes! The text has two classes, 'entry-content' and 'entry-content-read-more', am I right? – Yue Peng May 19 '20 at 10:22
  • Yes, whenever a class attribute has a space " " between the words, it means the element has multiple classes; you can search for both or just one of them. That was probably my mistake; find() is probably the right call here. I don't have BeautifulSoup installed, so I am doing this from the documentation. – Niels van Reijmersdal May 19 '20 at 10:28
  • Thanks a lot! I will mark your answer and upvote it. Could you help me solve another related question? I posted it here: https://stackoverflow.com/questions/61888917/how-to-scrape-web-news-and-combine-paragraphs-into-each-article – Yue Peng May 19 '20 at 10:36
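
To illustrate the points raised in the comments above, here is a small sketch on a made-up HTML string (not the real page). It shows that `find_all()` returns a list-like ResultSet, so calling `.find_all('p')` on it directly raises an error, while `find()` returns a single tag that can be searched further. It also shows that the exact-string form of a class search depends on the order in which the classes appear in the attribute, so searching for one class is likely the more robust choice.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div class="entry-content entry-content-read-more">
  <p>Article text.</p>
</div>
<div class="sidebar"><p>Unrelated text.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list of tags), which has no find_all().
divs = soup.find_all("div", {"class": "entry-content entry-content-read-more"})
try:
    divs.find_all("p")
    failed = False
except AttributeError:
    failed = True

# find() returns a single Tag, so it can be searched further.
container = soup.find("div", class_="entry-content")
texts = [p.get_text() for p in container.find_all("p")]

# The exact-string form only matches the attribute as written, in that order.
swapped = soup.find_all("div", class_="entry-content-read-more entry-content")

print(failed, texts, len(divs), len(swapped))
```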