
I want to scrape data from a French website with newspaper3k, but the result is only 50 articles. The website has far more than 50 articles. Where am I going wrong?

My goal is to scrape all the articles on this website.

I tried this:

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr/', memoize_articles=False)

# Empty list to collect all article URLs
papers = []

for article in legorafi_paper.articles:
    papers.append(article.url)

print(legorafi_paper.size())

The result of this print is 50 articles.

I don't understand why newspaper3k only scrapes 50 articles and not more.

UPDATE OF WHAT I TRIED:

def Foo(firstTime=[]):
    # the mutable default argument keeps its state between calls,
    # so the cookie-consent iframe is only switched to on the first call
    if firstTime == []:
        WebDriverWait(driver, 30).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "div#appconsent>iframe")))
        firstTime.append('Not Empty')
    else:
        print('Cookies already accepted')


%%time




import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

import newspaper
import requests
from newspaper.utils import BeautifulSoup
from newspaper import Article

categories = ['people', 'sports']
papers = []
urls_set = set()  # used below to deduplicate article URLs


driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")
driver.get('http://www.legorafi.fr/')


for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    #WebDriverWait(self.driver, 10)
    driver.get(url)
    Foo()
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.button--filled>span.baseText"))).click()

    pagesToGet = 2

    title = []
    content = []
    for page in range(1, pagesToGet+1):
        print('Processing page :', page)
        #url = 'http://www.legorafi.fr/category/france/politique/page/'+str(page)
        print(driver.current_url)
        #print(url)

        time.sleep(3)

        # parse the page the driver has navigated to, not the original category URL
        raw_html = requests.get(driver.current_url)
        soup = BeautifulSoup(raw_html.text, 'html.parser')
        for articles_tags in soup.findAll('div', {'class': 'articles'}):
            for article_href in articles_tags.find_all('a', href=True):
                if not str(article_href['href']).endswith('#commentaires'):
                    urls_set.add(article_href['href'])
                    papers.append(article_href['href'])


        for paper_url in papers:
            article = Article(paper_url)
            article.download()
            article.parse()
            if article.title not in title:
                title.append(article.title)
            if article.text not in content:
                content.append(article.text)
            #print(article.title, article.text)

        time.sleep(5)
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        driver.find_element_by_xpath("//a[contains(text(),'Suivant')]").click()
        time.sleep(10)
LJRB
  • `Stackoverflow` has a special method (and keyboard shortcut) to format multiline code. – furas Sep 07 '20 at 20:53
    Websites may block you after some count; you may be able to get in touch with their webadmin, who could provide a better collection than you can scrape, perhaps for free if you're doing some science or learning they could benefit from! – ti7 Sep 07 '20 at 20:55
  • Thank you @ti7, do you know if I can bypass it with Python code? – LJRB Sep 13 '20 at 11:14
  • @LJRB you could use a [proxy or proxies](https://en.wikipedia.org/wiki/Proxy_server), which will allow your program to act as many independent clients rather than a single one. However, asking for better access to the data directly (perhaps as simple as an account which does not have the 50 pages restriction) and citing them in the work you are producing may be all they would ask of you to receive much higher quality access (they are aware that anyone can write a program to read their website, and if you have a website, you will see many bots actively are reading yours). – ti7 Sep 13 '20 at 22:24
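For reference, requests supports routing traffic through a proxy via its proxies argument. This is only a minimal sketch; the proxy address below is a placeholder, not a real proxy.

import requests

# placeholder proxy address (203.0.113.0/24 is a documentation range);
# replace it with a proxy you are actually allowed to use
proxies = {
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
}

response = requests.get('http://www.legorafi.fr/', proxies=proxies, timeout=10)
print(response.status_code)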

1 Answer


UPDATE 09-21-2020

I rechecked your code and it is working correctly, because it is extracting all the articles on the main page of Le Gorafi. The articles on this page are highlights from the category pages, such as societe, sports, etc.

The example below is from the main page's source code. Each of these articles is also listed on the category page sports.

<div class="cat sports">
    <a href="http://www.legorafi.fr/category/sports/">
       <h4>Sports</h4>
          <ul>
              <li>
                 <a href="http://www.legorafi.fr/2020/07/24/chaque-annee-25-des-lutteurs-doivent-etre-operes-pour-defaire-les-noeuds-avec-leur-bras/" title="Voir l'article 'Chaque année, 25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras'">
                  Chaque année, 25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras</a>
              </li>
               <li>
                <a href="http://www.legorafi.fr/2020/07/09/frank-mccourt-lom-nest-pas-a-vendre-sauf-contre-beaucoup-dargent/" title="Voir l'article 'Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent »'">
                  Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent </a>
              </li>
              <li>
                <a href="http://www.legorafi.fr/2020/06/10/euphorique-un-parieur-appelle-son-fils-betclic/" title="Voir l'article 'Euphorique, un parieur appelle son fils Betclic'">
                  Euphorique, un parieur appelle son fils Betclic                 </a>
              </li>
           </ul>
               <img src="http://www.legorafi.fr/wp-content/uploads/2015/08/rubrique_sport1-300x165.jpg"></a>
        </div>
              </div>

It seems that there are 35 unique article entries on the main page.

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr', memoize_articles=False)

papers = []
urls_set = set()
for article in legorafi_paper.articles:
    # check whether the article url is already in urls_set
    if article.url not in urls_set:
        # add the unique article url to the set
        urls_set.add(article.url)
        # skip links to article comment sections
        if not str(article.url).endswith('#commentaires'):
            papers.append(article.url)

print(len(papers))
# output
35

If I change the URL in the code above to http://www.legorafi.fr/category/sports, it returns the same number of articles as http://www.legorafi.fr. After looking at the source code for Newspaper on GitHub, it seems that the module uses urlparse and relies on the netloc segment of the URL, which is www.legorafi.fr for both addresses, so the category path is effectively ignored. This is a known problem with Newspaper, based on this open issue.
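As a minimal illustration of that behaviour (this is just urlparse applied to the two URLs, not Newspaper's internal code): both addresses share the same netloc, which is why Newspaper treats them as the same source.

from urllib.parse import urlparse

# the category only shows up in .path; .netloc is identical for both URLs
for u in ('http://www.legorafi.fr', 'http://www.legorafi.fr/category/sports'):
    parsed = urlparse(u)
    print(parsed.netloc, parsed.path)

# output
# www.legorafi.fr
# www.legorafi.fr /category/sports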

Obtaining all the articles is more complex, because it requires additional modules, namely requests and BeautifulSoup (the latter can be imported from Newspaper). The code below uses requests and BeautifulSoup to collect the article URLs from the source code of the main page and the category pages, and can be refined further.

import newspaper
import requests
from newspaper.utils import BeautifulSoup

papers = []
urls_set = set()

legorafi_paper = newspaper.build('http://www.legorafi.fr',
                                 fetch_images=False, memoize_articles=False)

for article in legorafi_paper.articles:
    if article.url not in urls_set:
        urls_set.add(article.url)
        if not str(article.url).endswith('#commentaires'):
            papers.append(article.url)

legorafi_urls = {'monde-libre': 'http://www.legorafi.fr/category/monde-libre',
                 'politique': 'http://www.legorafi.fr/category/france/politique',
                 'societe': 'http://www.legorafi.fr/category/france/societe',
                 'economie': 'http://www.legorafi.fr/category/france/economie',
                 'culture': 'http://www.legorafi.fr/category/culture',
                 'people': 'http://www.legorafi.fr/category/people',
                 'sports': 'http://www.legorafi.fr/category/sports',
                 'hi-tech': 'http://www.legorafi.fr/category/hi-tech',
                 'sciences': 'http://www.legorafi.fr/category/sciences',
                 'ledito': 'http://www.legorafi.fr/category/ledito/'
                 }

for category, url in legorafi_urls.items():
    raw_html = requests.get(url)
    soup = BeautifulSoup(raw_html.text, 'html.parser')
    for articles_tags in soup.findAll('div', {'class': 'articles'}):
        for article_href in articles_tags.find_all('a', href=True):
            if not str(article_href['href']).endswith('#commentaires'):
                urls_set.add(article_href['href'])
                papers.append(article_href['href'])

print(len(papers))
# output
155

If you need to obtain the articles listed in the subpages of a category page (politique currently has 120 subpages), then you would have to use something like Selenium to click the links, or request the paginated URLs directly.
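As a rough sketch of that second option (not part of the original answer): assuming the subpages follow the /page/N/ URL pattern visible in the commented-out URL in the question, the per-page parsing above can simply be repeated for each subpage with requests and BeautifulSoup instead of clicking 'Suivant' with Selenium. pages_to_get is a placeholder for how many subpages you want to crawl.

import requests
from newspaper.utils import BeautifulSoup

papers = []
urls_set = set()

category_url = 'http://www.legorafi.fr/category/france/politique'
pages_to_get = 3  # placeholder: raise this to crawl more subpages

for page in range(1, pages_to_get + 1):
    # the first page is the category URL itself; later pages use the /page/N/ pattern
    page_url = category_url if page == 1 else '{}/page/{}/'.format(category_url, page)
    raw_html = requests.get(page_url)
    if raw_html.status_code != 200:
        break  # stop when the site has no more subpages
    soup = BeautifulSoup(raw_html.text, 'html.parser')
    for articles_tags in soup.findAll('div', {'class': 'articles'}):
        for article_href in articles_tags.find_all('a', href=True):
            if not str(article_href['href']).endswith('#commentaires'):
                urls_set.add(article_href['href'])
                papers.append(article_href['href'])

print(len(papers))

If the pagination really does require clicking, the same parsing block can instead be run on driver.page_source after each Selenium click.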

Hopefully, this code helps you get closer to achieving your objective.

Life is complex
  • Thank you for your help @Life is complex, I tried your code but I get the same 53 URLs for each category. What can I do? – LJRB Sep 20 '20 at 20:20
  • Hmm... let me look into this. – Life is complex Sep 20 '20 at 20:45
  • @LJRB please provide some details in your question on what articles you would like to scrape from the website http://www.legorafi.fr or its subpages. Once you do that I can troubleshoot my answer. – Life is complex Sep 21 '20 at 02:56
  • Hello, I would like to scrape all the articles on all categories and on all the pages if it is possible. – LJRB Sep 21 '20 at 07:11
  • Thank you @Life is complex, it is very clear and precise. Thank you for your answer. – LJRB Sep 22 '20 at 18:48
  • @LJRB You're welcome. I'm glad that I could help you solve this question. – Life is complex Sep 22 '20 at 18:55
  • I will try to add selenium – LJRB Sep 22 '20 at 18:57
  • @LJRB You can weave selenium and newspaper together, which will give you a more comprehensive solution. – Life is complex Sep 23 '20 at 01:46
  • Thank you for your advice, I'm going to try it. – LJRB Sep 23 '20 at 07:53
  • Hello @Life is complex. If I want to obtain the articles listed in the subpages of a category page, do I have to add Selenium to click through to the other pages and apply the same code you advised above? – LJRB Sep 26 '20 at 11:28
  • @LJRB You will apply only a part of the code for each page that you navigate to with Selenium. I just tested this and it will be the second part of the code, the one that uses BeautifulSoup. – Life is complex Sep 26 '20 at 12:55
  • Can you help me @Life is complex? I don't know where to put the Selenium part. I know that I have to click to go to the next page, but I don't know how to code this or whether I have to do other things with Selenium. Do I have to use a for loop? – LJRB Sep 26 '20 at 13:13
  • @LJRB Yes, it will be a for loop that will query each page. – Life is complex Sep 26 '20 at 17:06
  • So for each page, do I have to repeat the code after the "Selenium code?" comment and then repeat the following lines? I edited the post above. – LJRB Sep 26 '20 at 17:32
    @LJRB look at these questions to see if they are useful - https://stackoverflow.com/search?q=Selenium++newspaper – Life is complex Sep 26 '20 at 17:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/222126/discussion-between-ljrb-and-life-is-complex). – LJRB Sep 26 '20 at 18:11