
I'm trying to scrape from:

https://www.washingtonpost.com/graphics/politics/trump-claims-database/?noredirect=on&utm_term=.66adf0edf80b

all the dates and the texts on the left side (shown in the screenshot).

So far I have tried the following code, which only retrieves 17 results and also picks up some of the text from the right side.

import requests
from bs4 import BeautifulSoup

r=requests.get('https://www.washingtonpost.com/graphics/politics/trump-claims-database/?noredirect=on&utm_term=.4a34a0231c12')
html=BeautifulSoup(r.content,'html.parser')
results=html.find_all('p','pg-bodyCopy')

My question is:

How can I get a list with all the left text and another list with the date corresponding to the text?

Sample output:

[(Mar 3 2019,After more than two years of Presidential Harassment, the only things that have been proven is that Democrats and other broke the law. The hostile Cohen testimony, given by a liar to reduce his prison time, proved no Collusion! His just written book manuscript showed what he said was a total lie, but Fake Media won't show it. I am an innocent man being persecuted by some very bad, conflicted & corrupt people in a Witch Hunt that is illegal & should never have been allowed to start - And only because I won the Election!)]

EDIT: I am also wondering if it is possible to retrieve the source (Twitter, Facebook, etc.) as per the image. I tried:

loc=row.find('div',class_='details expanded').text.strip()

Moreno

2 Answers


Not all of the items you are looking for are available in the initial page load. You can use Selenium to click the "Load more" button repeatedly until all the data is loaded, and then fetch the page source.

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path='/home/bitto/chromedriver')
url="https://www.washingtonpost.com/graphics/politics/trump-claims-database/?noredirect=on&utm_term=.777b6a97b73d"#your url here
driver.get(url)
claim_list=[]
date_list=[]
source_list=[]
i=50
while i<=50: #change the 50 in this condition to 9000 to load all the claims
    try:
        element=WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR ,'button.pg-button')))
        element.click()
    except TimeoutException: #the button disappears once everything is loaded
        break
    i+=50
#getting the data and printing it out
soup=BeautifulSoup(driver.page_source,'html.parser')
claim_rows=soup.find_all('div',class_='claim-row')
for row in claim_rows:
    date=row.find('div',class_='dateline').text.strip()
    claim=row.find('div',class_='claim').text.replace('"','').strip()
    source=row.find('div',class_='details not-expanded').find_all('p')[1].find('span').text
    claim_list.append(claim)
    date_list.append(date)
    source_list.append(source)

#we will zip it make it easier to view the output
print(list(zip(date_list,claim_list,source_list)))

Output

[('Mar 3 2019', "“Presidential Harassment by 'crazed' Democrats at the highest level in the history of our Country. Likewise, the most vicious and corrupt Mainstream Media that any president has ever had to endure.”", 'Twitter'), ('Mar 3 2019', "“After more than two years of Presidential Harassment, the only things that have been proven is that Democrats and other broke the law. The hostile Cohen testimony, given by a liar to reduce his prison time, proved no Collusion! His just written book manuscript showed what he said was a total lie, but Fake Media won't show it. I am an innocent man being persecuted by some very bad, conflicted & corrupt people in a Witch Hunt that is illegal & should never have been allowed to start - And only because I won the Election!”", 'Twitter'), ('Mar 3 2019', '“The reason I do not want military drills with South Korea is to save hundreds of millions of dollars for the U.S. for which we are not reimbursed. ”', 'Twitter'), ('Mar 3 2019', "“For the Democrats to interview in open hearings a convicted liar & fraudster, at the same time as the very important Nuclear Summit with North Korea, is perhaps a new low in American politics and may have contributed to the 'walk.' Never done when a president is overseas. Shame!”", 'Twitter'), ('Mar 3 2019', '“The most successful first two years for any President. We are WINNING big, the envy of the WORLD.”', 'Twitter'), ('Mar 2 2019', '“Remember you have Nebraska. We won both [Electoral College votes] in Nebraska. We won the half.”', 'Remarks'),...]
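If you prefer a file to printing, the three lists can also be written to a CSV with only the standard library. A minimal sketch, using made-up sample rows in place of the scraped lists:

```python
import csv

# Hypothetical sample rows standing in for the scraped lists above.
date_list = ['Mar 3 2019', 'Mar 2 2019']
claim_list = ['claim one', 'claim two']
source_list = ['Twitter', 'Remarks']

with open('claims.csv', 'w', newline='', encoding='utf8') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'claim', 'source'])  # header row
    # zip pairs up the i-th date, claim and source into one row each
    writer.writerows(zip(date_list, claim_list, source_list))
```

`newline=''` is the documented way to open a file for the `csv` module, so that it controls line endings itself.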
Bitto
  • I'm running this on Linux Mint 18.3 - Firefox 65.0 (64-bit) and when trying to run the code after changing `driver = webdriver.Chrome()` to `driver = webdriver.Firefox()` I'm getting the error: **Message: 'geckodriver' executable needs to be in PATH** – Moreno Mar 12 '19 at 00:01
  • @Moreno You have to add the path to your geckodriver. See https://stackoverflow.com/questions/40208051/selenium-using-python-geckodriver-executable-needs-to-be-in-path – Bitto Mar 12 '19 at 00:22
  • @Moreno In Ubuntu I temporarily add path like this `export PATH=$PATH:/home/bitto/path/to/gekodriver_folder` – Bitto Mar 12 '19 at 00:23
  • I have tested and after downloading gecko driver it worked perfectly. Just One last question: Is there a way to retrieve also the source (Twitter,Facebook, etc?) I tried with ` loc=row.find('div',class_='details expanded').text.strip()` with no results – Moreno Mar 12 '19 at 19:42
  • @Moreno Yes it is possible. I have edited my answer to include that as well. – Bitto Mar 12 '19 at 19:52
  • Ok, I know we are encouraged not to use this section to say thanks but man you are the very best! Please just last but not least I do not know why no matter what value of i is set, it only returns 100 results. I'm not sure whether it is something with my explorer. Any advice on that? – Moreno Mar 12 '19 at 19:59
  • @Moreno Don't change the value of `i` .Change `i<=50` to `i<=9000`. Note that this may take a very long time ~ 20 mins or so for me. So try for `i<=200` first. – Bitto Mar 12 '19 at 20:01
  • @Moreno Consider accepting this answer if it solved the problem in your question. Thanks! – Bitto Mar 12 '19 at 21:59

The data you are looking for is here:

https://www.washingtonpost.com/graphics/politics/trump-claims-database/js/base.js?c=230b1e82e2fc6c49a25a4c6554455c3bf0f527d5-1551707436

It is a JS array named 'claims'. Each entry looks like:

{
  id: "8920",
  date: "Mar 3 2019",
  location: "Twitter",
  claim: "“Presidential Harassment by 'crazed' Democrats at the highest level in the history of our Country. Likewise, the most vicious and corrupt Mainstream Media that any president has ever had to endure.”",
  analysis: 'The scrutiny of President Trump by the House of Representatives is little different than the probes launched by Republicans of Barack Obama, Democrats of George W. Bush or Republicans of Bill Clinton, just to name of few recent examples. President John Tyler was actually ousted by his party (the Whigs) while Andrew Johnson and Clinton were impeached. As for media coverage, Trump regularly appears to believe it should only be positive. He has offered little evidence the media is "corrupt."',
  pinocchios: null,
  category: "Miscellaneous",
  repeated: null,
  r_id: null,
  full_story_url: null,
  unixDate: "1551589200"
}
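Each decoded entry maps directly onto the (date, claim, source) tuple the question asks for. A sketch with a hypothetical, trimmed-down dict standing in for one real decoded entry:

```python
# Hypothetical decoded entry; the real dicts come from demjson.decode.
claim = {'date': 'Mar 3 2019', 'location': 'Twitter', 'claim': 'some claim text'}

# The 'location' field holds what the question calls the source.
row = (claim['date'], claim['claim'], claim['location'])
print(row)  # ('Mar 3 2019', 'some claim text', 'Twitter')
```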

Code (I downloaded the content of the page to my file system as claims.txt):

I am using demjson to turn the JS object literal into Python dicts (the keys are unquoted, so the standard json module cannot parse it).

import demjson
start_str = 'e.exports={claims:'
end_str = 'lastUpdated'
with open('c:\\temp\\claims.txt','r',encoding="utf8") as claims_file:
    dirty_claims = claims_file.read()
    start_str_idx = dirty_claims.find(start_str)
    end_str_idx = dirty_claims.rfind(end_str)
    print('{} {}'.format(start_str_idx,end_str_idx))
    claims_str = dirty_claims[start_str_idx + len(start_str):end_str_idx-1]
    claims = demjson.decode(claims_str)
    for claim in claims:
        print(claim)
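The slicing above can be sanity-checked on a tiny, hypothetical stand-in for the real base.js content:

```python
# Tiny, hypothetical stand-in for the real base.js content.
sample = 'e.exports={claims:[{id: "8920", date: "Mar 3 2019"}],lastUpdated:"..."}'
start_str = 'e.exports={claims:'
end_str = 'lastUpdated'
start_idx = sample.find(start_str)
end_idx = sample.rfind(end_str)
# Slice out the JS array; the -1 drops the comma before lastUpdated.
claims_str = sample[start_idx + len(start_str):end_idx - 1]
print(claims_str)  # [{id: "8920", date: "Mar 3 2019"}]
```

The extracted string is the bare JS array, which is what demjson.decode expects.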
balderman