Make sure you're using a user-agent, because otherwise Google might eventually block your requests and you'll receive completely different HTML. Check what your user-agent is.
Pass the user-agent:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(URL, headers=headers)
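To double-check which user-agent your script actually sends, one quick way (just a sketch, assuming httpbin.org is reachable from your machine) is to let an echo service send the request headers back:
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# httpbin echoes the headers it received, so you can confirm what was sent
print(requests.get('https://httpbin.org/headers', headers=headers).json()['headers']['User-Agent'])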
First iterate over all organic results:
for index, result in enumerate(soup.select('.tF2Cxc')):
    # code
    # enumerate() provides an index value on each iteration, which will be handy
    # at the saving stage for f-string file names, e.g. file_0, file_1, file_2, ...
Check if a PDF is present via an if statement:
if result.select_one('.ZGwO7'):
    pdf_file = result.select_one('.yuRUbf a')['href']
    # other code
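Keep in mind that Google's CSS class names (.tF2Cxc, .ZGwO7, .yuRUbf) change from time to time. As a fallback — a sketch of an alternative check, not part of the original code — you could look at the link itself instead:
link = result.select_one('.yuRUbf a')['href']
# crude fallback: treat the result as a PDF if the URL itself ends with .pdf
if link.lower().endswith('.pdf'):
    pdf_file = link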
To save the .pdf files locally you can use urllib.request.urlretrieve:
urllib.request.urlretrieve(pdf_file, "YOUR_FOLDER(s)/YOUR_PDF_FILE_NAME.pdf")
# if saving in the same folder, remove the "YOUR_FOLDER(s)/" part
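Note that urlretrieve() won't create the target folder for you. A minimal sketch to make sure it exists first (bs4_pdfs is just the example folder name used below):
import os

# create the folder if it doesn't exist yet
os.makedirs("bs4_pdfs", exist_ok=True)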
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, urllib.request
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "best lasagna recipe:pdf"
}
def get_pdfs():
    html = requests.get('https://www.google.com/search', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')

    for index, result in enumerate(soup.select('.tF2Cxc')):
        # check if a PDF is present via the corresponding CSS class
        if result.select_one('.ZGwO7'):
            pdf_file = result.select_one('.yuRUbf a')['href']

            opener = urllib.request.build_opener()
            opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
            urllib.request.install_opener(opener)

            # save PDF
            urllib.request.urlretrieve(pdf_file, f"bs4_pdfs/pdf_file_{index}.pdf")

            print(f'Saving PDF №{index}..')

get_pdfs()
-------
'''
Saving PDF №0..
Saving PDF №1..
Saving PDF №2..
...
8 PDFs saved to the desired folder
'''
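If you'd rather not mix urllib and requests, the download step could also be done with requests itself — just a sketch using the same headers, loop variables, and folder as above:
pdf_response = requests.get(pdf_file, headers=headers)

# write the raw bytes of the response to disk
with open(f"bs4_pdfs/pdf_file_{index}.pdf", "wb") as f:
    f.write(pdf_response.content)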
Alternatively, you can achieve this by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out how to extract certain parts or elements since it's already done for the end-user.
Code to integrate:
from serpapi import GoogleSearch
import os, urllib.request
def get_pdfs():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "best lasagna recipe:pdf",
        "hl": "en"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for index, result in enumerate(results['organic_results']):
        if '.pdf' in result['link']:
            pdf_file = result['link']

            opener = urllib.request.build_opener()
            opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
            urllib.request.install_opener(opener)

            # save PDF
            urllib.request.urlretrieve(pdf_file, f"serpapi_pdfs/pdf_file_{index}.pdf")

            print(f'Saving PDF №{index}..')
get_pdfs()
-------
'''
Saving PDF №0..
Saving PDF №1..
Saving PDF №2..
...
8 PDFs saved to the desired folder
'''
Also, you can use the camelot library to extract table data from .pdf files.
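A minimal camelot sketch (assuming camelot-py is installed and the downloaded PDF actually contains tables; the file name is just one of the files saved above):
import camelot

# read tables from the first page of one of the downloaded PDFs
tables = camelot.read_pdf("bs4_pdfs/pdf_file_0.pdf", pages="1")

print(tables.n)       # number of tables found
print(tables[0].df)   # first table as a pandas DataFrame
tables[0].to_csv("first_table.csv")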
Disclaimer, I work for SerpApi.