
Is it possible to scrape Google for PDF files? That is, to download all ".pdf" files within a certain number of search results for a given term. Web scraping is pretty new to me; I've been using beautifulsoup4, if it's possible with that.

Thanks in advance.

Infinitus
    You should probably consider Scrapy to complement BeautifulSoup. If by scraping Google you mean making a query to Google and scraping the returned results, it is not easy, as this is against Google's user agreement. After a number of queries, Google will detect the unusual activity and start re-routing your requests to a separate page that requires manual user interaction (i.e., a CAPTCHA), which makes scraping nearly impossible. However, if you are willing to pay for a Google App Engine account, you might be able to do this legally. Search "Google app engine web scraping". – lightalchemist Jun 11 '20 at 05:07

2 Answers


Make sure you're passing a user-agent header, because otherwise Google might eventually block the request and you'll receive completely different HTML. Check what your user-agent is.

Pass user-agent:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get(URL, headers=headers)

First iterate over all organic results:

for index, result in enumerate(soup.select('.tF2Cxc')):
  # code

# enumerate() was used to provide an index value for each iteration,
# which comes in handy at the saving stage for f-string file names, e.g. file_0, file_1, file_2...

Check whether the result is a PDF via the corresponding CSS class:

if result.select_one('.ZGwO7'):
  pdf_file = result.select_one('.yuRUbf a')['href']
  # other code

To save .pdf files locally you can use urllib.request.urlretrieve:

urllib.request.urlretrieve(pdf_file, "YOUR_FOLDER(s)/YOUR_PDF_FILE_NAME.pdf")
# if saving in the same folder, remove "YOUR_FOLDER" part
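
Note that urlretrieve won't create the destination folder by itself, so make sure it exists first. A minimal way to do that (the os.makedirs step is an addition, not part of the original snippet):

import os

os.makedirs("YOUR_FOLDER", exist_ok=True)  # creates the folder, no error if it already exists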

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml, urllib.request

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "best lasagna recipe:pdf"
}

def get_pdfs():
    html = requests.get('https://www.google.com/search', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')

    for index, result in enumerate(soup.select('.tF2Cxc')):

      # check if the result is a PDF via the corresponding CSS class
      if result.select_one('.ZGwO7'):
        pdf_file = result.select_one('.yuRUbf a')['href']

        # pass a user-agent to urllib as well, otherwise some hosts reject the download
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        # save PDF (the bs4_pdfs folder needs to exist)
        urllib.request.urlretrieve(pdf_file, f"bs4_pdfs/pdf_file_{index}.pdf")

        print(f'Saving PDF №{index}..')

get_pdfs()

-------
'''
Saving PDF №0..
Saving PDF №1..
Saving PDF №2..
...

8 pdf's saved to the desired folder
'''
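
The question mentions "a certain number of search results", while the code above only parses the first results page. One possible (untested) tweak is to pass Google's num and start query parameters in params; this is only a sketch, Google may cap or ignore num, and paging through many results triggers the blocking mentioned above sooner:

params = {
  "q": "best lasagna recipe:pdf",
  "num": "20",    # ask for up to 20 results per page (Google may cap or ignore this)
  "start": "0"    # result offset; increase it (0, 20, 40, ...) to page through results
}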

Alternatively, you can achieve this by using Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't need to figure out how to extract certain parts or elements since it's already done for the end-user.

Code to integrate:

from serpapi import GoogleSearch
import os, urllib.request

def get_pdfs():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "best lasagna recipe:pdf",
      "hl": "en"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for index, result in enumerate(results['organic_results']):
      if '.pdf' in result['link']:
        pdf_file = result['link']

        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        # save PDF
        urllib.request.urlretrieve(pdf_file, f"serpapi_pdfs/pdf_file_{index}.pdf")

        print(f'Saving PDF №{index}..')

get_pdfs()

-------
'''
Saving PDF №0..
Saving PDF №1..
Saving PDF №2..
...

8 pdf's saved to the desired folder
'''

Also, you can use the camelot library to extract table data from the downloaded .pdf files.
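
For example, a minimal camelot sketch (the file name below is just a placeholder matching the code above, and camelot only helps with PDFs that actually contain tables):

import camelot

# read tables from the first page of one of the saved PDFs (placeholder file name)
tables = camelot.read_pdf("bs4_pdfs/pdf_file_0.pdf", pages="1")

if tables.n:                              # number of tables detected
    print(tables[0].df)                   # first table as a pandas DataFrame
    tables.export("tables.csv", f="csv")  # or export every detected table to CSV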

Disclaimer, I work for SerpApi.

Dmitriy Zub

Here's what I would do.

  1. Google allows you to search by file type by adding filetype: followed by the extension you want (pdf in this case) to the query.

  2. You can bypass the Google search page by using a direct URL and changing the query: https://www.google.com/search?q=these+are+keywords+filetype%3Apdf

  3. You can use BeautifulSoup to find the URL of each search result (relevant question's answer). The most important part is that each search result has a class "g", so you can get the URL from each element that has that class.

  4. From there, you can use BeautifulSoup to find the direct URL to the PDF. The URL will be in an "a" tag's href attribute (relevant question's answer). A rough sketch combining these steps is below.
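
Putting those four steps together, something like this might work; it's untested, Google changes its result markup often (so the "g" class and selectors may need adjusting), and a user-agent header is assumed to help avoid being served a different page:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
params = {"q": "these are keywords filetype:pdf"}  # steps 1-2: filetype:pdf in the query

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, "html.parser")

pdf_links = []
for result in soup.select(".g"):       # step 3: each search result has the class "g"
    link = result.select_one("a")      # step 4: the direct URL sits in an <a> tag's href
    if link and link.get("href", "").lower().endswith(".pdf"):
        pdf_links.append(link["href"])

print(pdf_links)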

I'm not an expert, but maybe this will be enough to set you on your way. Others may chime in with better methods.

TheKingElessar