
I would like to scrape titles, abstracts, claims, and inventor names from Google Patents and add them to an existing CSV file. Could you please help me with this? A sample of my code is as follows:

import requests
from bs4 import BeautifulSoup

# Create empty lists to store extracted information
claim_list = []

# Define a function to extract the claims from a patent URL and append them to the list
def add_info_to_lists(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract claims
    claims = [claim.get_text(strip=True) for claim in soup.select("li.claim, li.claim-dependent")]
    if claims:
        claim_text = " ".join(claims)
        claim_list.append(claim_text)
    else:
        claim_list.append("N/A")

A similar snippet works for plain strings (e.g. application numbers), but it does not work for other json elements.

Thank you in advance!

antopol
  • What are those other json elements specifically? If they are captured in `soup`, it must be a simple matter of parsing soup to get those elements as everything is basically text. What is the url? – Saeed Aug 17 '23 at 15:16
  • I am thinking about the title, inventor names, claims and abstracts. The url is https://patents.google.com/?q=(artificial+intelligence)&oq=artificial+intelligence - I downloaded the list as a csv file. – antopol Aug 17 '23 at 21:27

1 Answer


I wasn't able to figure out how to parse the response object from the requests library, so this uses Selenium to launch a Chrome driver instead. You will need to repeat this for each page of results.

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
import pandas as pd

options = Options()

url = 'https://patents.google.com/?q=(artificial+intelligence)&oq=artificial+intelligence'
# Selenium 4 takes the driver path via a Service object (executable_path was removed)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get(url)
time.sleep(3)  # give the JavaScript-rendered results time to load
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Each result's metadata line is an <h4>; its fields are separated by runs of spaces
items_first_line = [x.text.replace('\n', ' ').split('   ')
                    for x in soup.find_all('h4', attrs={'class': "metadata style-scope search-result-item"})]

locations = [x[0] for x in items_first_line]
patent_numbers = [x[1] for x in items_first_line]
patent_holders = [x[2] for x in items_first_line]
companies = [x[3] for x in items_first_line]

dates = [x.text.replace('\n', ' ').split('   ')
         for x in soup.find_all('h4', attrs={'class': "dates style-scope search-result-item"})]

pd.DataFrame({'locations': locations, 'patent_numbers': patent_numbers,
              'patent_holders': patent_holders, 'companies': companies, 'dates': dates})

Output:

(screenshot of the resulting DataFrame with locations, patent_numbers, patent_holders, companies, and dates columns)

Also, since you are on the search results page, you can't get the full abstracts. If you want all the information about the patents, including the full abstracts, you probably want to navigate to each patent's page and scrape the data there rather than from the search results page. All the hrefs are on the search results page, so visiting each one is easy.
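If you go that route, here is a minimal sketch continuing from the code above (it reuses driver and soup). The href filter, selectors, and meta-tag names are my assumptions about the current Google Patents markup, so check them in your browser's devtools and adjust as needed:

base = 'https://patents.google.com'
# Collect every result link on the current search page; patent detail pages
# appear to live under /patent/, so filter hrefs on that prefix (assumption)
links = sorted({a['href'] for a in soup.find_all('a', href=True)
                if a['href'].startswith('/patent/')})

rows = []
for href in links:
    driver.get(base + href)
    time.sleep(2)  # let the detail page render
    page = BeautifulSoup(driver.page_source, 'html.parser')

    # Title, abstract, and inventors seem to be exposed as <meta> tags in the
    # page head (tag names guessed from the current markup)
    title = page.find('meta', attrs={'name': 'DC.title'})
    abstract = page.find('meta', attrs={'name': 'DC.description'})
    inventors = [m.get('content', '') for m in
                 page.find_all('meta', attrs={'scheme': 'inventor'})]
    # Claims reuse the selector from the question
    claims = [c.get_text(strip=True) for c in
              page.select('li.claim, li.claim-dependent')]

    rows.append({
        'url': base + href,
        'title': title['content'] if title else 'N/A',
        'abstract': abstract['content'] if abstract else 'N/A',
        'inventors': '; '.join(inventors) if inventors else 'N/A',
        'claims': ' '.join(claims) if claims else 'N/A',
    })

pd.DataFrame(rows).to_csv('patent_details.csv', index=False)

From there, attaching these columns to the CSV you downloaded is just a pandas merge on the patent number or URL.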

Saeed
  • Thank you for this, it works well. I am only wondering - I downloaded the csv file from the web search and tried to scrape data from the urls in that file. However, when I try to scrape titles, claims, abstracts etc. I get lots of NAs. While it works for application numbers, it does not work for the other elements. Any clue on why that is happening? – antopol Aug 18 '23 at 06:32