I want to scrape links to patents from a Google Patents search using BeautifulSoup, but I'm not sure whether Google renders the page with JavaScript, which BeautifulSoup cannot parse, or whether the issue is something else.

Here is some simple code:

import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/?assignee=Roche&after=priority:20110602&type=PATENT&num=100'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

links = []
for link in soup.find_all('a', href=True):
    print(link['href'])

I also wanted to append the links to the list, but nothing is printed because there are no 'a' tags in the soup. Is there any way to grab the links to all of the patents?

Bernie Zhu

1 Answer

The data is rendered dynamically, so it is hard to get with bs4. Instead, open Chrome's developer tools.

Go to the Network tab, select the XHR filter, and reload the page. Requests will appear under the Name column; one of them returns all of the data in JSON format.

Copy the address of that request and call it with the requests module. From the response you can extract whatever data you want.

If you want the individual links, each one is built from the publication_number, which you can join with the original URL to get the link to each publication.
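Instead of copying the address by hand each time, it can probably be built from the search query. In the address used in this answer, the `url` parameter is just the original query string, percent-encoded. A minimal sketch, assuming that pattern holds for other searches too (the helper name `build_xhr_url` and the constant `XHR_ENDPOINT` are made up for illustration, and the trailing empty `exp=` parameter is simply copied from the observed address):

```python
from urllib.parse import quote

# Endpoint as seen in the browser's Network tab for this search.
XHR_ENDPOINT = "https://patents.google.com/xhr/query"


def build_xhr_url(query: str) -> str:
    """Percent-encode the search query and wrap it in the XHR address."""
    # safe="" makes quote() also encode '=', '&' and ':'.
    encoded = quote(query, safe="")
    return f"{XHR_ENDPOINT}?url={encoded}&exp="


query = "assignee=Roche&after=priority:20110602&type=PATENT&num=100"
print(build_xhr_url(query))
```

For this query the result matches the address copied from the Network tab, so it can be passed straight to `requests.get`.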

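import requests

main_url = "https://patents.google.com/"
params = "?assignee=Roche&after=priority:20110602&type=PATENT&num=100"

# XHR address copied from the Network tab; the search query is percent-encoded into the `url` parameter.
res = requests.get("https://patents.google.com/xhr/query?url=assignee%3DRoche%26after%3Dpriority%3A20110602%26type%3DPATENT%26num%3D100&exp=")
res.raise_for_status()  # stop early if the request failed
main_data = res.json()
data = main_data['results']['cluster']

# Each result carries a publication_number, which is used to build the patent page link.
for result in data[0]['result']:
    num = result['patent']['publication_number']
    print(num)
    print(main_url + "patent/" + num + "/en" + params)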

Output:

US10287352B2
https://patents.google.com/patent/US10287352B2/en?assignee=Roche&after=priority:20110602&type=PATENT&num=100
US10364292B2
https://patents.google.com/patent/US10364292B2/en?assignee=Roche&after=priority:20110602&type=PATENT&num=100
US10494633B2
.....
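Since the question also asked for the links to be appended to a list, the loop above can be wrapped in a small function that returns them. This is only a sketch: `collect_patent_links` is a made-up name, the `sample` dictionary is hand-written to mimic the JSON structure used above (it is not a real response), and walking every cluster rather than only `data[0]` is an assumption.

```python
def collect_patent_links(main_data, base_url="https://patents.google.com/"):
    """Return one patent page link per result in the parsed JSON."""
    links = []
    # Assumption: every cluster has the same shape as data[0] above.
    for cluster in main_data["results"]["cluster"]:
        for result in cluster["result"]:
            num = result["patent"]["publication_number"]
            links.append(f"{base_url}patent/{num}/en")
    return links


# Hand-written sample mimicking the structure of res.json().
sample = {
    "results": {
        "cluster": [
            {
                "result": [
                    {"patent": {"publication_number": "US10287352B2"}},
                    {"patent": {"publication_number": "US10364292B2"}},
                ]
            }
        ]
    }
}

links = collect_patent_links(sample)
print(links)
```

With a real response, `collect_patent_links(res.json())` would take the place of the sample call.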

Bhavya Parikh
  • Some time has passed and I don't expect you to reply. But why isn't google blocking this? Why is it working when on "normal" google search any automation gets instantly blocked? – purple_lolakos Jun 19 '23 at 20:58