I am currently trying to scrape a link to Google Patents from this page, https://datatool.patentsview.org/#detail/patent/10745438, but when I print out all of the links with an 'a' tag, only links to an unrelated website come up.

Here is my code so far:

import requests
from bs4 import BeautifulSoup

url = 'https://datatool.patentsview.org/#detail/patent/10745438'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

links = []
print(soup)
for link in soup.find_all('a', href=True):
    links.append(link['href'])
    print(link['href'])

When I print out the soup, the 'a' tag with the link to the Google patent isn't there, nor does the link end up in the array. The only links printed are

http://uspto.gov/
tel:1-800-786-9199
./#viz/relationships
./#viz/locations
./#viz/comparisons

all of which is unrelated information. Is Google protecting their links in some way, or is there some other way I can retrieve the link to the Google patent, or redirect to its page?

  • It looks like the links are being protected. If you select `inspect-element` (on the webpage), you'll notice that the links come inside `div class="overlay"`, which doesn't appear in the parsed soup. – sotmot Jun 02 '21 at 18:48
  • Is there any way to access the link even though it's protected? – Zaid Barkat Jun 02 '21 at 18:52
  • I see you're trying to print out all the a hrefs there are, but is there a specific link or area of links you're truly trying to capture within that page (and, I'm assuming, similar pages)? – pedwards Jun 02 '21 at 19:08
  • @pedwards Yes, I am trying to get the link that redirects to the google patent, specifically, https://www.google.com/patents/US10745438, in this case. The text states "go to google patent" as the hyperlink. – Zaid Barkat Jun 02 '21 at 19:23
  • I was trying to install a package that looked very interesting to me and that would perform the function you're after, but for some reason I was having trouble installing it. Take a look at the requests_html Python package; I think it would do exactly what you want. https://pypi.org/project/requests-html/ – pedwards Jun 02 '21 at 19:54
  • Another option I came across was a package named Selenium, which was recommended by a couple of others. Both of these packages allow the page to fully load before scraping (see the sketch below). – pedwards Jun 02 '21 at 19:55
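
A minimal sketch of the Selenium approach mentioned above, assuming Chrome and a matching chromedriver are installed; the CSS selector used to find the rendered Google Patents link is an assumption about the page's generated markup, not something confirmed in this thread:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://datatool.patentsview.org/#detail/patent/10745438'

driver = webdriver.Chrome()  # assumes Chrome and chromedriver are available
try:
    driver.get(url)
    # Wait for the page's JavaScript to render the patent detail view,
    # then collect any links pointing at Google Patents
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'a[href*="google.com/patents"]')
        )
    )
    for a in driver.find_elements(By.CSS_SELECTOR, 'a[href*="google.com/patents"]'):
        print(a.get_attribute('href'))
finally:
    driver.quit()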

1 Answer

Don't scrape it, just do some link hacking:

url = 'https://datatool.patentsview.org/#detail/patent/10745438'
# The trailing path segment is the patent number; prepend 'US' and the
# Google Patents base URL to build the target link directly
google_patents_url = 'https://www.google.com/patents/US' + url.rsplit('/', 1)[1]
print(google_patents_url)  # https://www.google.com/patents/US10745438
RJ Adriaansen
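
As a quick sanity check on the link built in the answer above (this is an addition on my part, not part of the original answer; it assumes outbound network access, and Google has historically redirected these URLs to patents.google.com):

import requests

# Follow redirects from the constructed URL and report where we land
resp = requests.get(google_patents_url, allow_redirects=True, timeout=10)
print(resp.status_code, resp.url)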
  • Wonderful workaround. But how did you know this? Was there any specific documentation/API? – sotmot Jun 03 '21 at 14:04
  • No, I just recognized that the numbers were the same in both links. Apparently this is the patent number, which allows for some convenient link hacking. – RJ Adriaansen Jun 03 '21 at 16:50