I am unable to extract field data from a web page, and it is not a common web scraping problem: the data is embedded in JavaScript. I also tried python-requests but could not solve the problem.

I am trying to extract the DOI from the web page. The DOI sits inside the page's JavaScript. I can read the page, and the code works up to print(soup). The problem starts when I try to extract the DOI value. For the example page, the source contains "doi":"10.1109/LAWP.2014.2364296", and I want to print 10.1109/LAWP.2014.2364296 as extracted from the page.

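import urllib.request  # a bare "import urllib" does not reliably expose urllib.request
from bs4 import BeautifulSoup

web_page = 'https://ieeexplore.ieee.org/document/6933872'
page = urllib.request.urlopen(web_page)
soup = BeautifulSoup(page, 'html.parser')
print(soup)  # works up to here
# Matches only text nodes that are exactly 'doi', so the value inside the script is not found
print(soup.body.find_all(string='doi'))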

For the page "https://ieeexplore.ieee.org/document/6933872", the expected output is 10.1109/LAWP.2014.2364296. How can I do this?

1 Answer

A possible solution that sidesteps the JavaScript scraping issue is to use the IEEE API (https://developer.ieee.org/). It requires registration and approval to get an API key, but once you have one, it is much easier to send in a batch of IEEE article numbers and get back their DOIs and other metadata in a structured form.
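If waiting for an API key is not an option, a regular expression over the raw page source may be enough. The following is only a minimal sketch, assuming the source really contains the literal text "doi":"..." as the question reports. The names extract_doi, fetch_html and DOI_PATTERN, the sample string, and the User-Agent header are illustrative assumptions, not anything defined by IEEE.

```python
import re
import urllib.request

# Assumed pattern: the question reports that the page source contains "doi":"<value>".
DOI_PATTERN = re.compile(r'"doi"\s*:\s*"([^"]+)"')


def extract_doi(html):
    """Return the first "doi" value found in the page source, or None."""
    match = DOI_PATTERN.search(html)
    return match.group(1) if match else None


def fetch_html(url):
    """Download a page as text.

    The User-Agent header is an assumption: some sites reject the
    default urllib one. This helper is not called in the demo below.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical snippet shaped like the fragment quoted in the question.
sample = '<script>var metadata = {"title":"Example","doi":"10.1109/LAWP.2014.2364296"};</script>'
print(extract_doi(sample))
```

To try it on the live page, call extract_doi(fetch_html('https://ieeexplore.ieee.org/document/6933872')). Whether that works depends on the site serving the metadata in the initial HTML and accepting the request, which is not guaranteed.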

dagrenzer