Extract information from a PDF embedded in a web page using Python and requests

Question

I am trying to extract some information from a PDF embedded in a web page using Python and requests. This is exactly the sentence I want to reach: « Sciences de la vie et de l’environnement ».

image

Here is the code I wrote:

import requests
from bs4 import BeautifulSoup

# Page to scrape
url = "https://fs.uit.ac.ma/avis-de-soutenance-dune-these-de-doctorat-mme-achachi-hind/"

with requests.Session() as s:
    # Fetch the page (certificate verification is disabled here)
    html_content = s.get(url, verify=False)
    # Parse the HTML content
    soup = BeautifulSoup(html_content.content, "html.parser")
    # Take the address of the first iframe on the page
    url2 = soup.iframe["src"]
    # Fetch the iframe document and print its HTML
    html_doc = s.get(url2, verify=False).text
    print(html_doc)

Here is part of the output of print(html_doc):

Print result

Comparing the two pictures, I can't see what is inside this element in the last one:

<div id="viewer" class="pdfViewer"></div>

The text I want is inside this element:

The line I want to reach

@KJ How do I get to this script "" in Python? – RACHID BEN ABDELMALEK Mar 04 '22 at 18:05

score 1 · Accepted Answer · answered Mar 04 '22 at 14:58

You can access the PDF directly (https://fs.uit.ac.ma/wp-content/uploads/2022/02/AVIS-DE-SOUTENANCE-ACHACHI-HIND.pdf). Its URL appears in the iframe and in the network requests. If there is no way to get the URL from the page source, you have to capture the network requests instead (e.g. with BrowserMob).
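Reading the PDF URL from the iframe can be sketched as follows. The printed HTML contains an empty pdfViewer div, which suggests a JavaScript viewer that fills the page in the browser, so requests never sees the rendered text. The sketch assumes the iframe src is either a direct link to the PDF or a viewer URL that carries the PDF location in a `file` query parameter; that parameter name and the helper name `pdf_url_from_iframe_src` are assumptions, not something confirmed for this site.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def pdf_url_from_iframe_src(iframe_src, page_url):
    """Return the PDF URL that an iframe src points to.

    Assumption: the src is either a direct PDF link or a viewer URL
    holding the PDF location in a `file` query parameter.
    """
    # Resolve relative src values against the page that contains the iframe
    absolute_src = urljoin(page_url, iframe_src)
    # parse_qs also decodes percent-encoded values such as %2F
    query = parse_qs(urlparse(absolute_src).query)
    if "file" in query:
        # The parameter itself may be relative to the viewer URL
        return urljoin(absolute_src, query["file"][0])
    # No `file` parameter: treat the src as the document itself
    return absolute_src


# Hypothetical usage with the variables from the question:
# pdf_url = pdf_url_from_iframe_src(soup.iframe["src"], url)
```

If the returned address does not end in .pdf for a given page, the viewer probably passes the document some other way, and inspecting the network requests remains the fallback.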

CampingCow

I want to do this in Python because I have 500 PDFs to extract the information from. – RACHID BEN ABDELMALEK Mar 04 '22 at 15:11
Yes, I just mean you can get the PDF URL by reading the iframe URL. Then download the PDF and process it with Python: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python – CampingCow Mar 07 '22 at 09:00
Thank you, that's what I did. I don't think there is a way to read it directly from the web. – RACHID BEN ABDELMALEK Mar 08 '22 at 04:52
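The download-and-process step described in these comments might look like the sketch below. It assumes the third-party packages `requests` and `pypdf` are installed and that the PDFs contain real text; for scanned pages `extract_text()` typically returns nothing, and OCR would be needed instead. The function names are hypothetical. The PDF is read from memory, so nothing is written to disk.

```python
import io
import re


def normalize(text):
    """Unify apostrophes, collapse whitespace and lowercase the text."""
    text = text.replace("\u2019", "'")  # typographic apostrophe -> straight
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_phrase(text, phrase):
    """Check for a phrase, ignoring line breaks, case and apostrophe style."""
    return normalize(phrase) in normalize(text)


def pdf_text_from_url(pdf_url, session=None):
    """Download a PDF into memory and return the text of all its pages."""
    # Imported here so the helpers above work without these packages
    import requests
    from pypdf import PdfReader

    http = session or requests.Session()
    response = http.get(pdf_url, timeout=30)
    response.raise_for_status()
    reader = PdfReader(io.BytesIO(response.content))
    # extract_text() may yield an empty result for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical usage for one document (repeat in a loop for many URLs):
# text = pdf_text_from_url(
#     "https://fs.uit.ac.ma/wp-content/uploads/2022/02/AVIS-DE-SOUTENANCE-ACHACHI-HIND.pdf"
# )
# print(contains_phrase(text, "Sciences de la vie et de l’environnement"))
```

Normalizing before matching is a design choice: extracted PDF text often breaks a sentence across lines, so an exact substring test on the raw text can miss a phrase that is visibly there.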
