I am working on a script to extract text from law cases using https://case.law/docs/site_features/api. I have written methods for searching and for creating an xlsx file, which work well, but I am struggling with the method that should open an online PDF link, write it ('wb') to a temp file, read and extract the data (core text), then close the file. The ultimate goal is to use the content of these cases for NLP.
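In outline, the temp-file step I am aiming for looks like the sketch below. The function name process_pdf_bytes and the extract callback are placeholders of mine, and the real text extraction (which would need a PDF library) is deliberately left out.

```python
import os
import tempfile
from typing import Callable


def process_pdf_bytes(pdf_bytes: bytes, extract: Callable[[str], str]) -> str:
    """Write bytes to a temporary .pdf file, run `extract` on its path, then delete it.

    `extract` is a hypothetical callback standing in for the real text extraction.
    """
    # delete=False so the file can be reopened by path (needed on Windows)
    tmp = tempfile.NamedTemporaryFile(mode="wb", suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()  # flush and close before another reader opens the path
        return extract(tmp.name)
    finally:
        if not tmp.closed:
            tmp.close()
        os.remove(tmp.name)  # clean up the temp file in every case
```

The file is opened in binary mode ('wb') because a PDF is not text, and it is closed before the callback runs so that the callback can open it by path.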
I have prepared a function (see below) to download the file:
import urllib3

def download_file(file_id):
    http = urllib3.PoolManager()
    folder_path = "path_to_my_desktop"
    file_download = "https://cite.case.law/xxxxxx.pdf"
    file_content = http.request('GET', file_download)
    # the body is preloaded by default, so it is available as .data
    with open(folder_path + file_id + '.pdf', 'wb') as file_local:
        file_local.write(file_content.data)
    file_content.close()
The script runs without errors and the file is created on my desktop, but when I try to open the file manually, I get this message from Acrobat Reader:
Adobe Acrobat Reader could not open 'file_id.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded).
I thought the library was the problem, so I tried Requests / xlsxwriter / urllib3... (example below; I also tried reading the file from the script to see whether Adobe was the issue, but apparently it is not)
import re
import requests

# Download the pdf from the search results
URL = "https://cite.case.law/xxxxxx.pdf"
r = requests.get(URL, stream=True)
with open(path_to_desktop + pdf_name + '.pdf', 'w') as f:
    f.write(r.text)
# open the downloaded file and remove '<[^<]+?>' for easier reading
with open('C:/Users/amallet/Desktop/r.pdf', 'r') as ff:
    data_read = ff.read()
stripped = re.sub('<[^<]+?>', '', data_read)
print(stripped)
The output is:
document.getElementById('next').value = document.location.toString();
document.getElementById('not-a-bot-form').submit();
With 'wb' and 'rb' instead (and with the stripped step removed), the script is:
r = requests.get(test_case_pdf, stream=True)
with open('C:/Users/amallet/Desktop/r.pdf', 'wb') as f:
    f.write(r.content)
with open('C:/Users/amallet/Desktop/r.pdf', 'rb') as ff:
    data_read = ff.read()
print(data_read)
and the output is:
<html>
<head>
<noscript>
<meta http-equiv="Refresh" content="0;URL=?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" />
</noscript>
</head>
<body>
<form method="post" id="not-a-bot-form">
<input type="hidden" name="csrfmiddlewaretoken" value="5awGW0F4A1b7Y6bxrYBaA6GIvqx4Tf6DnK0qEMLVoJBLoA3ZqOrpMZdUXDQ7ehOz">
<input type="hidden" name="not_a_bot" value="yes">
<input type="hidden" name="next" value="/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" id="next">
</form>
<script>
document.getElementById(\'next\').value = document.location.toString();
document.getElementById(\'not-a-bot-form\').submit();
</script>
<a href="?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf">Click here to continue</a>
</body>
</html>
None of these attempts works. The PDF is not password-protected, and I tried other websites, where it does not work either.
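A quick way to check what was actually saved is to look at the first bytes: a real PDF starts with the signature %PDF-, while the response I pasted starts with <html>. The helper below is only a sketch, and describe_payload is a name I made up for it.

```python
def describe_payload(data: bytes) -> str:
    """Roughly classify downloaded bytes as 'pdf', 'html' or 'unknown'."""
    # Ignore leading whitespace and only inspect the start of the payload
    head = data.lstrip()[:1024]
    if head.startswith(b"%PDF-"):  # PDF files begin with this signature
        return "pdf"
    lowered = head.lower()
    if lowered.startswith((b"<html", b"<!doctype html")):
        return "html"
    return "unknown"
```

Given the output above, I would expect this to return 'html' for r.content, which would explain why Acrobat Reader refuses to open the saved file.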
Therefore, I am wondering whether I have another issue that is not linked to the code itself.
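The response does look like a form page rather than a PDF. To read it more easily, the hidden form fields can be listed with the standard library's html.parser. This is a sketch only: HiddenInputCollector and hidden_fields are my own names, and the code just lists the fields without submitting anything.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and attr.get("name"):
            # A missing value attribute is stored as an empty string
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html_text: str) -> dict:
    """Return the hidden input fields found in an HTML document."""
    parser = HiddenInputCollector()
    parser.feed(html_text)
    parser.close()
    return parser.fields
```

For the response above, this should list csrfmiddlewaretoken, not_a_bot and next, which suggests the server is asking the client to confirm it is not a bot before it serves the PDF.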
Please let me know if you need additional information.
Thank you.