
I would like to access and scrape the data from this link.

where

new_url='https://www.scopus.com/results/results.uri?sort=plf-f&src=s&imp=t&sid=2c816e0ea43cf176a59117097216e6d4&sot=b&sdt=b&sl=160&s=%28TITLE-ABS-KEY%28EEG%29AND+TITLE-ABS-KEY%28%22deep+learning%22%29+AND+DOCTYPE%28ar%29%29+AND+ORIG-LOAD-DATE+AFT+1591735287+AND+ORIG-LOAD-DATE+BEF+1592340145++AND+PUBYEAR+AFT+2018&origin=CompleteResultsEmailAlert&dgcid=raven_sc_search_en_us_email&txGid=cc4809850a0eff92f629c95380f9f883'

Accessing new_url via the following line

req = Request(new_url, headers={'User-Agent': 'Mozilla/5.9'})

produced the error

Webscraping: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop

A new set of lines was drafted:

import urllib.request
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup as soup

req = urllib.request.Request(new_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()

page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())

While no error is thrown, the line

print(page_soup.prettify())

outputs some unrecognized text:

6�>�.�t1k�e�LH�.��]WO�?m�^@� څ��#�h[>��!�H8����|����n(XbU<~�k�"���#g+�4�Ǻ�Xv�7�UȢB2� �7�F8�XA��W\�ɚ��^8w��38�@' SH�<_0�B���oy�5Bނ)E���GPq:�ќU�c���ab�h�$<ra� ;o�Q�a@ð�d\�&J3Τa�����:�I�etf�a���h�$(M�~���ua�$� n�&9u%ҵ*b���w�j�V��P�D�'z[��������)

with a warning

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

I suspect this can be resolved by reading it with utf-8 encoding, as below:

req = urllib.request.Request(new_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
with open(raw, 'r', encoding='utf-8') as f:
    page_soup = soup(f, 'html.parser')
    print(page_soup.prettify())

However, the interpreter returns an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
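For what it's worth, 0x8b looks like the second byte of the gzip magic number (0x1f 0x8b), and my request advertises Accept-Encoding: gzip, so the raw bytes may simply be compressed. A self-contained check of that hunch (the payload below is only a stand-in for the real response):

```python
import gzip

# Stand-in for the raw response body; the server may gzip its reply
# because the request advertised 'Accept-Encoding: gzip'.
payload = gzip.compress(b"<html>hello</html>")

# 0x1f 0x8b is the gzip magic number, so byte 1 is 0x8b (139).
print(payload[0], payload[1])  # 31 139

# Decompressing first yields bytes that UTF-8 can decode normally.
html = gzip.decompress(payload).decode("utf-8")
print(html)  # <html>hello</html>
```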

May I know what the problem is? I appreciate any insight.

mpx

2 Answers


Try using the requests library:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"}

with requests.Session() as s:
    r = s.get(new_url, headers=headers)  # new_url as defined in the question

soup = BeautifulSoup(r.text, 'lxml')
print(soup.get_text())

you can still use cookies here
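For example, cookies set by any response persist on the Session and are re-sent automatically; you can also seed the jar by hand. The JSESSIONID name and value below are purely hypothetical (e.g. copied from a logged-in browser session):

```python
import requests

s = requests.Session()
# Cookies from responses accumulate on s.cookies across requests.
# Seeding one manually (hypothetical name/value) for later requests:
s.cookies.set("JSESSIONID", "example-value", domain="www.scopus.com")
print(s.cookies.get("JSESSIONID"))  # example-value
```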

Edit: Updated the code to show the use of headers; this tells the website you are a browser rather than a program. For further login operations I would suggest using selenium instead of requests.

thf9527
  • Thanks for the response @thf9527, may I know how to enable cookies? Apparently the website denied access. It seems they know I am accessing via a non-conventional browser – mpx Jun 17 '20 at 12:17
  • 1
    Include a "User-Agent" header for the get request should get you to the login page https://stackoverflow.com/questions/6260457/using-headers-with-the-python-requests-librarys-get-method – thf9527 Jun 18 '20 at 07:34
  • I would recommend selenium instead of requests for your login operation as it seems to involve javascript. – thf9527 Jun 18 '20 at 07:36

If you want to use the urllib library, remove Accept-Encoding from the headers (and, for simplicity, specify only utf-8 in Accept-Charset):

req = urllib.request.Request(new_url, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'utf-8;q=0.7,*;q=0.3',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'})

The result is:

<!DOCTYPE html>
<!--  Form Name: START -->
<html lang="en">
 <!-- Template_Component_Name: id.start.vm -->
 <head>
  <meta charset="utf-8"/>

...etc.
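If decoding still looks off, a defensive pattern is to honour whatever charset the response itself declares, falling back to utf-8. A sketch of that pattern, using a data: URL as a self-contained stand-in for new_url (the cookie handling is unchanged):

```python
import urllib.request
from http.cookiejar import CookieJar

# Stand-in for new_url so this runs offline; any http(s) URL works the same.
demo_url = "data:text/html;charset=utf-8,<html><body>hello</body></html>"

cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
resp = opener.open(demo_url)
raw = resp.read()

# Decode with the charset the server declared, falling back to utf-8.
charset = resp.headers.get_content_charset() or "utf-8"
html = raw.decode(charset)
print(html)  # <html><body>hello</body></html>
```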
Andrej Kesely
  • 1
    Thanks for the suggestion. It works. But,it seems Scopus restrict user from accessing their website via non-browser approach despite using our institution connection. Accessing the above url stop at the Scopus main landing page. – mpx Jun 17 '20 at 12:29
  • 1
    Apologies for the -1, it wasn't on purpose and I can't undo it now for some reason – nbeuchat Jan 21 '22 at 11:31