Here's an extremely simple script (5 lines!) that I wrote. I'd like to fetch the page's HTML, specifically the elements with the subject_text and price classes.

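import re
from urllib import request

# Search results page (URL-encoded query, ss=ON_SALE filter)
url = 'https://section.cafe.naver.com/ca-fe/home/search/c-articles?q=%EB%A1%A4%EB%9E%9C%EB%93%9C&ss=ON_SALE'
# Download the raw response body and decode it as UTF-8
contents = request.urlopen(url).read().decode("utf8")
print(contents)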

But when I print the contents, there seems to be a noscript problem, because the output contains this:

<noscript><strong>We're sorry but web-pc doesn't work properly without JavaScript enabled. Please enable it to continue.</strong></noscript>

When I checked the Chrome settings, JavaScript was enabled. I disabled my ad blocker and tried incognito mode, but neither helped. All the other details in the other tags/classes are still there but cannot be fetched. Any ideas would be much appreciated. P.S. The page at the URL is not in English, but you shouldn't have any problem reading the HTML.

  • Nothing that you do in Chrome is going to affect what `urllib` gives you. My guess is that the site is looking for evidence that your browser is modern (e.g. testing the user-agent header). You can add such headers to your `urllib` request and pretend to be Chrome or some other modern browser, and maybe get what you need. – kindall Oct 04 '21 at 18:10
  • First of all, please don't call JavaScript "Java script". Next, most probably the page initially loads blank, and all the content is then downloaded using JavaScript. If I were you, I'd use a debugging proxy like Fiddler or OWASP ZAP and try to see which HTTP requests actually retrieve the data I'm looking for, and how to mimic them using urllib. Finally, I personally think urllib is less powerful than requests; it's worth checking https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-urllib3-and-requests-modul – Alex Oct 04 '21 at 18:13
  • Hi @kindall, could you please elaborate on how to add the headers? – trilingualAra Oct 05 '21 at 12:42
  • The first step is to visit https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending (or a similar page; I just Googled "what headers is my browser sending") in your browser. That will let you see what headers are being sent. Then you can add them to the `headers` parameter of a `Request` object (you'll need to build your own request rather than passing the bare URL to `urlopen()`). Hope that gets you started. – kindall Oct 05 '21 at 18:32
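
The header suggestion in the comments above can be sketched roughly as follows. This is a minimal sketch, not a confirmed fix: the header values are placeholders that should be replaced with whatever your own browser sends, and the names `BROWSER_HEADERS`, `build_request`, `fetch_html`, and `needs_javascript` are made up for illustration. If the page really builds its content with JavaScript, as the other comment suspects, browser-like headers alone will probably not help, so the sketch also includes a simple check for the noscript message quoted in the question.

```python
from urllib import request

# Placeholder header values (assumptions): copy the real ones from your browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36"
    ),
    "Accept-Language": "ko-KR,ko;q=0.9,en;q=0.8",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build a Request that carries browser-like headers instead of urllib's defaults."""
    return request.Request(url, headers=headers)


def fetch_html(url):
    """Send the request and return the response body decoded as UTF-8."""
    with request.urlopen(build_request(url)) as response:
        return response.read().decode("utf8")


def needs_javascript(html):
    """Heuristic: True if the HTML still contains the noscript notice from the question."""
    return "without JavaScript enabled" in html
```

Calling `needs_javascript(fetch_html(url))` on the URL from the question shows whether the headers made a difference. If it still returns `True`, the data is most likely loaded by a separate HTTP request, which is where the debugging-proxy approach from the comments comes in.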
