
I am having an issue with BS4 on Python 2.7.12 when reading links and files that were already URL-encoded when I downloaded them with wget to archive my Drupal website.

For example, a link that exists on the live website would be:

https://mywebsite.org/content/prime’s-and-“doubleprimes”-in-it (I know the grammar is incorrect: the ’s in the example is possessive, not plural)

The downloaded file would be:

/content/prime%E2%80%99s-and-%E2%80%9Cdoubleprimes%E2%80%9D-in-it

(This is helpful in identifying different typography: http://www.w3schools.com/TAGS/ref_urlencode.asp)
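
As a quick check of how the typographic quotes map to those escapes, here is a minimal sketch. It assumes Python 3's `urllib.parse` (on Python 2.7 the corresponding functions are `urllib.quote` and `urllib.unquote`, which work on byte strings). The path is the example from this question.

```python
from urllib.parse import quote, unquote

# Path as it appears on the live site, with typographic quotes.
live_path = "/content/prime\u2019s-and-\u201cdoubleprimes\u201d-in-it"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
encoded = quote(live_path)
print(encoded)

# unquote() reverses it, decoding the bytes as UTF-8 by default.
assert unquote(encoded) == live_path
```

Each curly quote is three bytes in UTF-8, which is why every one of them shows up as three `%XX` escapes starting with `%E2`.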

My script loops through each file and flattens the site by adding ".html" to all links. However, when I use BS4 to do this, the link paths change, apparently because the already URL-encoded links are being encoded again. As a result, the link above becomes:

/content/prime%2580%2599s-and-%2580%259Cdoubleprimes%2580%259D-in-it

The link then no longer works. You can see the %25 being used to encode the % signs of the original escapes (those beginning with %E2, for example).
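
The `%25` pattern is what percent-encoding produces when it is applied to a string that is already encoded, because `%` itself is escaped as `%25`. The sketch below reproduces that effect and shows one possible way to make the encoding step idempotent (decode first, then encode once). It assumes Python 3's `urllib.parse`. `encode_once` is a hypothetical helper name, and the sketch illustrates the symptom rather than what BS4 does internally.

```python
from urllib.parse import quote, unquote

already_encoded = "/content/prime%E2%80%99s-and-%E2%80%9Cdoubleprimes%E2%80%9D-in-it"

# Encoding a second time escapes every '%' as '%25'.
double_encoded = quote(already_encoded)
print(double_encoded)

def encode_once(path):
    """Hypothetical helper: decode any existing escapes, then encode exactly once."""
    return quote(unquote(path))

# Applying the helper to an already-encoded path leaves it unchanged.
assert encode_once(already_encoded) == already_encoded
```

One caveat with this approach: a path that deliberately contains an escaped reserved character such as `%2F` would come back with a literal `/`, so it is only safe when the escapes stand for ordinary text like the quotes here.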

There have been many questions about encoding with BS4, but most of them deal specifically with UTF-8. I understand that BS4 automatically reads the "soup" as UTF-8, but I'm unsure why it is re-URL-encoding links that are already encoded. I have tried `soup = BeautifulSoup(text.read().decode('utf-8','ignore'))` as suggested here, which fixed an issue where BS4 was trying to interpret %E2 as a Unicode character; however, I haven't found anything about the re-encoding of already URL-encoded characters. I have also tried adding `formatter="html"` to my `soup.prettify` call, but this did not work either, since the files had already been read and interpreted by that point.
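
For the flattening step itself, one option is to treat each `href` as an opaque string and append the suffix only to its path component, so that existing percent-escapes are never decoded or re-encoded. This is a minimal sketch, assuming Python 3's `urllib.parse` (the Python 2.7 equivalents live in the `urlparse` module). `add_html_suffix` is a hypothetical helper, not part of BS4, and the sketch does not address whether BS4 or another step in the script is doing the re-encoding.

```python
from urllib.parse import urlsplit, urlunsplit

def add_html_suffix(href):
    """Hypothetical helper: append '.html' to the path of a link,
    leaving existing percent-escapes, query, and fragment untouched."""
    parts = urlsplit(href)
    path = parts.path
    # Skip empty paths, directory-style paths, and paths already ending in .html.
    if not path or path.endswith("/") or path.endswith(".html"):
        return href
    return urlunsplit(parts._replace(path=path + ".html"))

link = "/content/prime%E2%80%99s-and-%E2%80%9Cdoubleprimes%E2%80%9D-in-it"
print(add_html_suffix(link))
```

A real script would probably also need to skip links to images, stylesheets, and other non-page files, which this sketch does not attempt.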

Kate
  • do you mean something like this `urllib.parse.unquote('/content/prime%25E2%2580%2599s-and-%25E2%2580%259Cdoubleprimes%25E2%2580%259D-in-it')` – furas Oct 21 '16 at 17:25
  • I'm not sure - I just tried it and it said that `urllib` had no attribute `parse` ? – Kate Oct 21 '16 at 17:33
  • it means you use Python2 and I use Python3 :) – furas Oct 21 '16 at 17:33
  • try `urllib.unquote()` – furas Oct 21 '16 at 17:38
  • Ok yes I did mention I was using Python 2.7.12 in my initial question. Trying `urllib.unquote(link)` gives me a link that prints like this: `/content/prime%80%99s-and-%80%9Cdoubleprimes%80%9D-in-it` – Kate Oct 21 '16 at 17:41
  • I tried BS4 (in Python 2) and it never changed any URL for me. I can't reproduce your problem, so it is difficult to do anything about it. – furas Oct 21 '16 at 17:41
  • Sorry, I was thinking about the problem so I didn't notice "2.7.12" in the text. – furas Oct 21 '16 at 17:44
  • Did `urllib.unquote()` give you the expected URL? Or maybe show your code and the file that causes the problem. – furas Oct 21 '16 at 17:54
  • Trying `urllib.unquote(link)` gives me a link that prints like this: `/content/prime%80%99s-and-%80%9Cdoubleprimes%80%9D-in-it` – Kate Oct 21 '16 at 18:50
  • On my computer it prints with `%E2`, so it looks like the link you expect. Maybe something is wrong with your BS4, or it happens only on your OS. I use Linux Mint 17, Python 2.7.12, BS 4.5.0 – furas Oct 21 '16 at 20:23
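
The differing results in the comments may simply come from different input strings: the string in the first comment contains `%25E2`, while the rewritten link quoted in the question has no `E2` at all. A minimal sketch comparing the two, assuming Python 3's `urllib.parse.unquote` (Python 2.7's `urllib.unquote` should behave the same way on these ASCII-only inputs):

```python
from urllib.parse import unquote

# String from the first comment: every '%' of the original escapes became '%25'.
with_e2 = "/content/prime%25E2%2580%2599s-and-%25E2%2580%259Cdoubleprimes%25E2%2580%259D-in-it"
# String as reported in the question: no 'E2' bytes remain.
without_e2 = "/content/prime%2580%2599s-and-%2580%259Cdoubleprimes%2580%259D-in-it"

# unquote() makes a single pass, so each '%25' becomes '%' and nothing more.
print(unquote(with_e2))
print(unquote(without_e2))
```

The first call recovers the original `%E2%80%99` escapes. The second yields `%80%99`, because the `E2` byte was already missing from the input and unquoting cannot restore it.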

0 Answers