I am having an issue with BS4 on Python 2.7.12 when reading links and filenames that were already URL-encoded by wget when I used it to archive my Drupal website.
For example, a link that exists on the live website would be:
https://mywebsite.org/content/prime’s-and-“doubleprimes”-in-it
(I know this is grammatically incorrect, since the ’s in the example is possessive, not plural.)
The downloaded file would be:
/content/prime%E2%80%99s-and-%E2%80%9Cdoubleprimes%E2%80%9D-in-it
(This is helpful in identifying different typography: http://www.w3schools.com/TAGS/ref_urlencode.asp)
My script loops through each downloaded file and flattens the site by appending ".html" to every link. However, when I use BS4 to do this, it actually changes the link path: it seems to re-encode links that are already URL-encoded. As a result, it changes the link above to:
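For context, here is a minimal stdlib sketch of the flattening step I am trying to achieve, which avoids BS4's re-serialization entirely (the `/content/` pattern, the sample markup, and the `flatten_links` name are illustrative, not my actual script):

```python
import re

def flatten_links(html):
    """Append .html to internal hrefs without re-encoding them.

    A regex rewrite touches only the href value itself, so percent-escapes
    like %E2%80%99 pass through untouched. The /content/ prefix is
    illustrative; the pattern would need adjusting to the site's own paths.
    """
    return re.sub(
        r'(href=")(/content/[^"]+)(")',
        lambda m: m.group(1) + m.group(2) + '.html' + m.group(3),
        html,
    )

page = '<a href="/content/prime%E2%80%99s-and-%E2%80%9Cdoubleprimes%E2%80%9D-in-it">link</a>'
print(flatten_links(page))
# The escapes survive intact, with .html appended inside the quote.
```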
/content/prime%2580%2599s-and-%2580%259Cdoubleprimes%2580%259D-in-it
And thus it wouldn't work. You can see the %25 it inserts to encode the % signs that begin each %E2 sequence, for example.
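The double-encoding itself is easy to reproduce with the standard library (Python 3 shown here; in 2.7 the same functions live in the `urllib` module as `quote`/`unquote`, though 2.7's `unquote` does not decode UTF-8 for you):

```python
from urllib.parse import quote, unquote

encoded = '/content/prime%E2%80%99s-and-%E2%80%9Cdoubleprimes%E2%80%9D-in-it'

# Quoting an already-quoted path escapes each '%' to '%25',
# producing exactly the broken %25... form described above.
double = quote(encoded)

# Declaring '%' safe leaves an already-encoded path untouched.
untouched = quote(encoded, safe='/%')

# unquote() recovers the human-readable text (UTF-8 decoded in Python 3).
readable = unquote(encoded)
```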
There have been many questions about encoding with BS4, but most of them deal specifically with UTF-8. I understand that BS4 automatically reads the "soup" into UTF-8, but I'm unsure why it would re-URL-encode links that are already encoded. I have tried soup = BeautifulSoup(text.read().decode('utf-8', 'ignore')) as suggested here, which fixed an issue where BS4 was interpreting %E2 as a Unicode character, but I haven't found anything about the re-encoding of already-URL-encoded characters. I have also tried adding formatter="html" to my soup.prettify() call, but that did not work either, since the files had already been read and interpreted by that point.