I have XML with many URL-encoded web links. I can't use this XML until I decode all the links in it.

I have written this code in Python:

import re
from urllib.parse import unquote
from transliterate import translit, get_available_language_codes

myString = """><tr><td style="text-align: center;"><a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%259E%25D0%259F%2B%25D0%2592%25D0%25A0%%25D0%2590.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="470" data-original-width="820" height="366" src="https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%22%25D0%2598.%2B%25D0%25A1%25D0%2590%25D0%259C%25D0%25AB%25D0%2595%90.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">;<br /><a name='more'></a><br /><br /><div align="center"><script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg"""
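# raw string so that \s is not treated as a string escape
b = re.findall(r"(?P<url>https?://[^\s]+)", myString)
# re.findall returns a list, so each link has to be decoded individually
c = [unquote(unquote(url)) for url in b]
d = [translit(url, 'ru', reversed=True) for url in c]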

Now I can: 1. Decode any single link separately. 2. Create an array of decoded links.

But I have no idea how to replace all the encoded links in myString with the ones I have decoded.

I have found a way to get all the decoded links, but I don't really know how to replace the old ones in myString with the new ones.

  • Do you need a regex here? `html.unescape(myString)` should get you most of the way... – Jon Clements Sep 16 '18 at 18:43
  • Thank you for the comment. Unfortunately, it doesn't give proper links. –  Sep 16 '18 at 18:52
  • Nope - there's still more work to be done but I'm guessing that's getting closer to what you're after? – Jon Clements Sep 16 '18 at 18:52
  • If you have a look at the bs4 docs, you can see you can also replace elements in a soupified string... I'm not clear on exactly what operations you want to perform or which attributes to replace but hopefully you've got a decent enough starting point. – Jon Clements Sep 16 '18 at 19:00
  • You are looking for a callback method passed as the replacement argument to `re.sub`. And that makes it a dupe of [Passing a function to re.sub in Python](https://stackoverflow.com/questions/18737863/passing-a-function-to-re-sub-in-python). Here is [another helpful post](https://stackoverflow.com/questions/39009967/how-to-replace-an-re-match-with-a-transformation-of-that-match). `re.sub(r'https?://\S+', your_escaping_method, text)` – Wiktor Stribiżew Sep 16 '18 at 19:07
  • Indeed, "another helpful post" is very close to what I have been looking for. Thank you for the comment. –  Sep 16 '18 at 19:12
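
The `re.sub` callback suggested in the comments can be sketched roughly as follows. This is a minimal illustration, not a definitive solution: the names `URL_PATTERN`, `decode_url` and `decode_links` are hypothetical, the sample string is a shortened stand-in for the real data, and the pattern assumes a link ends at whitespace or a quote character (the pattern in the question stops only at whitespace).

```python
import re
from urllib.parse import unquote

# Assumption: a link ends at whitespace or at a quote character.
URL_PATTERN = re.compile(r'https?://[^\s"\']+')

def decode_url(match):
    """Callback for re.sub: return the matched link decoded twice."""
    return unquote(unquote(match.group(0)))

def decode_links(text):
    """Replace every link in text with its double-decoded form."""
    return URL_PATTERN.sub(decode_url, text)

# Shortened, hypothetical sample in the style of the question's data.
sample = '<a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E.jpg">x</a>'
print(decode_links(sample))
```

Because `re.sub` calls the function once per match and splices the return value back in, no separate list of decoded links is needed; the rest of the string is left untouched.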

1 Answer

You can use html.unescape to make your string more readily parseable, then use BeautifulSoup4 (pip install bs4) to loop over all tags and do whatever's needed to get the src/href (or whichever attributes you specify) into shape, then convert the soup object back to a string.

from html import unescape
from urllib.parse import unquote
from bs4 import BeautifulSoup

myString = """&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%259E%25D0%259F%2B%25D0%2592%25D0%25A0%%25D0%2590.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" data-original-height="470" data-original-width="820" height="366" src="https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%22%25D0%2598.%2B%25D0%25A1%25D0%2590%25D0%259C%25D0%25AB%25D0%2595%90.jpg" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div align="center"&gt;&lt;script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg"""

soup = BeautifulSoup(unescape(myString), 'html.parser')
# loop over all elements and update any src/href attributes
for tag in soup.find_all():
    for attr in tag.attrs.keys() & {'src', 'href'}:
        # do whatever else with tag[attr] here
        tag[attr] = unquote(unquote(tag[attr]))

output = str(soup)

Gives you:

'&gt;<tr><td style="text-align: center;"><a href="https://somewebsite.com/s1600/ГОРОСКОП+ВР%А.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="470" data-original-width="820" height="366" src=\'https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/ГОРОСК�"И.+САМЫЕ�.jpg\' width="640"/></a></td></tr><tr><td class="tr-caption" style="text-align: center;">;<br/><a name="more"></a><br/><br/><div align="center">&lt;script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg</div></td></tr>'

Of course - your mileage will vary with how much the parser can make sense of the input to start with.

Jon Clements
  • Thank you for the code, but I have already created an array of decoded web links. The problem is that I don't know how to replace the old links in myString with the new ones. –  Sep 16 '18 at 19:03
  • Okay - two ticks – Jon Clements Sep 16 '18 at 19:11
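
If a list of decoded links already exists, as described in the comment above, one possible way to put them back is to pair each original link with its decoded form and use `str.replace`. The sketch below is only an illustration under stated assumptions: `replace_links` is a hypothetical helper, the sample string is made up, and the pattern assumes a link ends at whitespace or a quote character.

```python
import re
from urllib.parse import unquote

def replace_links(text, encoded, decoded):
    """Swap each encoded link in text for its decoded counterpart."""
    pairs = dict(zip(encoded, decoded))
    # Longest links first, so a link that is a prefix of another
    # does not partially rewrite the longer one.
    for old in sorted(pairs, key=len, reverse=True):
        text = text.replace(old, pairs[old])
    return text

# Hypothetical sample with the same link appearing twice.
sample = ('src="https://somewebsite.com/a/%25D0%2590.jpg" '
          'href="https://somewebsite.com/a/%25D0%2590.jpg"')
encoded = re.findall(r'https?://[^\s"\']+', sample)
decoded = [unquote(unquote(url)) for url in encoded]
print(replace_links(sample, encoded, decoded))
```

This keeps the existing find-then-decode workflow; the trade-off is that `str.replace` works on raw substrings, so it relies on the encoded links being extracted exactly as they appear in the text.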