I'm trying to parse an HTML page that also contains JavaScript or jQuery, extract the class names, and then replace each one with a random string. Extracting the class names is easy, but replacing them is causing trouble. This is the code I have so far:
import re
class_ids = [tag.split() for tag in re.findall(r'class=(?:"|\')([a-zA-Z0-9_\s-]+)(?:"|\')', html_page)]
class_ids = set(item for sublist in class_ids for item in sublist)
For each class I'll generate a corresponding random class name (e.g. footer : sjrh13li). Simply replacing the string footer throughout the file would also replace it in the body text, and a class name like title would also turn the tag <title></title> into <cjir4331></cjir4331>. I've tried replacing the whole attribute, like class="title" => class="cjir4331", but this doesn't handle cases like class="title huge", because I need to detect the classes title and huge separately and replace each of them. The HTML is also mixed with JavaScript, so document.getElementsByClassName('someClass') must be converted to document.getElementsByClassName('noleretko4356').
Is there any way around this?
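One direction I've been considering, as a rough sketch only: instead of replacing strings globally, match the whole class attribute (or the getElementsByClassName call), split its value into tokens, and swap each token through a lookup table. The helper names random_name, build_mapping and obfuscate are made up for illustration, and the sketch assumes class names only ever appear as quoted string literals in those two places. It does not touch CSS selectors such as .title or jQuery calls like $('.title'), which would need their own patterns.

```python
import random
import re
import string


def random_name(length=8):
    # Start with a letter so the result is still a valid CSS class name.
    first = random.choice(string.ascii_lowercase)
    rest = "".join(random.choices(string.ascii_lowercase + string.digits, k=length - 1))
    return first + rest


def build_mapping(class_ids):
    # Hypothetical helper: one unique random name per original class name.
    mapping, used = {}, set()
    for name in sorted(class_ids):
        new = random_name()
        while new in used:
            new = random_name()
        used.add(new)
        mapping[name] = new
    return mapping


# Same character set as the extraction regex, but the prefix, the quote
# and the attribute value are captured separately.
CLASS_ATTR = re.compile(r'(class=)(["\'])([a-zA-Z0-9_\s-]+)\2')
# Assumption: the argument is a plain string literal, not a variable.
JS_CALL = re.compile(r'(getElementsByClassName\(\s*)(["\'])([a-zA-Z0-9_\s-]+)\2')


def obfuscate(html_page, mapping):
    def swap(match):
        prefix, quote, value = match.groups()
        # Replace each class token on its own; unknown tokens are kept as-is.
        tokens = [mapping.get(tok, tok) for tok in value.split()]
        return prefix + quote + " ".join(tokens) + quote

    html_page = CLASS_ATTR.sub(swap, html_page)
    return JS_CALL.sub(swap, html_page)
```

Calling obfuscate(html_page, build_mapping(class_ids)) would then rewrite class="title huge" token by token while leaving <title></title> and the body text alone, since only text inside a matched attribute or call is changed. I'm not sure regexes are robust enough here compared to a real HTML parser, so I'd welcome a better approach.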