Python: Decoding base64 encoded strings within an HTML file and replacing these strings with their decoded counterpart

Question

Please help because this flipping program is my ongoing nightmare!

I have several files that include some base64 encoded strings. Part of one file, for example, reads as follows:

charset=utf-8;base64,I2JhY2tydW5uZXJfUV81c3R7aGVpZ2h0OjkzcHg7fWJhY2tydW5uZXJfUV81c3R7ZGlzcGxheTpibG9jayFpbXBvcnRhbnQ7fQ=="

They are always in the format "ANYTHINGbase64,STRING". It is HTML, but I am treating it as one large string and using BeautifulSoup elsewhere. I am using a regex, 'base', to extract the base64 string, then using the base64 module to decode it in my own function, "debase".

This seems to work ok up to a point: the output of b64decode for some reason adds unnecessary stuff:

b'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}' where the string I actually want is the part in the middle, between the quotes.

I'm guessing this means it is in bytes, so I have tried getting my function to encode it as utf8, but basically I am out of my depth.

The end result that I want is for all "base64,STRING" in my html to be decoded and replaced with DECODEDSTRING.

Please help!

import os, sys, bs4, re, base64, codecs
from bs4 import BeautifulSoup

def debase(instr):
    outstring = base64.b64decode(instr)
    outstring = codecs.utf_8_encode(str(outstring))
    outstring.split("'")[1]
    return outstring

base = re.compile('base64,(.*?)"')

for eachArg in sys.argv[1:]:
    a=open(eachArg,'r',encoding='utf8')
    presoup = a.read()
    b = re.findall(base, presoup)
    for value in b:
        re.sub('base64,.*?"', debase(value))
        print(debase(value))


    soup=BeautifulSoup(presoup, 'lxml')
    bname= str(eachArg).split('.')[0]
    a.close()
    [s.extract() for s in soup('script')]
    os.remove(eachArg)
    b=open(bname +'.html','w',encoding='utf8')
    b.write(soup.prettify())
    b.close()

Those characters in the middle are really there in the base64-encoded string; the charset tells you the encoding is utf8 and if you decode those bytes using `utf8` you'll get the "same" result: it's all ASCII. So the question is why those characters are there in the base64-encoded string to begin with. Based on your output I think your decoding attempt works right. — Andras Deak -- Слава Україні, Apr 20 '18 at 11:34
Sorry the bit I don't understand is that there is an extra b' ' and apostrophe that I am trying to remove — lgjmac, Apr 20 '18 at 11:49
So I want to be able to swap it in plain text. But the output seems to be binary or bytes data — lgjmac, Apr 20 '18 at 11:50

Andras Deak -- Слава Україні · Accepted Answer · 2018-04-20T14:06:13.053

Your input is a bit oddly formatted (with a trailing unmatched quotation mark, for instance), so make sure you're not doing unnecessary work or parsing content in a weird way.

Anyway, assuming you have your input in the form it's given, you have to decode it using base64 in the way you just did, then decode using the given encoding to get a string rather than a bytestring:

import base64

inp = 'charset=utf-8;base64,I2JhY2tydW5uZXJfUV81c3R7aGVpZ2h0OjkzcHg7fWJhY2tydW5uZXJfUV81c3R7ZGlzcGxheTpibG9jayFpbXBvcnRhbnQ7fQ=="'
head,tail = inp.split(';')
_,enc = head.split('=') # TODO: check if the beginning is "charset"
_,msg = tail.split(',') # TODO: check that the beginning is "base64"

plaintext_bytes = base64.b64decode(msg)
plaintext_str = plaintext_bytes.decode(enc)

Now the two results are

>>> plaintext_bytes
b'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}'
>>> plaintext_str
'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}'

As you can see, the content of the bytes was already readable; this is because the contents were ASCII. Also note that I didn't remove the trailing quote from your string: with its default settings, `base64.b64decode` ignores what comes after the two equals signs in the content.
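
To check the point about the trailing quote, here is a minimal sketch. It assumes the default `validate=False` behaviour of `base64.b64decode`; the variable names are made up for illustration. With `validate=True` the same input is rejected, which is worth knowing if you would rather catch malformed input than silently accept it.

```python
import base64
import binascii

msg = 'I2JhY2tydW5uZXJfUV81c3R7aGVpZ2h0OjkzcHg7fWJhY2tydW5uZXJfUV81c3R7ZGlzcGxheTpibG9jayFpbXBvcnRhbnQ7fQ=="'

# Default (validate=False): the stray trailing quote does not change the result.
lenient = base64.b64decode(msg)
clean = base64.b64decode(msg.rstrip('"'))
same_result = lenient == clean

# validate=True: input that is not strictly valid base64 raises binascii.Error.
try:
    base64.b64decode(msg, validate=True)
    strict_rejects = False
except binascii.Error:
    strict_rejects = True
```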

In a nutshell, strings are a somewhat abstract representation of text in Python 3, and you need a specific encoding if you want to represent the text as a stream of ones and zeros (which you need when you transfer data from one place to another). When you receive text as bytes, you have to know how it was encoded in order to decode it and obtain a proper string. If the text is ASCII-only then the choice of encoding hardly matters, but once more general characters appear your code will break if you use the wrong encoding.
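
A small illustration of that last paragraph, as a sketch with an arbitrary sample string: the same text becomes different bytes under different encodings, and decoding with the wrong one either raises or quietly produces garbage.

```python
text = 'naïve café'  # a str: an abstract sequence of Unicode characters

utf8_bytes = text.encode('utf-8')      # concrete bytes under UTF-8
latin1_bytes = text.encode('latin-1')  # different bytes under Latin-1

# Decoding with the encoding that produced the bytes round-trips cleanly.
round_trip = utf8_bytes.decode('utf-8')

# Decoding with the wrong encoding can raise an error...
try:
    latin1_bytes.decode('utf-8')
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

# ...or succeed silently while returning text that is not the original.
garbled = utf8_bytes.decode('latin-1')

# ASCII-only text is byte-for-byte identical under both encodings.
ascii_same = 'height:93px'.encode('utf-8') == 'height:93px'.encode('latin-1')
```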

Quick question: the reason there is a single trailing quotation mark is that I was using it to delimit the regex match. Is there a way to make the regex stop without knowing in advance what the first non-base64 character is? — lgjmac, Apr 20 '18 at 15:34
@lgjmac the two equals signs at the end denote the padded end of the base64 string. You can always explicitly include the characters you want to match as part of the base64 string in the regex; but honestly I suspect that it should be straightforward to use BeautifulSoup itself to extract the information you need (i.e. the tag/field/whatever containing your base64 string without having to use regex), it's an HTML parser library after all. If you can't do that you can write a nasty regex that only matches the 64 characters used by base64. I'd try the latter. — Andras Deak -- Слава Україні, Apr 20 '18 at 16:16
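
For what it's worth, the regex route from the comment above could look roughly like this. It is only a sketch under some assumptions: the names `B64_RE`, `decode_match` and `decode_all` are hypothetical, the payload is assumed to be UTF-8 text (rather than read from the `charset=` part as in the answer), and anything that fails to decode is left untouched. Passing a function to `re.sub` lets each match be replaced by its own decoded text.

```python
import base64
import re

# "base64," followed by a run of base64-alphabet characters and
# up to two '=' padding characters; the match stops at anything else.
B64_RE = re.compile(r'base64,([A-Za-z0-9+/]+={0,2})')

def decode_match(match, encoding='utf-8'):
    """Return the decoded text for one match, or the original text on failure."""
    try:
        raw = base64.b64decode(match.group(1), validate=True)
        return raw.decode(encoding)
    except ValueError:
        # binascii.Error and UnicodeDecodeError are both ValueError subclasses;
        # leave payloads that are malformed or not text (e.g. images) as they are.
        return match.group(0)

def decode_all(html_text):
    """Replace every 'base64,STRING' occurrence with its decoded text."""
    return B64_RE.sub(decode_match, html_text)

sample = 'x charset=utf-8;base64,aGVsbG8=" y'
result = decode_all(sample)
```

Here `result` keeps everything around the match, including the closing quote, so the surrounding HTML is not disturbed.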
