
I have a string:

'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

And I want:

b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

But I keep getting:

b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'

Context

I scraped a string off of a webpage and stored it in the variable un. Now I want to decompress it using BZip2:

bz2.decompress(un)

However, since un is a str object, I get this error:

TypeError: a bytes-like object is required, not 'str'

Therefore, I need to convert un to a bytes-like object without changing the single backslash to an escaped backslash.

Edit 1: Thank you for all the help! @wim I understand what you mean now, but I am at a loss as to how I can retrieve a bytes-like object from my webscraping method:

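import re

import requests
from lxml import html

r = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')

# Parse the page and keep the second and third lines of the first HTML comment
doc = html.fromstring(r.content)
comment = doc.xpath('//comment()')[0].text.split('\n')[1:3]

# Match lines such as "un: '...'" and capture the quoted value
pattern = re.compile("[a-z]{2}: '(.+)'")

un = re.search(pattern, comment[0]).group(1)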

The packages that I am using are requests, lxml.html, re, and bz2.

Once again, my goal is to decompress un using bz2, but I am having difficulty getting a bytes-like object from my webscraping process.

Any pointers?

Bryan Yao
  • Show the conversion code – Mad Physicist Jul 21 '18 at 15:54
  • You had bytes that were decoded using the incorrect encoding. It's impossible to tell, from the text alone, which incorrect encoding was used - but it looks like it was latin-1. – wim Jul 21 '18 at 16:06
  • It looks like [this](http://www.pythonchallenge.com/pc/def/integrity.html) was a programming challenge and the username/password (huge/file) were intentionally obfuscated in an html comment. In this case, using `un.encode('raw_unicode_escape')` is acceptable! – wim Jul 21 '18 at 20:44

2 Answers


Your bug is earlier in the code. The only acceptable solution is to change the scraping code so that it returns a bytes object rather than a text object. Do not try to "convert" your string un into bytes; it cannot be done reliably.

Do NOT do this:

>>> un = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> bz2.decompress(un.encode('raw_unicode_escape'))
b'huge'

The "raw_unicode_escape" codec is essentially Latin-1 with a built-in fallback for characters outside of it: those code points are written as \uXXXX or \UXXXXXXXX escapes. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol. If your string contains any Unicode character that cannot be represented as a \xXX sequence, your data will become corrupted.
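To make the fallback behaviour concrete, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the scraped page.

```python
# Characters up to U+00FF: raw_unicode_escape agrees with Latin-1.
latin = 'A\xaf\x82'
latin_bytes = latin.encode('raw_unicode_escape')
same_as_latin1 = latin_bytes == latin.encode('latin-1')

# An existing backslash is passed through unchanged (a single 0x5c byte).
backslash_bytes = '\\'.encode('raw_unicode_escape')

# A code point above U+00FF (here the euro sign) does not raise an error;
# it is silently replaced by the six ASCII bytes of a textual \uXXXX escape.
euro_bytes = '\u20ac'.encode('raw_unicode_escape')

print(latin_bytes, same_as_latin1, backslash_bytes, euro_bytes)
```

The last case is the dangerous one for binary data: the result is a valid bytes object, but it no longer corresponds to the original payload.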

The web scraping code has no business returning bz2-encoded bytes as a str, so that's where you need to address the cause of the problem, rather than attempting to deal with the symptoms.
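As a rough sketch of dealing with this at the source: this assumes the page carries the payload as literal backslash escape text (the characters `\`, `x`, `a`, `f`, and so on), which the doubled backslashes in the question's output suggest but which is not confirmed here. The helper name `unescape_to_bytes` is hypothetical, and a raw string literal stands in for the scraped text so that no network access is needed.

```python
import bz2


def unescape_to_bytes(text):
    """Hypothetical helper: turn literal escapes such as \\xaf into raw bytes.

    Assumes every character of `text` fits in Latin-1; otherwise the first
    encode() call raises UnicodeEncodeError.
    """
    return text.encode('latin-1').decode('unicode_escape').encode('latin-1')


# Stand-in for the scraped value: the backslashes here are literal characters.
scraped = r'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

payload = unescape_to_bytes(scraped)
print(bz2.decompress(payload))
```

If the scraped text instead already holds one character per byte (no literal backslashes), this helper is the wrong tool and the scraping step itself needs to change, as described above.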

wim

If I understand your goal correctly, this can be achieved by:

word = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

my_byte_array = word.encode()

print(my_byte_array)

The result came out to be:

b'BZh91AY&SYA\xc2\xaf\xc2\x82\r\x00\x00\x01\x01\xc2\x80\x02\xc3\x80\x02\x00 \x00!\xc2\x9ah3M\x07<]\xc3\x89\x14\xc3\xa1BA\x06\xc2\xbe\x084'

If this isn't enough, there is a good discussion about this in this SO post. It covers the recommended ways (according to PEP) to encode strings as UTF-8 byte arrays, along with the other methods of the class.

Jonathan
  • Now try to `bz2.decompress` those UTF-8 encoded bytes. – wim Jul 21 '18 at 15:56
  • Okay, I am almost certain I don't understand the goal now. Sorry, I saw `"And I want: b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084' "` in @Bryan Yao's post and thought he just wanted a byte type object without the escape characters. – Jonathan Jul 21 '18 at 16:16
  • Also @wim you are correct, I definitely cannot bz2 decompress the byte object- leads to `OSError: Invalid data stream` – Jonathan Jul 21 '18 at 16:20
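The failure described in these comments can be reproduced with a short sketch. It is for illustration only and uses the string from the question: it compares a one-byte-per-character encoding with the default UTF-8 encoding, and does not endorse converting the string as a general fix.

```python
import bz2

un = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

one_byte_each = un.encode('latin-1')  # exactly one byte per character
utf8 = un.encode()                    # UTF-8: characters >= U+0080 take two bytes

# The UTF-8 version is longer, so it is no longer the same bz2 stream.
print(len(un), len(one_byte_each), len(utf8))
print(bz2.decompress(one_byte_each))

try:
    bz2.decompress(utf8)
    utf8_ok = True
except OSError:
    # The inflated byte sequence is rejected as an invalid data stream.
    utf8_ok = False

print(utf8_ok)
```

The extra `\xc2` and `\xc3` bytes visible in the UTF-8 output are what break the compressed stream.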