I'm trying to parse an HTML page I retrieved through pyCurl, but the pyCurl WRITEFUNCTION returns the page as bytes rather than a string, so I'm unable to parse it using BeautifulSoup.

Is there any way to convert io.BytesIO to io.StringIO?

Or is there any other way to parse the HTML page?

I'm using Python 3.3.2.

Shipra
  • does the naive approach of exhausting the `BytesIO` and then constructing a `StringIO` from the output not satisfy your constraints? – anthony sottile Jul 04 '14 at 04:32

2 Answers

The code in the accepted answer actually reads the whole stream before decoding. Below is a better way: wrap the byte stream in a text stream, so that the data can be read chunk by chunk.

import io

# Initialize a read buffer
buffer = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'ÁÇÊ'.encode('utf-8')
)
# Wrap the byte stream in a text stream that decodes on the fly
wrapper = io.TextIOWrapper(buffer, encoding='utf-8')

# Read from the buffer
print(wrapper.read())
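
The snippet above still calls `read()` with no argument, which pulls everything at once. A minimal sketch of reading chunk by chunk might look like the following; the helper name `iter_text_chunks` and the tiny `chunk_chars` default are illustrative assumptions, not part of the original answer.

```python
import io

def iter_text_chunks(byte_stream, encoding="utf-8", chunk_chars=4):
    """Yield decoded text from a binary stream, a few characters at a time.

    `iter_text_chunks` and `chunk_chars` are hypothetical names for this sketch.
    """
    wrapper = io.TextIOWrapper(byte_stream, encoding=encoding)
    while True:
        # For a text stream, the size argument counts characters, not bytes.
        chunk = wrapper.read(chunk_chars)
        if not chunk:
            break
        yield chunk

buffer = io.BytesIO("ÁÇÊ and some ASCII text".encode("utf-8"))
chunks = list(iter_text_chunks(buffer))
print(chunks)
```

Note that a `TextIOWrapper` closes the underlying stream when it is closed or garbage-collected, so call `wrapper.detach()` first if the byte stream needs to stay open afterwards.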
kakarukeys

A naive approach:

# assume bytes_io is a `BytesIO` object positioned at the start of the stream
byte_str = bytes_io.read()

# Decode to a text (`str`) object
text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect

# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need
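
One caveat with `read()`: it starts from the current stream position, so a buffer that has just been written to (for example by a write callback) yields nothing until it is rewound. The sketch below illustrates the difference on a made-up buffer; the sample text is an assumption for demonstration only.

```python
import io

bytes_io = io.BytesIO()
bytes_io.write("café".encode("utf-8"))  # the position is now at the end

at_end = bytes_io.read()       # nothing left after the current position
whole = bytes_io.getvalue()    # the entire buffer, regardless of position

bytes_io.seek(0)               # rewind, then read() sees everything
after_seek = bytes_io.read()

# Decode once and wrap in a StringIO if a text stream is needed
string_io = io.StringIO(whole.decode("utf-8"))
```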
anthony sottile
  • Thanks, it did work. But instead of bytes_io.read() I used bytes_io.getvalue() as the former didn't work. – Shipra Jul 08 '14 at 03:59
  • ah yeah I assumed your `BytesIO` was at the beginning of the stream. `getvalue` I believe should work regardless where you are :) – anthony sottile Jul 08 '14 at 04:25
  • Normally you would have to call `bytes_io.seek(0)` before the read() call. As @AnthonySottile mentions, `getvalue` gets around this. – Quantum7 Dec 12 '17 at 13:26
  • This seems very inefficient: the whole file has to be loaded into memory before it can be decoded. That should work fine for small files, but not for large ones. – Serge Dec 02 '20 at 08:35
  • both of the current answers have that inefficiency -- I could probably update this with an incremental decoder answer but at this point it's not really worth my efforts – anthony sottile Dec 02 '20 at 16:30
  • @AnthonySottile This was my first Programming/Python project ever. So, it was sufficient. Your answer really helped and encouraged me to continue coding. Thank you. – Shipra Aug 02 '22 at 11:49
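
For reference, the incremental decoder approach mentioned in the comments could be sketched roughly as follows. This is not code from either answer; the function name `decode_incrementally` and the deliberately small `chunk_size` are assumptions chosen to show that multi-byte characters split across chunks are still decoded correctly.

```python
import codecs
import io

def decode_incrementally(byte_stream, encoding="utf-8", chunk_size=3):
    """Yield decoded text while reading the binary stream in small chunks.

    `decode_incrementally` and `chunk_size` are hypothetical names for this sketch.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        raw = byte_stream.read(chunk_size)
        if not raw:
            break
        # The decoder buffers any incomplete multi-byte sequence internally.
        text = decoder.decode(raw)
        if text:
            yield text
    # Flush the decoder; this raises if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

source = io.BytesIO("ÁÇÊ ok".encode("utf-8"))
decoded = "".join(decode_incrementally(source))
print(decoded)
```

Only one chunk of raw bytes is held at a time, which matters for large files or network streams rather than for an in-memory `BytesIO`.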