Convert io.BytesIO to io.StringIO to parse HTML page

Question

I'm trying to parse an HTML page that I retrieved through pyCurl, but the pyCurl WRITEFUNCTION callback delivers the page as bytes rather than as a string, so I'm unable to parse it with BeautifulSoup.

Is there any way to convert io.BytesIO to io.StringIO?

Or is there any other way to parse the HTML page?

I'm using Python 3.3.2.

does the naive approach of exhausting the `BytesIO` and then constructing a `StringIO` from the output not satisfy your constraints? — anthony sottile, Jul 04 '14 at 04:32

score 80 · Answer 1 · answered Jul 10 '18 at 03:33

The code in the accepted answer reads the entire stream into memory before decoding it. Below is a better way: wrap the byte stream in a text stream, so that the data can be decoded and read chunk by chunk.

import io

# Initialize a read buffer
buffer = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'ÁÇÊ'.encode('utf-8')
)
# Wrap the byte stream in a text stream that decodes UTF-8 as it is read
wrapper = io.TextIOWrapper(buffer, encoding='utf-8')

# Read from the buffer
print(wrapper.read())
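
Reading chunk by chunk could look like the sketch below. It uses only the standard `io` module; the sample text and the variable names are made up for illustration. Note that `read(n)` on a `TextIOWrapper` counts characters, not bytes.

```python
import io

# Hypothetical sample data: two lines of UTF-8 text with non-ASCII characters.
raw = io.BytesIO('first line ÁÇÊ\nsecond line\n'.encode('utf-8'))
wrapper = io.TextIOWrapper(raw, encoding='utf-8')

first_five = wrapper.read(5)            # up to 5 characters, not 5 bytes
rest_of_line = wrapper.readline()       # the remainder of the current line
remaining = [line for line in wrapper]  # iterate over the lines that are left

print(repr(first_five), repr(rest_of_line), remaining)
```

One thing to keep in mind: closing the wrapper (or letting it be garbage collected) also closes the underlying `BytesIO`. Calling `wrapper.detach()` first should keep the byte stream usable if it is still needed.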

kakarukeys


Could you please add an example of reading chunk by chunk? – Nairum Jun 30 '20 at 11:28
@AlexeiMarinichenko you can read up on the docs about the methods of TextIOWrapper. Try `wrapper.read(5)`, `wrapper.readline()`. – kakarukeys Jul 01 '20 at 14:17

score 29 · Accepted Answer · answered Jul 04 '14 at 04:35

A naive approach:

# Assume bytes_io is a `BytesIO` object positioned at the start of the stream
# (read() returns only the data from the current position onward)
byte_str = bytes_io.read()

# Decode the bytes to a text (`str`) object
text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect

# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need
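
Whether `read()` returns anything depends on the current stream position. The sketch below assumes the buffer was just written to (as a download callback presumably would do), so the position sits at the end; the HTML snippet and variable names are invented for illustration.

```python
import io

bytes_io = io.BytesIO()
bytes_io.write('<p>ÁÇÊ</p>'.encode('utf-8'))  # position is now at the end

at_end = bytes_io.read()      # nothing is left after the current position
whole = bytes_io.getvalue()   # entire buffer, regardless of position

bytes_io.seek(0)              # rewind, then read() sees everything again
after_seek = bytes_io.read()

# Decode once and wrap in a StringIO if a text stream is required.
text_io = io.StringIO(whole.decode('utf-8'))
```

Either `getvalue()` or `seek(0)` followed by `read()` yields the full contents; `getvalue()` simply avoids having to track the position.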

anthony sottile


Thanks, it did work. But instead of bytes_io.read() I used bytes_io.getvalue() as the former didn't work. – Shipra Jul 08 '14 at 03:59
ah yeah I assumed your `BytesIO` was at the beginning of the stream. `getvalue` I believe should work regardless where you are :) – anthony sottile Jul 08 '14 at 04:25
Normally you would have to call `bytes_io.seek(0)` before the read() call. As @AnthonySottile mentions, `getvalue` gets around this. – Quantum7 Dec 12 '17 at 13:26
This seems very inefficient: we need to load the whole file into memory in order to decode it. That should work well for small files, but not for large ones. – Serge Dec 02 '20 at 08:35
both of the current answers have that inefficiency -- I could probably update this with an incremental decoder answer but at this point it's not really worth my efforts – anthony sottile Dec 02 '20 at 16:30
@AnthonySottile This was my first Programming/Python project ever. So, it was sufficient. Your answer really helped and encouraged me to continue coding. Thank you. – Shipra Aug 02 '22 at 11:49
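
For reference, the incremental decoder idea mentioned in the comments could be sketched roughly as follows. This is not code from either answer; the helper name `iter_decoded`, the chunk size, and the sample data are assumptions chosen for illustration. It relies on `codecs.getincrementaldecoder` from the standard library, which buffers incomplete multi-byte sequences between chunks.

```python
import codecs
import io


def iter_decoded(byte_stream, encoding='utf-8', chunk_size=4096):
    """Yield decoded text from a binary stream, one chunk at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = byte_stream.read(chunk_size)
        if not chunk:
            break
        # Bytes of a character split across chunks are held back by the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; with strict error handling this raises on truncated input.
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail


# A tiny chunk size forces multi-byte characters to be split across reads.
source = io.BytesIO('ÁÇÊ and more text'.encode('utf-8'))
pieces = list(iter_decoded(source, chunk_size=3))
decoded = ''.join(pieces)
```

Only one chunk of bytes is held in memory at a time, so this shape may suit large inputs better than decoding the whole buffer at once.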
