I am using Google Protocol Buffers and Python to decode some large data files (about 200 MB each). The code below shows how to decode a delimited stream, and it works just fine. However, it uses the read() call, which loads the whole file into memory and then iterates over it.
import feed_pb2 as sfeed
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read()  # PROBLEM: loads the entire file into memory
    n = 0
    while n < len(buf):
        # Decode the varint length prefix, then slice out the message body.
        msg_len, new_pos = _DecodeVarint32(buf, n)
        n = new_pos
        msg_buf = buf[n:n + msg_len]
        n += msg_len
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(msg_buf)
        # do something with read_row
        print(read_row)
Note that this code comes from another SO post, but I don't remember the exact URL. I was wondering: is there a readlines() equivalent for protocol buffers that lets me read in one delimited message at a time and decode it? I basically want a pipeline that is not limited by the RAM available to load the file.
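For reference, here is roughly the kind of incremental reader I have in mind. This is only a sketch: `read_delimited` and `varint` are placeholder names I made up, and it decodes the varint length prefix by hand so the snippet doesn't depend on protobuf internals. In real use I would call `sfeed.standard_feed().ParseFromString(payload)` on each yielded payload instead of printing it.

```python
import io

def varint(n):
    """Encode a non-negative int as a protobuf base-128 varint (for the demo)."""
    out = b''
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out += bytes([low | 0x80])  # continuation bit set
        else:
            return out + bytes([low])   # final byte

def read_delimited(f):
    """Yield one raw message payload at a time from a varint-delimited stream."""
    while True:
        msg_len = 0
        shift = 0
        while True:
            byte = f.read(1)
            if not byte:
                return              # clean EOF before a new message
            msg_len |= (byte[0] & 0x7F) << shift
            shift += 7
            if byte[0] & 0x80 == 0:
                break               # last byte of the varint prefix
        yield f.read(msg_len)       # only this one message is held in memory

# Demo with an in-memory stream; with a real file you would use
#   with open('/home/working/data/feed.pb', 'rb') as f:
stream = io.BytesIO()
for payload in (b'first message', b'second message'):
    stream.write(varint(len(payload)))
    stream.write(payload)
stream.seek(0)
for payload in read_delimited(stream):
    print(payload)
```

The point is that the file handle is only ever asked for the prefix bytes and one message body at a time, so memory use is bounded by the largest single message rather than the file size.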
It seems there was a pystream-protobuf package that supported some of this functionality, but it has not been updated in a year or two. There is also a post from 7 years ago that asked a similar question, but I was wondering whether there is any new information since then:
python example for reading multiple protobuf messages from a stream