Based on the answer to this question, I suspect I'm reading my .pb file with a "faulty decoder".

This is the data I'm trying to decode.

This is my .proto file.

Based on the ListPeople.java example provided in the Java tutorial documentation, I tried to write something similar to start picking apart that data. This is what I wrote:

import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document;
import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document.Sentence;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;


public class ListDocument
{
    // Iterates through all sentences in the Document and prints their tokens.
    static void Print(Document document)
    {
        for ( Sentence sentence: document.getSentencesList() )
        {
            for(int i=0; i < sentence.getTokensCount(); i++)
            {
                System.out.println(" getTokens(" + i + ": " + sentence.getTokens(i) );
            }
        }
    }

    // Main function: reads the entire Document from a file and prints all
    //   the information inside.
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage:  ListPeople ADDRESS_BOOK_FILE");
            System.exit(-1);
        }

        // Read the Document from the given file.
        Document document =
                Document.parseFrom(new FileInputStream(args[0]));

        Print(document);
    }
}

But when I run it, I get this error:

Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
    at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
    at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:194)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:210)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:215)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at cc.refectorie.proj.relation.protobuf.DocumentProtos$Document.parseFrom(DocumentProtos.java:4770)
    at ListDocument.main(ListDocument.java:40)

So, as I said above, I think this has to do with me not defining the decoder properly. Is there some way to look at the .proto file I'm trying to use and figure out a way to just read off all that data?

Is there some way to look at that .proto file and see what I'm doing wrong?

These are the first few lines of the file I want to read:

Ü
&/guid/9202a8c04000641f8000000003221072&/guid/9202a8c04000641f80000000004cfd50NA"Ö

S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1850511.xml.pb„€€€øÿÿÿÿƒ€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"`str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"]str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Rstr:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON"Adep:[NMOD]->|PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Sstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Pstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Estr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON*ŒThe occasion was suitably exceptional : a reunion of the 1970s-era Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums ."¬
S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1849689.xml.pb†€€€øÿÿÿÿ…€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"cstr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"`str:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Ustr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON"Cdep:[NMOD]->|PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Vstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Sstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Hstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON*ÊTonight he brings his energies and expertise to the Miller Theater for the festival 's thrilling finale : a reunion of the 1970s Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums .â
&/guid/9202a8c04000641f80000000004cfd50&/guid/9202a8c04000641f8000000003221072NA"Ù

EDIT


This is a file that another researcher used to parse these files (so I was told). Is it possible that I could use this?

package edu.stanford.nlp.kbp.slotfilling.multir;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;

import edu.stanford.nlp.kbp.slotfilling.classify.MultiLabelDataset;
import edu.stanford.nlp.kbp.slotfilling.common.Log;
import edu.stanford.nlp.kbp.slotfilling.multir.DocumentProtos.Relation;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.util.ErasureUtils;
import edu.stanford.nlp.util.HashIndex;
import edu.stanford.nlp.util.Index;

/**
 * Converts Hoffmann's data in protobuf format to our MultiLabelDataset
 * @author Mihai
 *
 */
public class ProtobufToMultiLabelDataset {
  static class RelationAndMentions {
    String arg1;
    String arg2;
    Set<String> posLabels;
    Set<String> negLabels;
    List<Mention> mentions;

    public RelationAndMentions(String types, String a1, String a2) {
      arg1 = a1;
      arg2 = a2;
      String [] rels = types.split(",");
      posLabels = new HashSet<String>();
      for(String r: rels){
        if(! r.equals("NA")) posLabels.add(r.trim());
      }
      negLabels = new HashSet<String>(); // will be populated later
      mentions = new ArrayList<Mention>();
    }
  };

  static class Mention {
    List<String> features;
    public Mention(List<String> feats) {
      features = feats;
    }
  }

  public static void main(String[] args) throws Exception {
    String input = args[0];

    InputStream is = new GZIPInputStream(
        new BufferedInputStream(
            new FileInputStream(input)));

    toMultiLabelDataset(is);
    is.close();
  }

  public static MultiLabelDataset<String, String> toMultiLabelDataset(InputStream is) throws IOException {
    List<RelationAndMentions> relations = toRelations(is, true);
    MultiLabelDataset<String, String> dataset = toDataset(relations);
    return dataset;
  }

  public static void toDatums(InputStream is,
      List<List<Collection<String>>> relationFeatures,
      List<Set<String>> labels) throws IOException {
    List<RelationAndMentions> relations = toRelations(is, false);
    toDatums(relations, relationFeatures, labels);
  }

  private static void toDatums(List<RelationAndMentions> relations,
      List<List<Collection<String>>> relationFeatures,
      List<Set<String>> labels) {
    for(RelationAndMentions rel: relations) {
      labels.add(rel.posLabels);
      List<Collection<String>> mentionFeatures = new ArrayList<Collection<String>>();
      for(int i = 0; i < rel.mentions.size(); i ++){
        mentionFeatures.add(rel.mentions.get(i).features);
      }
      relationFeatures.add(mentionFeatures);
    }
    assert(labels.size() == relationFeatures.size());
  }

  public static List<RelationAndMentions> toRelations(InputStream is, boolean generateNegativeLabels) throws IOException {
    //
    // Parse the protobuf
    //
    // all relations are stored here
    List<RelationAndMentions> relations = new ArrayList<RelationAndMentions>();
    // all known relations (without NIL)
    Set<String> relTypes = new HashSet<String>();
    Map<String, Map<String, Set<String>>> knownRelationsPerEntity =
      new HashMap<String, Map<String,Set<String>>>();
    Counter<Integer> labelCountHisto = new ClassicCounter<Integer>();
    Relation r = null;
    while ((r = Relation.parseDelimitedFrom(is)) != null) {
      RelationAndMentions relation = new RelationAndMentions(
          r.getRelType(), r.getSourceGuid(), r.getDestGuid());
      labelCountHisto.incrementCount(relation.posLabels.size());
      relTypes.addAll(relation.posLabels);
      relations.add(relation);

      for(int i = 0; i < r.getMentionCount(); i ++) {
        DocumentProtos.Relation.RelationMentionRef mention = r.getMention(i);
        // String s = mention.getSentence();
        relation.mentions.add(new Mention(mention.getFeatureList()));
      }

      for(String l: relation.posLabels) {
        addKnownRelation(relation.arg1, relation.arg2, l, knownRelationsPerEntity);
      }
    }
    Log.severe("Loaded " + relations.size() + " relations.");
    Log.severe("Found " + relTypes.size() + " relation types: " + relTypes);
    Log.severe("Label count histogram: " + labelCountHisto);

    Counter<Integer> slotCountHisto = new ClassicCounter<Integer>();
    for(String e: knownRelationsPerEntity.keySet()) {
      slotCountHisto.incrementCount(knownRelationsPerEntity.get(e).size());
    }
    Log.severe("Slot count histogram: " + slotCountHisto);
    int negativesWithKnownPositivesCount = 0, totalNegatives = 0;
    for(RelationAndMentions rel: relations) {
      if(rel.posLabels.size() == 0) {
        if(knownRelationsPerEntity.get(rel.arg1) != null &&
           knownRelationsPerEntity.get(rel.arg1).size() > 0) {
          negativesWithKnownPositivesCount ++;
        }
        totalNegatives ++;
      }
    }
    Log.severe("Found " + negativesWithKnownPositivesCount + "/" + totalNegatives +
        " negative examples with at least one known relation for arg1.");

    Counter<Integer> mentionCountHisto = new ClassicCounter<Integer>();
    for(RelationAndMentions rel: relations) {
      mentionCountHisto.incrementCount(rel.mentions.size());
      if(rel.mentions.size() > 100)
        Log.fine("Large relation: " + rel.mentions.size() + "\t" + rel.posLabels);
    }
    Log.severe("Mention count histogram: " + mentionCountHisto);

    //
    // Detect the known negatives for each source entity
    //
    if(generateNegativeLabels) {
      for(RelationAndMentions rel: relations) {
        Set<String> negatives = new HashSet<String>(relTypes);
        negatives.removeAll(rel.posLabels);
        rel.negLabels = negatives;
      }
    }

    return relations;
  }

  private static MultiLabelDataset<String, String> toDataset(List<RelationAndMentions> relations) {
    int [][][] data = new int[relations.size()][][];
    Index<String> featureIndex = new HashIndex<String>();
    Index<String> labelIndex = new HashIndex<String>();
    Set<Integer> [] posLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);
    Set<Integer> [] negLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);

    int offset = 0, posCount = 0;
    for(RelationAndMentions rel: relations) {
      Set<Integer> pos = new HashSet<Integer>();
      Set<Integer> neg = new HashSet<Integer>();
      for(String l: rel.posLabels) {
        pos.add(labelIndex.indexOf(l, true));
      }
      for(String l: rel.negLabels) {
        neg.add(labelIndex.indexOf(l, true));
      }
      posLabels[offset] = pos;
      negLabels[offset] = neg;
      int [][] group = new int[rel.mentions.size()][];
      for(int i = 0; i < rel.mentions.size(); i ++){
        List<String> sfeats = rel.mentions.get(i).features;
        int [] features = new int[sfeats.size()];
        for(int j = 0; j < sfeats.size(); j ++) {
          features[j] = featureIndex.indexOf(sfeats.get(j), true);
        }
        group[i] = features;
      }
      data[offset] = group;
      posCount += posLabels[offset].size();
      offset ++;
    }

    Log.severe("Creating a dataset with " + data.length + " datums, out of which " + posCount + " are positive.");
    MultiLabelDataset<String, String> dataset = new MultiLabelDataset<String, String>(
        data, featureIndex, labelIndex, posLabels, negLabels);
    return dataset;
  }

  private static void addKnownRelation(String arg1, String arg2, String label,
      Map<String, Map<String, Set<String>>> knownRelationsPerEntity) {
    Map<String, Set<String>> myRels = knownRelationsPerEntity.get(arg1);
    if(myRels == null) {
      myRels = new HashMap<String, Set<String>>();
      knownRelationsPerEntity.put(arg1, myRels);
    }
    Set<String> mySlots = myRels.get(label);
    if(mySlots == null) {
      mySlots = new HashSet<String>();
      myRels.put(label, mySlots);
    }
    mySlots.add(arg2);
  }
}
  • Well what generated the file you're passing in? Can you provide a *small* file which demonstrates the problem? The tgz you've linked to contains 4 separate files - which of them is causing the issue? (The code you've got looks fine - other than the error message and the method name - looks fine.) – Jon Skeet Apr 09 '15 at 07:38
  • When you say 'what generated' you're alluding to a file like the AddPerson.java file from the example isn't it? I have to say I don't really know, since this is data from a research paper, the results of which I'm trying to replicate. I just posted the text of the .pb file I'm trying to read, the first few lines of it anyway. – smatthewenglish Apr 09 '15 at 07:48
  • No, I'm referring to the file you're passing in. A proto data file isn't text, it's binary data... but if you don't know what generated it, it's hard to know whether or not the code is right. (You're trying to parse the whole file as a single `Document` object... is that what you expected?) – Jon Skeet Apr 09 '15 at 08:06
  • yeah, the .pb file is the one I'm passing in, which is what I've posted there, the first few lines of it. and the file that generated it is the equivalent of the AddPerson.java file from the documentation, but I don't have that file. but I suppose I would more or less be able to tell what's in it based on that .proto file, isn't it? – smatthewenglish Apr 09 '15 at 08:17
  • Not very easily. If I were you, I'd contact whoever is supplying the data to ask for more details about the file generator. – Jon Skeet Apr 09 '15 at 08:19
  • But that's why I'm just trying to read one simple thing from it, just to make it work; I should be able to do that, shouldn't I? But in fact it's not working. But you say the file looks good; can you think of a possible reason why I might be getting that error? – smatthewenglish Apr 09 '15 at 08:20
  • No, I said the source looks fine *if* the file is written out as a single `Document` message. I have no idea whether or not that's the case, other than to suspect it's not due to the error you're getting. Again, I'd recommend you get in touch with the source of the data. – Jon Skeet Apr 09 '15 at 08:21
  • @S.Matthew_English To find your problem we need to see the code that _writes_ the data, in addition to the code that reads it. It doesn't help to paste your raw data into the question like you have because the data is not text, it's binary. You can't put binary data into text (like a StackOverflow post); you'll lose information when you do. So we can't tell anything from the data you pasted. (This is the most common problem, BTW: people take binary protobufs and try to put them into Strings or other text, which corrupts the data.) – Kenton Varda Apr 09 '15 at 08:47
  • Which of the 4 files are you trying to process? testNegative.pb? testPositive.pb? trainNegative.pb? trainPositive.pb? – Marc Gravell Apr 09 '15 at 08:56
  • @MarcGravell well, ideally I'd like to look at all of them – smatthewenglish Apr 09 '15 at 09:06
  • @KentonVarda The code that reads it, or rather doesn't read it, but that I am trying to make read it is the first big block of code I posted. For the time being I don't have access to the code that wrote the binary but I do have the .proto file, isn't it possible to use that to write a small program to read out some of the data? – smatthewenglish Apr 09 '15 at 09:08

2 Answers


Updated; the confusion here comes down to two points:

  • the root object is Relation, not Document (in fact, only Relation and RelationMentionRef are even used)
  • the pb file is actually multiple objects, each varint-delimited, i.e. prefixed by their length expressed as a varint

As such, Relation.parseDelimitedFrom should work. Processing it manually, I get:

test-multiple.pb, 96678 Relation objects parsed
testNegative.pb, 94917 Relation objects parsed
testPositive.pb, 1950 Relation objects parsed
trainNegative.pb, 63596 Relation objects parsed
trainPositive.pb, 4700 Relation objects parsed
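
For reference, reading the files from generated Java code is then just a loop over parseDelimitedFrom, which consumes one length prefix plus one message per call and returns null at end of stream. A minimal sketch, assuming the generated DocumentProtos class the question imports also contains the Relation message (as the Stanford code in the question's edit suggests):

import cc.refectorie.proj.relation.protobuf.DocumentProtos.Relation;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class ListRelations
{
    public static void main(String[] args) throws Exception
    {
        try (InputStream is = new BufferedInputStream(new FileInputStream(args[0])))
        {
            int count = 0;
            // Each message is prefixed with its length encoded as a varint;
            // parseDelimitedFrom returns null once the stream is exhausted.
            while (Relation.parseDelimitedFrom(is) != null)
            {
                count++;
            }
            System.out.println(count + " Relation objects parsed");
        }
    }
}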

Old; outdated; exploratory:

I extracted your 4 documents and ran them through a little test rig:

        ProcessFile("testNegative.pb");
        ProcessFile("testPositive.pb");
        ProcessFile("trainNegative.pb");
        ProcessFile("trainPositive.pb");

where ProcessFile first dumps the first 10 bytes as hex, and then tries to process the file via a ProtoReader. Here are the results:

Processing: testNegative.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt

Yep; agreed; DC is wire-type 4 (end-group), field 27; your document does not define field 27, and even if it did: it is meaningless to start with an end-group.
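
(A protobuf tag byte packs the field number and wire type as (field_number << 3) | wire_type, so that first byte decodes like this:)

int b = 0xDC;                // first byte of testNegative.pb
int fieldNumber = b >>> 3;   // 27
int wireType = b & 0x7;      // 4 = end-group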

Processing: testPositive.pb
d5 0f 0a 26 2f 67 75 69 64 2f
> Document
250: Fixed32, Unexpected field
14: Fixed32, Unexpected field
6: String, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt

Here we can't see the offending data in the hex dump, but again: the initial fields look nothing like your data, and the reader readily confirms that the data is corrupt.

Processing: trainNegative.pb
d1 09 0a 26 2f 67 75 69 64 2f
> Document
154: Fixed64, Unexpected field
7: Fixed64, Unexpected field
6: Variant, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt

Same as above.

Processing: trainPositive.pb
cf 75 0a 26 2f 67 75 69 64 2f
> Document
1881: 7, Unexpected field
Invalid wire-type; this usually means you have over-written a file without truncating or setting the length; see http://stackoverflow.com/q/2152978/23354

CF 75 is a two-byte varint with wire-type 7 (which is not defined in the specification).
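
Decoding that tag varint by hand shows exactly where the "1881: 7" line above comes from:

int b0 = 0xCF, b1 = 0x75;            // the two tag bytes
int tag = (b0 & 0x7F) | (b1 << 7);   // varint decode: 79 + 117 * 128 = 15055
int fieldNumber = tag >>> 3;         // 1881
int wireType = tag & 0x7;            // 7 -- not a valid wire-type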

Your data is well and truly garbage. Sorry.


And with the bonus round of test-multiple.pb from comments (after gz decompression):

Processing: test-multiple.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt

This starts identically to testNegative.pb, and hence fails for exactly the same reason.

Marc Gravell
  • Hmm, that's very interesting and potentially headache inducing. Nevertheless thank you very much. Is it possible that perhaps you could also please check [this data](http://www.ftpstatus.com/file_properties.php?sname=cs.nyu.edu&fid=19)? – smatthewenglish Apr 09 '15 at 09:14
  • @S.Matthew_English sure; downloading now – Marc Gravell Apr 09 '15 at 09:14
  • @S.Matthew_English btw; I also tried running the other files through gz decompress in case they had just been named confusingly, but the "magic numbers" were wrong, so they aren't gz data. – Marc Gravell Apr 09 '15 at 09:23
  • Did anyone check if the data is perhaps in delimited format, i.e. a stream of messages prefixed with varint sizes? @S.Matthew_English try using `parseDelimitedFrom()` instead of `parseFrom()`. This is just a guess, but it's fairly common. – Kenton Varda Apr 09 '15 at 09:24
  • if the `filename` values are things like `/guid/9202a8c04000641f8000000003221072`, then @KentonVarda is correct about the varint length prefix; I'm getting more problems deeper, but looking... – Marc Gravell Apr 09 '15 at 09:28
  • @S.Matthew_English see ^^^ – Marc Gravell Apr 09 '15 at 09:28
  • I just added an edit to my original question with some code that someone told me they used to parse the data – smatthewenglish Apr 09 '15 at 09:29
  • @S.Matthew_English so, yeah, the root object is `Relation` not `Document`, and it is varint-prefixed. Just finishing up tweaks to process that... – Marc Gravell Apr 09 '15 at 09:37
  • @KentonVarda thank you for your suggestion. I just tried that, so before I got the error message `Protocol message end-group tag did not match expected tag.`; now, when using `parseDelimitedFrom()`, I get `Protocol message tag had invalid wire type.`. Have you ever gotten such an error message before? – smatthewenglish Apr 09 '15 at 09:37
  • @S.Matthew_English you need to change to parsing Relation, not Document; the only two entities used in that data are Relation and RelationMentionRef; if I process the data as delimited `Relation` objects, it works fine – Marc Gravell Apr 09 '15 at 09:43
  • @MarcGravell please forgive me, I'm very new to protocol buffers. I'm sort of unclear about what you have done. Is it that you wrote a small script to parse it? How did you get the idea to look for `relation` instead of `document`, from that code I posted in the edit? – smatthewenglish Apr 09 '15 at 09:44
  • @S.Matthew_English I wrote C# rig to parse the data as though the root object is `Relation` and each item is varint prefixed - the same as `Relation.parseDelimitedFrom` does - but manually rather than from a .proto: see http://pastie.org/10082095; results are here http://pastie.org/10082097 – Marc Gravell Apr 09 '15 at 09:50
  • @S.Matthew_English see edit, btw, where I've summarised at the top – Marc Gravell Apr 09 '15 at 09:53
  • @MarcGravell wow that's fantastic. I was so scared that the data was all turned to crap; now I feel so relieved. So what I'm planning to do is change that code into its Java equivalent, because I'm not so skillful in C#, and then just try to iteratively drill down until I get all of the content of those files into human-readable form. Do you think that should work? – smatthewenglish Apr 09 '15 at 09:59
  • @MarcGravell this is sort of tangential, but I have a document which I've been told is what those files will look like, more or less, once they are extracted from the .pb encoding. Is it possible, using what we have discovered, to serialize that data? – smatthewenglish Apr 09 '15 at 10:00
  • @S.Matthew_English all you should need is `Relation.parseDelimitedFrom`, if you're using generated code. I can't advise on the C# <===> Java, because I do virtually no Java ;p Your last comment is a little vague and ambiguous, so all I can say is "probably"... – Marc Gravell Apr 09 '15 at 10:08

I know it's been over two years, but here is a general way to read these delimited protocol buffers in Python. The function mentioned above, parseDelimitedFrom, is not available in the Python implementation of protocol buffers, but here is a small workaround for whoever might need it. This code is an adaptation of the code found at: https://www.datadoghq.com/blog/engineering/protobuf-parsing-in-python/

def read_several_pbfs(filename, class_of_pb):
    result = []
    with open(filename, 'rb') as f:
        buf = f.read()
        n = 0
        while n < len(buf):
            # each message is prefixed by its length, encoded as a varint
            msg_len, new_pos = _DecodeVarint32(buf, n)
            n = new_pos
            msg_buf = buf[n:n + msg_len]
            n += msg_len
            read_data = class_of_pb()
            read_data.ParseFromString(msg_buf)
            result.append(read_data)
    return result

And a usage example using one of the OP's files:

import Document_pb2
from google.protobuf.internal.decoder import _DecodeVarint32

filename = "trainPositive.pb"
relations = read_several_pbfs(filename, Document_pb2.Relation)
syats