We are trying to generate new code using ANTLR. We have a grammar file that recognizes pretty much everything. Our problem is that we want to recreate the code from the tokens we generate and write it to a new file.

We have a .txt file with our tokens that looks like this:

[@0,0:6='       ',<75>,channel=1,1:0]
[@1,7:20='IDENTIFICATION',<6>,1:7]
[@2,21:21=' ',<75>,channel=1,1:21]
[@3,22:29='DIVISION',<4>,1:22]
[@4,30:30='.',<3>,1:30]
[@5,31:40='\n       \t ',<75>,channel=1,1:31]
[@6,41:50='PROGRAM-ID',<16>,2:9]
[@7,51:51='.',<3>,2:19]
[@8,52:52=' ',<75>,channel=1,2:20]
[@9,53:59='testpro',<76>,2:21]
[@10,60:60='.',<3>,2:28]
[@11,61:70='\n       \t ',<75>,channel=1,2:29]
[@12,71:76='AUTHOR',<31>,3:9]
[@13,77:77='.',<3>,3:15]

Or is there another way to recreate the original code from the tokens?

Thanks in advance, Viktor

  • Iterate over the tokens and dump the token text? – Lucas Trzesniewski Apr 14 '15 at 09:28
  • One solution we have come up with is to go through the .txt file shown above, split out the relevant data, and save it to a new file that will be our new code. But the question is: is there another way to do this? Because this way isn't using tokens. – Viktor Persson Apr 14 '15 at 09:36
  • Why would you use the .txt file at all? You can have the token stream from the lexer directly. – Lucas Trzesniewski Apr 14 '15 at 09:41
  • Our idea is that parsing and "rebuilding" the code should be two steps, independent of each other. The example code we are using is huge and it takes some time to parse. When the parsing is done, we want to be able to save the result, even send the data to another computer, and rebuild the code without parsing it again. So we need to save the tokens to the hard drive and be able to open them again on the same or another computer. – Viktor Persson Apr 14 '15 at 09:47
  • You need to add information to your question, using the edit link under the question, rather than just leaving it in the comments. Can you also fully describe the process? You take an old program, parse it, and then...? How big is big? What COBOL and OS (or COBOLs and OSes)? – Bill Woodger Apr 14 '15 at 09:49
  • I see. Then I suppose you have to somehow save the whole parse tree. And the format is up to you. Everything you need is in the parse tree (except for off-channel and skipped tokens). – Lucas Trzesniewski Apr 14 '15 at 09:50
  • Lucas: Yeah, that's the plan. The problem is saving the tokens. As shown above, we save them like that, but why bother creating tokens again when we can just use the strings to rebuild our code? It feels like a cheap way, since we aren't using "tokens"; we can just take the code "data" and make the code. Shouldn't there be a better way? We are using ANTLR v4, and there are some better ways in v3, but I see no better solution than ours for getting the code back from tokens; it just strikes me as a bad choice. – Viktor Persson Apr 14 '15 at 10:06
  • Why don't you just copy the COBOL code from source to target? – Bill Woodger Apr 15 '15 at 12:30
  • I have a hard time believing that parsing even a big COBOL file with ANTLR4 is "slow". Can OP provide us with some evidence, and how he defines that performance as "slow"? If it *isn't* slow, this OP has simply created a problem for himself, and @BillWoodger's solution is fine. – Ira Baxter Apr 20 '15 at 07:55
  • In particular, OP is likely to discover that reparsing the source is as fast as serializing the tokens or the tree, and then "parsing" those. – Ira Baxter Apr 20 '15 at 08:05
  • This all kind of begs the key question: why does OP insist on processing the COBOL program in two stages? Why not parse it on the first machine, do whatever is needed on the 2nd machine right there? Then the OP's problem of copying the parse goes away. – Ira Baxter Apr 20 '15 at 08:06
  • It appears that OP wants to regenerate the source simply to hand it to the second machine to parse. *If* his first stage processing *modifies* the tree or the token sequence, then regeneration isn't so easy. See this SO answer for how to regenerate source text from trees: http://stackoverflow.com/questions/5832412/compiling-an-ast-back-to-source-code/5834775#5834775 (I've done this with full IBM Enterprise COBOL). – Ira Baxter Apr 20 '15 at 08:09

1 Answer

The most straightforward way to make the lexer output portable is to serialize the tokenized output of the lexer for transport and storage. You could equally serialize the entire parser-generated parse tree. In either case, you will be capturing the full text of the source input.
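As a rough illustration of that last point: the dump shown in the question already holds every character of the source, hidden-channel whitespace included, so concatenating the token texts in order gives the file back. The sketch below is plain Java with no ANTLR dependency. The class name `TokenDumpRebuilder` and its methods are made up for this example, and the regular expression assumes the dump lines have exactly the shape shown in the question. One caveat: as far as I can tell, that dump format escapes newlines, carriage returns and tabs as `\n`, `\r`, `\t` without escaping backslashes, so a literal backslash followed by `n` in the source cannot be told apart from a newline. It also only works if the lexer keeps whitespace on a hidden channel rather than skipping it.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rebuilds source text from dump lines shaped like
// [@5,31:40='\n       \t ',<75>,channel=1,1:31]
class TokenDumpRebuilder {
    // Groups: 1=index, 2=start, 3=stop, 4=escaped text, 5=type,
    // 6=channel (optional), 7=line, 8=column.
    private static final Pattern TOKEN_LINE = Pattern.compile(
        "\\[@(\\d+),(\\d+):(\\d+)='(.*)',<([^>]*)>(?:,channel=(\\d+))?,(\\d+):(\\d+)\\]");

    // Reverses the \n, \r, \t escaping seen in the dump (lossy, see caveat above).
    static String unescape(String text) {
        return text.replace("\\n", "\n").replace("\\r", "\r").replace("\\t", "\t");
    }

    // Concatenates the text of every token, in order, skipping EOF.
    static String rebuild(List<String> dumpLines) {
        StringBuilder source = new StringBuilder();
        for (String line : dumpLines) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines in the dump file
            }
            Matcher m = TOKEN_LINE.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unrecognized token line: " + line);
            }
            if (m.group(5).equals("-1")) {
                continue; // EOF token carries no source text
            }
            int start = Integer.parseInt(m.group(2));
            int stop = Integer.parseInt(m.group(3));
            String text = unescape(m.group(4));
            // Sanity check: the unescaped text must span exactly start..stop.
            if (text.length() != stop - start + 1) {
                throw new IllegalStateException("Length mismatch in: " + line);
            }
            source.append(text);
        }
        return source.toString();
    }
}
```

Feeding it the lines of the .txt file (for example via `java.nio.file.Files.readAllLines`) should return the original text, subject to the escaping caveat. A proper serialization of the token fields avoids that ambiguity, which is the main reason to prefer it over the `toString` dump.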

The lexer's token stream is built from a single token class, and the parse tree involves just a handful of standard classes. Consequently, the cost of serialization and deserialization is almost entirely a linear function of the size of the parsed source input.
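To make the single-class point concrete, here is a minimal sketch of a token round trip. It uses `java.io.DataOutputStream` from the standard library as a stand-in for a serialization library, purely to keep the example dependency-free. `TokenStore` and `SavedToken` are hypothetical names, and the fields are simply the ones visible in the question's dump (type, channel, line, column, text), not ANTLR's own token class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical store: one plain-data class is enough to carry a token stream.
class TokenStore {
    static final class SavedToken {
        final int type, channel, line, column;
        final String text;

        SavedToken(int type, int channel, int line, int column, String text) {
            this.type = type;
            this.channel = channel;
            this.line = line;
            this.column = column;
            this.text = text;
        }
    }

    // One pass over the tokens: a count, then five fields per token.
    // Note: writeUTF limits each string to 65535 encoded bytes.
    static byte[] write(List<SavedToken> tokens) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(tokens.size());
            for (SavedToken t : tokens) {
                out.writeInt(t.type);
                out.writeInt(t.channel);
                out.writeInt(t.line);
                out.writeInt(t.column);
                out.writeUTF(t.text);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in the same order they were written.
    static List<SavedToken> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            List<SavedToken> tokens = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                int type = in.readInt();
                int channel = in.readInt();
                int line = in.readInt();
                int column = in.readInt();
                tokens.add(new SavedToken(type, channel, line, column, in.readUTF()));
            }
            return tokens;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Regenerating unmodified source is just concatenation of token text.
    static String sourceText(List<SavedToken> tokens) {
        StringBuilder sb = new StringBuilder();
        for (SavedToken t : tokens) {
            sb.append(t.text);
        }
        return sb.toString();
    }
}
```

Both `write` and `read` make a single pass over the tokens, which is where the linear cost comes from. Because the text is stored as real characters rather than an escaped display string, whitespace and backslashes survive the round trip unchanged.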

Google Gson is a simple-to-use, relatively fast Java object serialization library.

If your parser generates some intermediate representation (IR) of the parsed source input, you could transport the IR directly, using a schema-based serialization library such as Google FlatBuffers to save and restore IR model instances.

GRosenberg
  • OP already has a way to store tokens. While this might outline "more standard" ways to do that, AFAICT he has already succeeded. It doesn't answer the key question of OP: "how to regenerate new code (from tokens)". – Ira Baxter Apr 20 '15 at 07:50
  • Given the original question and clarifying comment, portability of the tokens (original Q) or parsed output (comment) is a requirement and the OP is looking for an alternative to using his initial toString representation of the tokens (which the OP recognizes as problematic and is asking for alternatives). Also, from the comment, the OP indicates that the code regen itself is not the key problem, but being able to reconstitute the parse tree without re-parsing (in preparation for regen) is the actual key problem. The answer provided accurately and appropriately addresses the OP's question. – GRosenberg Apr 21 '15 at 02:56