
I've written a program that reads in a Java file including comments and outputs the file without comments.

I consider both line comments (//) and block comments (/* */). However, I only use files that don't contain these delimiters in any other way: no string literals and no Unicode escape sequences. Can this program be called a parser? The grammar (either // followed by something, or /* followed by something and then */) is regular, right?

I am really only using switch case statements, i.e. implementing a finite state machine. There's no tree built and no stack. I thought that a program is only a parser when it deals with context free languages and at least has a stack, i.e. implements a pushdown automaton. But I have the feeling that the term parser is used rather freely.
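For illustration, a state machine of the kind described above might look like the following sketch. It is not the original program: the class name `CommentStripper`, the method `strip`, and the state names are assumptions made up for this example, and it relies on the same restriction that `/` and `*` never appear inside string literals.

```java
// Hypothetical sketch of a comment stripper written as a finite state machine.
// All names here are illustrative assumptions, not the asker's code.
final class CommentStripper {
    private enum State { CODE, SLASH, LINE_COMMENT, BLOCK_COMMENT, BLOCK_STAR }

    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (state) {
                case CODE:
                    // A slash might start a comment; hold it back for one step.
                    if (c == '/') state = State.SLASH;
                    else out.append(c);
                    break;
                case SLASH:
                    if (c == '/') state = State.LINE_COMMENT;
                    else if (c == '*') state = State.BLOCK_COMMENT;
                    else { out.append('/').append(c); state = State.CODE; }
                    break;
                case LINE_COMMENT:
                    // The newline ends the comment and is kept in the output.
                    if (c == '\n') { out.append(c); state = State.CODE; }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*') state = State.BLOCK_STAR;
                    break;
                case BLOCK_STAR:
                    // "*/" closes the comment; further stars stay in this state.
                    if (c == '/') state = State.CODE;
                    else if (c != '*') state = State.BLOCK_COMMENT;
                    break;
            }
        }
        // A trailing lone slash was held back and never emitted.
        if (state == State.SLASH) out.append('/');
        return out.toString();
    }
}
```

Calling `CommentStripper.strip(text)` returns the text with comments removed. The only memory the machine has is the current state, which is the point of the question: there is no stack and no tree.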

To clarify: I'm not looking for ways to get this program to work with any Java file; I'm just interested in the correct terminology.

martijnn2008
user3813234
  • Please [edit] your question and include some code. –  May 15 '15 at 10:38
  • Read these: http://stackoverflow.com/questions/2933192/whats-the-best-way-to-explain-parsing-to-a-new-programmer Hope it will clear your doubts. – Akash Rajbanshi May 15 '15 at 10:49
  • Are you considering matches of /* or // inside strings? What if your file contains something like System.out.println("Am I Evil ? /* Yes I am */"); . Btw, thanks for making me remember the good old lex and yacc times – BigMike May 15 '15 at 10:49
  • "To parse" is indeed used extremely loosely to mean "to process source files", and a program to strip comments definitely qualifies under that loosest definition. But it would be more apt to call it a "preprocessor" or "lexer", since in C, C++ and other languages of that family, the first thing to run is a preprocessor that, amongst other things, strips comments. As for "lexer", your program splits the source file into tokens, one type of which is line and block comments, but doesn't go beyond that; so it's also quite appropriate to say that you "lex" the source file. – Iwillnotexist Idonotexist May 15 '15 at 10:49
  • @IwillnotexistIdonotexist +1, lexing a file is a bit more complex than parsing it. – BigMike May 15 '15 at 10:51
  • 1
    for pointing out the difference between lexing a file (according to its grammar) and simply parsing some text. I don't agree that a simple comment stripper can be considered a lexer though, but highlighting the difference between parsing and lexing is probably the best hint/help for OP. – BigMike May 15 '15 at 10:55
  • Recall that in C# we have `int.Parse()`, `double.Parse()` etc (they are not even parsing a file, just parsing a given string), I think your program is definitely a "parser" – Earth Engine May 15 '15 at 10:56
  • @BigMike: "Lexing a file is (...) more complex than parsing it"? Not for standard interpretations of lexing and parsing. You have it backwards. – Ira Baxter May 15 '15 at 12:45
  • @EarthEngine: "parse" used sloppily means "read for content". The compiler field (the source of standard definitions of these in computing) uses the term much more narrowly, to mean "discover the structure of a source file in great detail", usually driven by some context-free grammar somewhere. C#'s "parse" applied to digit strings is IMHO a sloppy use; it would have been better to call those functions "convert". In the same sloppy way, OP's program could be called a parser. It doesn't serve any purpose to do so. – Ira Baxter May 15 '15 at 12:47

1 Answer

No. Removing comments from Java code needs only a regular expression (equivalently, a finite state automaton), so the program can't be called a "parser". A DFA (deterministic finite automaton) is an important component of a programming language compiler, because early steps such as comment removal and the recognition of identifiers (variable, function and class names) can be done with DFAs. In fact, compiler developers widely use the lex tool (a DFA generator) to implement language-specific DFAs; for example, the DFAs for comment recognition in C and C++ are different.
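To make the "only a regular expression" point concrete, here is a minimal sketch that strips both comment forms with a single pattern from `java.util.regex`. The class name `RegexCommentStripper` is an assumption for this example, the pattern deliberately avoids backreferences and lookarounds so that it stays a regular expression in the formal sense, and like the question it assumes the delimiters never occur inside string literals.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: comment removal with one regular expression.
final class RegexCommentStripper {
    // Line comment: "//" followed by anything up to (not including) the newline.
    // Block comment: "/*", then text and stars, closed by the first "*/".
    private static final Pattern COMMENT = Pattern.compile(
            "//[^\\n]*" + "|" + "/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");

    static String strip(String source) {
        return COMMENT.matcher(source).replaceAll("");
    }
}
```

`RegexCommentStripper.strip(text)` returns the text without comments. A tool such as lex would turn a pattern like this into a DFA ahead of time instead of interpreting it at run time.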

The next step is to generate intermediate code from the high-level source code. That requires context-free grammars. It is common to use a shift-reduce parser to build an annotated parse tree for the code. The most common tool used for this task is yacc.
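As a rough illustration of why this step needs more than a DFA, consider checking that brackets nest correctly, which is a context-free property. The sketch below is only a stack-based recognizer, not a shift-reduce parser and not what yacc generates; the names `BracketChecker` and `isBalanced` are assumptions for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: recognizing nested brackets requires a stack,
// because the nesting depth is unbounded and a finite set of states
// cannot remember it.
final class BracketChecker {
    static boolean isBalanced(String source) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : source.toCharArray()) {
            switch (c) {
                case '(': case '[': case '{':
                    stack.push(c);              // remember the open bracket
                    break;
                case ')': case ']': case '}':
                    // Every closer must match the most recent unmatched opener.
                    if (stack.isEmpty() || !matches(stack.pop(), c)) return false;
                    break;
                default:
                    break;                      // other characters are ignored
            }
        }
        return stack.isEmpty();                 // no opener may be left over
    }

    private static boolean matches(char open, char close) {
        return (open == '(' && close == ')')
            || (open == '[' && close == ']')
            || (open == '{' && close == '}');
    }
}
```

The stack is what separates this from the comment stripper: it plays the role of the pushdown store in a pushdown automaton.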

Debasis
  • I think it would be more accurate to say that "you can build a parser by hand" http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 "or you can build a parser with a tool, of which yacc is commonly used, but see this list" http://en.wikipedia.org/wiki/Comparison_of_parser_generators – Ira Baxter May 15 '15 at 12:43