I am researching ways, tools and techniques to parse code files in order to support syntax highlighting and IntelliSense in an editor written in C#.

Does anyone have any ideas, patterns and practices, tools, or techniques for that?

EDIT: A nice source of info for anyone interested:

Parsing beyond Context-free grammars ISBN 978-3-642-14845-3

sTodorov
  • possible duplicate of [Parser for C#](http://stackoverflow.com/questions/81406/parser-for-c) – Gabe Oct 24 '10 at 15:38
  • Are you trying to parse C# or write a parser in C#? – Gabe Oct 24 '10 at 15:38
  • @Gabe, both. I am trying to write a parser in C# which will parse XML, C#, and hopefully something else :) – sTodorov Oct 24 '10 at 15:42
  • @Gabe, thanks for the suggested post, btw – sTodorov Oct 24 '10 at 15:47
  • If you want to parse multiple languages, have you looked at ANTLR? – Gabe Oct 24 '10 at 15:50
  • @Gabe, ANTLR looks very promising and yes, I want to be able to parse multiple languages by creating grammar files or something like that. – sTodorov Oct 24 '10 at 15:56
  • This rather depends on how sophisticated you want it to be. If you want the full Visual Studio experience you'll need a full parser, but if you just want simple keyword/string highlighting (like StackOverflow provides) then you don't need a parser. All you need is a simple tokenizer that can distinguish between strings and identifiers, and a list of keywords. – arx Oct 24 '10 at 15:58
  • @arx, I would like the VS IntelliSense experience. – sTodorov Oct 24 '10 at 16:08
  • What you are doing is a tough thing, I have done it. It is a very bad path.... – leppie Oct 24 '10 at 16:22
  • @leppie ... haha thanks. I am starting to sense that as well. – sTodorov Oct 24 '10 at 16:23
  • @sTodorov: If you are not doing this for academic purposes, it is kinda pointless (well, you will get recognition, but not much else) – leppie Oct 24 '10 at 16:25
  • @sTodorov: Anyways, what I am trying to say is that you need some kind of resilient parser that knows how to backtrack with least effort. Most parser generators like yacc, etc., can be modified for this behavior, albeit with different flavors of efficiency. – leppie Oct 24 '10 at 16:30
  • @leppie, thanks, I am going to look into yacc as well – sTodorov Oct 24 '10 at 16:31
  • @sTodorov: If going for yacc, look at GPPG which is yacc-based (but for C#). This is the same parser I modified for a 'resilient' backtracking parser for C# in xacc.ide (although not 100% correct, but OK for highlighting and syntax tree purposes). – leppie Oct 24 '10 at 16:35

3 Answers

My favourite parser for C# is Irony: http://irony.codeplex.com/ - I have used it a couple of times with great success.

Here is a Wikipedia page listing many more: http://en.wikipedia.org/wiki/Compiler-compiler

Rob Fonseca-Ensor

There are two basic approaches:
1) Parse the entire solution and everything it references, so you understand all the types involved in the code.
2) Parse locally and do your best to guess what the types etc. are.

The trouble with (2) is that you have to guess, and in some circumstances you just can't tell from a code snippet exactly what everything is. But if you're happy with the sort of syntax highlighting shown on (e.g.) Stack Overflow, then this approach is easy and quite effective.
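
As a rough illustration of approach (2), here is a minimal sketch of a regex-based tokenizer that classifies keywords, strings, comments and identifiers without any real parsing. It is written in Python for brevity; the keyword list, the token names and the `tokenize` function are assumptions made up for this example, and it ignores many real C# features (verbatim strings, block comments, character literals and so on).

```python
import re

# Hypothetical, deliberately incomplete keyword list, for illustration only.
CSHARP_KEYWORDS = {
    "class", "public", "private", "static", "void", "string", "int",
    "return", "if", "else", "new", "using", "namespace",
}

# One alternative per token kind; the first alternative that matches wins.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*)
  | (?P<string>"(?:\\.|[^"\\\n])*")
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<whitespace>\s+)
  | (?P<other>.)
""", re.VERBOSE)


def tokenize(source):
    """Yield (kind, text) pairs that together cover every character of source."""
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        text = match.group()
        # Keywords are just identifiers that appear in the keyword list.
        if kind == "identifier" and text in CSHARP_KEYWORDS:
            kind = "keyword"
        yield kind, text


if __name__ == "__main__":
    for kind, text in tokenize('string s = "if"; // return'):
        if kind != "whitespace":
            print(kind, repr(text))
```

An editor could map each token kind to a colour. Because the string and comment alternatives come first, keywords inside string literals or comments are not highlighted as keywords, which is about as much context as this approach can offer.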

To do (1) then you need to do one of (in decreasing order of difficulty):

  • Parse all the source code. Not possible if you reference 3rd party assemblies.
  • Use reflection on the compiled code to garner type information you can use when parsing the source.
  • Use the host IDE's (if available - so not applicable in your case!) code element interfaces to provide the information you need.
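
The reflection option can be sketched roughly as follows. This is a Python analogue only (using the standard `importlib` and `inspect` modules rather than .NET reflection), and the `completion_candidates` helper is a hypothetical name invented for this example: it inspects an already-built module and lists its public members as completion candidates.

```python
import importlib
import inspect


def completion_candidates(module_name, prefix=""):
    """Return (name, kind) pairs for public members of a module matching prefix."""
    module = importlib.import_module(module_name)
    candidates = []
    # inspect.getmembers returns (name, value) pairs sorted by name.
    for name, member in inspect.getmembers(module):
        if name.startswith("_") or not name.startswith(prefix):
            continue
        if inspect.isclass(member):
            kind = "class"
        elif callable(member):
            kind = "function"
        else:
            kind = "value"
        candidates.append((name, kind))
    return candidates


if __name__ == "__main__":
    # Members an editor might offer after the user types "math.sq"
    print(completion_candidates("math", "sq"))
```

In a C# editor the same idea would presumably load referenced assemblies and enumerate their types and members, so third-party code can contribute to completion without its source being parsed.
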
Jason Williams
  • OP wants to parse multiple languages. There's the "small" problem of actually getting working grammars for the languages you want to process. Legacy languages are hard to do this for, because the standards committees have been decorating them with goo; check out IBM Enterprise COBOL or Fortran 2005. Modern languages are a little easier but even they have pressure to add stuff; try parsing modern VB.NET. I've got 15 years into building parsers using unified infrastructure for a wide range of languages (including those I mentioned) and I'm not hardly done yet :-{ – Ira Baxter Oct 24 '10 at 17:48
  • @Ira: OP doesn't make it very clear what languages are required, but most of my answer stands equally well for any language. But you're right, it's a very nontrivial problem. Visual Studio IntelliSense has been developed for many years by an experienced team, and only really works well in .NET languages - beyond basic syntax highlighting, the support is pretty poor in most other languages, which is a good indicator of the difficulty of the problem the OP is attempting to address. – Jason Williams Oct 24 '10 at 18:24
  • @Ira the feat you are trying to accomplish sounds very serious. I wish you all the success with it. However, what I am researching is mostly support for C#, Ruby, Python, VB.NET and Java. I can only imagine the difficulties involved with parsing legacy languages – sTodorov Oct 24 '10 at 19:35
  • @Jason, I think for now I will concentrate on researching parsing C# and python because of the difference in the structure, e.g. curly brackets and indentation – sTodorov Oct 24 '10 at 19:41
  • @sTodorov: I've done all the languages you've mentioned except for Ruby and that's in progress. If you want to parse these languages fully you need pretty much all that machinery that I've used in some form or another. If all you want is syntax highlighting, you can do a good-enough job with just regular expression matching, because syntax highlighting doesn't have to be always right to be useful. – Ira Baxter Oct 24 '10 at 20:14
  • @sTodorov: I guess what you're looking for is a code model that is flexible enough to represent code elements from the different languages you support, so you can add a parsing "layer" that maps the specific source code to a generic description that can be used for IntelliSense/colouring. My addin (AtomineerUtils) parses C-like languages (C, C++, C#, Java) in this way, and I was surprised how little work it took to add support for VB - there are surprisingly few differences in the parsing once you look past the superficial syntax, so most internal processing methods didn't need to change. – Jason Williams Oct 24 '10 at 20:27
  • @Ira: It definitely sounds like syntax highlighting is the better choice for now as it does not involve so much complexity. I will have a look at how the regex engines can work for me in that regard. BTW, sorry for prying, but the DMS software toolkit seems very interesting. – sTodorov Oct 24 '10 at 20:54
  • @Jason: Yes, I am looking for something exactly like the code model you suggest. Thanks for the pointers and I will have a look in this direction. – sTodorov Oct 24 '10 at 20:57
  • @sTodorov: No need to pry, check out my bio (I assume you have) and the website link there, you can find out plenty about DMS. – Ira Baxter Oct 24 '10 at 23:14

You could take a look at how http://www.icsharpcode.net/ did it. They wrote a book about doing just that, Dissecting a C# Application: Inside SharpDevelop. It even has a chapter called

Implement a parser to provide syntax highlighting and auto-completion as users type

Jonas Elfström