3

I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.

For example, here is some python code:

def foo(x):
    def bar(y):
        return x+y
    return bar

An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:

def foo(y):
    def bar(y1):
        return y+y1
    return bar

See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.

I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.

Any tips?

luqui
  • 59,485
  • 12
  • 145
  • 204
  • 1
    So basically you're looking for a fully-fledged multi-language parser? – Oliver Charlesworth Apr 29 '11 at 20:15
  • It's possible. I only need information about the intension of names and what they refer to. The other semantics of the language are irrelevant. But AFAICT it does require parsing. – luqui Apr 29 '11 at 20:47
  • A parser alone won't even do it - he would have to emulate scope resolution for each identifier, which for some languages would mean taking anonymously imported scopes into account. This sounds like a virtually impossible task if there are no constraints on what languages should be supported. – 500 - Internal Server Error Apr 29 '11 at 20:47
  • @500 - Internal Server Error, if that is your *real* name, yeah I forgot to specify some constraints. I don't intend to solve the halting problem. So I'm interested in reasonable static resolution. The identifiers in most languages can be statically resolved, except for when "evil things" are going on, and we've designed our project so that evil things will usually not be in play, and it's ok if they break. As this is a hard problem, I don't expect to solve it completely, I just want a solution that is "pretty good". It could be that dumb string substitution is the best trade-off. – luqui Apr 29 '11 at 20:59

3 Answers3

5

To do this safely, you need to be able to to determine

  • all the identifiers (and those things that are not, e.g., the middle of a comment) in your code
  • the scopes of validity for each identifer
  • the ability to substitute a new identifier for an old one in the text
  • the ability to determine if renaming an identifier causes another name to be shadowed

To determine identifiers accurately, you need a least a langauge-accurate lexer. Identifiers in PHP look different than the do in COBOL.

To determine scopes of validity, you have to be determine program structure in practice, since most "scopes" are defined by such structure. This means you need a langauge-accurate parser; scopes in PHP are different than scopes in COBOL.

To determine which names are valid in which scopes, you need to know the language scoping rules. Your language may insist that the identifier X will refer to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, default types all will pretty much require you to build a model of the scopes for the programs, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigage all of these. These structures are different from PHP and COBOL, but they share lots of common ideas so you likely need a library with the common concept support.

To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifyingy the tree you have to regenerate the source text after modifying the AST. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preseriving all of the stuff you reasonably suggest should be preserved. (Your other choice is to keep track in the AST of where the text for the string is, and the read/patch/write the file.)

Before you update the file, you need to check that you haven't shadowed something. Consider this code:

 {  local x;
     x=1;
    {local y;
     y=2;
      {local z;
         z=y
         print(x);
      }
    }
 }

We agree this code prints "1". Now we decide to rename y to x. We've broken the scoping, and now the print statement which referred conceptually to the outer x refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (This would be legal if the print statement printed z).

This is a lot of machinery.

Yes, there is a framework that has almost all of this as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), AST visiting/modification machinery. Ithas prettyprinting machinery to turn ASTs back into text. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g. instanting symbol table scopes and identifier to symbol table entry mappings); it has front ends for many other langauges that don't have scoping implemented yet.

We've just finished an exercise in implementing "rename" for Java. (All the above issues of course appeared). We about about to start one for C++.

Community
  • 1
  • 1
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
1

You could try to create Xtext based implementations for the involved languages. The Xtext framework provides reliable infrastructure for cross language rename refactoring. However, you'll have to provide a grammar a at least a "good enough" scope resolution for each language.

Sebastian Zarnekow
  • 6,609
  • 20
  • 23
  • Interesting, looks promising. I was looking for frameworks because hopefully people have already written grammars for their favorite languages, so that's my hope here. I hope I don't have to be running the eclipse gui tho. I'll look into it, thanks for the link. – luqui Apr 29 '11 at 21:50
  • Xtext is an Eclipse based framework so you would definitly run an Eclipse UI. A syntax zoo with implementations of general purpose language is currently not available. – Sebastian Zarnekow Apr 29 '11 at 21:53
-1

Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:

cp file file.orig
sed -i 's/\b(newTokenName)\b/TEMPTOKEN/g' file
sed -i 's/\b(oldTokenName)\b/newTokenName/g' file

With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]) would work on most C, Java, PHP, and Python, but not Perl (need to add @ and % to the token characters. Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules will be incompatible, and at that point, you'll need to do more and more in the plugin.

David Souther
  • 8,125
  • 2
  • 36
  • 53
  • Yeah I'm beginning to think that it may be easiest to do this. Even though it is dumb, and it has the possibility of substituting within strings (hmm, and the likelihood of a function or variable name appearing in a string is high, because the function will be "about" that thing). A lexer may be more approachable than a full parser, and then this technique could work. – luqui Apr 29 '11 at 21:04
  • I added the copy line. At this point, it'd be easy to wrap into a shell script- sed the files that have the token (using the same rules to grep), and diff file file.orig. In practical usage, that would be a good combination between quick use and programmer control. Theoretically, a "perfect" solution would need to use much of GCC's front-end. – David Souther Apr 29 '11 at 21:22
  • @luqui: yes, this is easiest. As with virtually all "easiest" solutions, it isn't a good one. – Ira Baxter Apr 29 '11 at 22:11