10

Running the default Pygments lexer on the following C++ text, `class foo{};`, results in this:

(Token.Keyword, 'class')
(Token.Text, ' ')
(Token.Name.Class, 'foo')
(Token.Punctuation, '{')
(Token.Punctuation, '}')
(Token.Punctuation, ';')
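
For reference, the output above can be reproduced with something like this (a minimal sketch; depending on the Pygments version, the whitespace token types and a trailing newline token may differ slightly):

from pygments.lexers import CppLexer

lexer = CppLexer()
# get_tokens yields (token_type, value) pairs for the given source text
for token_type, value in lexer.get_tokens("class foo{};"):
    print((token_type, value))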

Note that the token foo has the type Token.Name.Class.

If I change the class name to foobar, I want to be able to run the default lexer only on the touched tokens, in this case the original tokens foo and {.

Q: How can I save the lexer state so that tokenizing foobar{ will give a token with type Token.Name.Class?

Having this feature would speed up syntax highlighting of large source files that are being changed (the user is typing text), for example right in the middle of the file. There seems to be no documented way of doing this and no information on how to achieve it with the default Pygments lexers.
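
To make the problem concrete: if only the edited fragment is re-lexed in isolation, the class context is lost. A rough illustration (CppLexer is just the default C++ lexer):

from pygments.lexers import CppLexer

lexer = CppLexer()
# Without the preceding 'class' keyword, 'foobar' comes back as plain
# Token.Name rather than Token.Name.Class.
print(list(lexer.get_tokens("foobar{")))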

Are there any other syntax highlighting systems that support this behavior?

EDIT:

Regarding performance, here is an example: http://tpcg.io/ESYjiF
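
In case the link goes stale, the measurement is roughly along these lines (the file name is just a placeholder; note that get_tokens returns a generator, so the real work only happens when the tokens are consumed):

import time
import pygments.lexers

lexer = pygments.lexers.get_lexer_for_filename("sample.cpp")  # placeholder file
with open("sample.cpp") as f:
    source = f.read()

start = time.time()
tokens = lexer.get_tokens(source)   # building the generator is almost instant
setup_time = time.time() - start

start = time.time()
token_list = list(tokens)           # consuming it does the actual lexing
lexing_time = time.time() - start

print("setup: {:.6f}s, lexing: {:.6f}s".format(setup_time, lexing_time))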

Raxvan
  • 6,257
  • 2
  • 25
  • 46
  • Have you looked at the performance impact? Most lexers would do a full parse, since even a small delta can completely change or break the rest of the tokens. For example, changing `foo` to `foo {` introduces another bracket, and the meaning of the rest of the code actually changes. So in any case it may not be a great idea – Tarun Lalwani Jun 22 '18 at 09:03
  • @Tarun Lalwani On a decent machine with a 200 kb file (which is indeed large) I get 0.5 ms total lexer time. With the code formatter I get 0.5 seconds. While the lexer time is "acceptable", the total processing has unacceptable performance (at least by my standards) – Raxvan Jun 22 '18 at 09:30
  • @Tarun Lalwani I also added test code with a 33 kb file. The lexer result seems to be a generator, so that's why the initial lexer time is very small; however, iterating over the tokens reveals the total time spent parsing the code. – Raxvan Jun 22 '18 at 12:06
  • The feature you want to implement is called Rename Symbol; you can find it in VS Code when you press F2. It can be done by renaming the entry in the global string table if you work with something like Flex. – obgnaw Jun 27 '18 at 12:32

1 Answer

6

From my understanding of the source code, what you want is not possible.

I won't dig in and try to explain every single relevant line of code, but basically, here is what happens: when you call Lexer.get_tokens, Pygments preprocesses the text and delegates the actual work to the lexer's get_tokens_unprocessed method.

Ultimately, RegexLexer.get_tokens_unprocessed loops over the defined token rules (something like (("function", ('pattern-to-find-c-function',)), ("class", ('pattern-to-find-c-class',)))) and, for each type (function, class, comment...), finds all matches within the source text, then processes the next type.

This behavior makes what you want impossible, because it loops over token types, not over the text.


To make my point more obvious, I added two lines of code to the library, in the file pygments/lexer.py at line 628:

for rexmatch, action, new_state in statetokens:
    print('looking for {}'.format(action))
    m = rexmatch(text, pos)
    print('found: {}'.format(m))

And ran it with this code:

import pygments
import pygments.lexers

lexer = pygments.lexers.get_lexer_for_filename("foo.h")
sample="""
class foo{};
"""
print(list(lexer.get_tokens(sample)))

Output:

[...]
looking for Token.Keyword.Reserved
found: None
looking for Token.Name.Builtin
found: None
looking for <function bygroups.<locals>.callback at 0x7fb1f29b52f0>
found: None
looking for Token.Name
found: <_sre.SRE_Match object; span=(6, 9), match='foo'>
[...]

As you can see, the token types are what the code iterates over.


Taking that together with the fact that (as Tarun Lalwani said in the comments) a single new character can break the whole source-code structure, you cannot do better than re-lexing the whole text on each update.

Arount
  • 9,853
  • 1
  • 30
  • 43
  • After checking the implementation, you are right; however, a changed token will never affect the types of previous tokens. Tokens after the change can indeed change, but this can also be checked to minimize regex matching. It seems a very big waste to regex-match everything every time. Also, some IDEs (like CLion) really seem to have an issue with this: syntax highlighting takes hours (at least on CLion). – Raxvan Jun 25 '18 at 07:03
  • You are right. It's interesting; let me dig a little and come back to you – Arount Jun 25 '18 at 08:00
  • I did not have time to try it, but saving the stack alongside the token type should do the trick for tokens before a change. For tokens after the change it's more complicated; I'm thinking of also keeping a hash of the remaining text content after the token value. With that hash you should be able to stop regex matching when you encounter a token that matches the hash of an "original" token. If you can provide a working prototype the bounty is yours :). – Raxvan Jun 25 '18 at 08:55
  • In the real world, they just store the state at line ends; then, if a line changes, they just restart from the last line that didn't change. @Raxvan – obgnaw Jun 27 '18 at 12:38
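
A rough sketch of the line-based approach obgnaw describes, assuming (naively) that the lexer can be restarted in its root state at every line boundary; handling multi-line constructs such as block comments correctly would additionally require recording the lexer's state stack at each boundary, as suggested in the comments above:

import pygments.lexers

lexer = pygments.lexers.get_lexer_for_filename("foo.h")

def lex_line(line):
    # Restart lexing in the root state for this single line; only valid
    # when no token (e.g. a /* block comment */) spans a line boundary.
    return list(lexer.get_tokens_unprocessed(line))

# Initial pass: cache the token list of every line.
lines = ["class foo{};", "int x;"]
cache = [lex_line(line) for line in lines]

# The user edits line 0: re-lex only that line, keep the rest of the cache.
lines[0] = "class foobar{};"
cache[0] = lex_line(lines[0])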