
I've been trying to make a tokenizer for a language as an exercise. For example, I'm trying to tokenize the code below:

num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}

and I've been trying to use this regex

re.findall("[\'\w\-]+",text)

but I get outputs like this one: `vecsum(vec`

while I want to get it like this: `["vecsum", "(", "vec"]`

I want it to split on tokens like `;` and `(` even when there is no whitespace around them.

Brad Solomon
  • `'` is already part of `\S`, no need to add that separately. – Martijn Pieters Nov 11 '18 at 17:28
  • Make sure to use the `r` before strings that have literal backslashes. – trincot Nov 11 '18 at 17:35
  • Do you mean this: `findall(r"\w+|[^\w\s]+", text)`? Or similar: `r"\S+?(?:\b|$)"`. – trincot Nov 11 '18 at 17:38
  • @trincot: that still puts symbols together into one token: `if(foo){bar++;baz--}` would result in `['if', '(', 'foo', '){', 'bar', '++;', 'baz', '--}']`; note the joining of `++;` and `--}`. – Martijn Pieters Nov 11 '18 at 17:54
  • Just trying to get some info from the OP, who apparently chooses to remain silent. ;-) – trincot Nov 11 '18 at 17:58
  • Sorry, I'm answering a bit late. The code @trincot put here still puts some of the tokens together. – Mahdi Talebi Nov 11 '18 at 18:17
  • @mahditalebi, could you specify which language you are parsing, so we can know what constitutes a token? If it is C, are you looking for parsing the *complete* language, including comments, literals (think of `0b01`, `4e-9`, strings)? If a subsection only, please specify your requirements. – trincot Nov 11 '18 at 18:41
  • This link is in Persian, but the BNF for the TSLANG language is at the end of the file: http://nit.rudi.ir/cdts.pdf – Mahdi Talebi Nov 11 '18 at 18:54

1 Answer


Tokenizing a C-like language requires more work than just splitting on whitespace (which is what you are doing right now).

There are at least 3 types of tokens in such languages:

  • Single-character tokens, such as `(`, `)`, `;`, `+` and `=`
  • Multi-character tokens: identifiers, keywords, and numbers.
  • Strings; everything between an opening quote and the matching closing quote (with support for escapes, and varying degrees of support for special kinds of strings that could, say, contain newlines).

I'm ignoring comments here, which are either defined as running from a starting sequence until the end of a line (`# ...`, `// ...`, etc.) or from a starting sequence to an ending sequence any number of lines away (`/* ... */`).

You could define a regex that can tokenize the first two types, and then perhaps use the output of that to handle strings (when you get a `"` token, find the next `"` token without a `\` token right in front of it, then take everything in between, whitespace and all, as the string).
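One way to sketch that quote-pairing step is a small scanner that walks the raw text from an opening quote; the `read_string` helper below is an illustrative name, not something from the answer, and it assumes only backslash escapes:

```python
def read_string(text, i):
    """Return (string_token, next_index) for a '"'-delimited string
    starting at index i, honoring backslash escapes."""
    j = i + 1
    while j < len(text):
        if text[j] == '\\':      # escape: skip the escaped character
            j += 2
        elif text[j] == '"':     # unescaped closing quote found
            return text[i:j + 1], j + 1
        else:
            j += 1
    raise ValueError('unterminated string literal')

print(read_string('"abc" rest', 0))  # ('"abc"', 5)
```

The returned index lets the caller resume ordinary tokenizing right after the closing quote.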

Such a tokenizer then needs at least two alternatives: one for single-character tokens and one for multi-character tokens. The multi-character tokens form a further group of options:

r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'

I used the Wikipedia article Operators in C and C++ as a guide for what single-character tokens to look for.

For your sample input, this produces:

['num', 'vecsum', '(', 'vec', 'A', ')', '{', 'num', 'n', ';', 'n', '=', '(', '5', '+', '2', ')', ';', 'return', 'n', ';', '}']
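As a quick sanity check, the two-part regex can be run directly on the sample input:

```python
import re

# Two-part tokenizer: single-character operator/punctuation tokens,
# or runs of word characters (identifiers, keywords, numbers).
TOKEN = re.compile(r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))')

source = """\
num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}
"""

tokens = TOKEN.findall(source)
print(tokens)
# ['num', 'vecsum', '(', 'vec', 'A', ')', '{', 'num', 'n', ';',
#  'n', '=', '(', '5', '+', '2', ')', ';', 'return', 'n', ';', '}']
```

Because the pattern uses only non-capturing groups, `findall()` returns the whole match for each token.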

If you must parse multi-symbol operators as single tokens, then you also must include these as separate patterns in the regex, listed before the single-character class (and with longer operators before their shorter prefixes, so `<<=` is tried before `<<`), e.g.

(
    r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--|\+=|-='
    r'|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'
)
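Sketched as runnable code (note the `+=` must be escaped as `\+=`, since a bare `+` after `|` is a regex syntax error, and the three-character operators are listed before their two-character prefixes so `<<=` is not split into `<<` and `=`):

```python
import re

# Extended pattern: longest operators first, then single-character
# tokens, then identifier/keyword/number runs.
OP_TOKEN = re.compile(
    r'<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--|\+=|-='
    r'|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+'
)

tokens = OP_TOKEN.findall('if(a==b){n+=2;m<<=1;}')
print(tokens)
# ['if', '(', 'a', '==', 'b', ')', '{', 'n', '+=', '2', ';',
#  'm', '<<=', '1', ';', '}']
```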

but then you are half-way to a full-blown tokenizer that defines patterns for each type of literal and keyword, and you may as well start breaking out this huge regex into such constituent parts. See the Python `tokenize` module source code for an example of such a tokenizer; it builds up a large regex from component parts to produce typed tokens.

The alternative is to stick with the super-simple 2-part tokenizer regex and use re.finditer() to make decisions about tokens in context. With a start and end position in the string, you can detect that `=` was directly preceded by `=`, and so know you have the `==` comparison operator rather than two assignments. I used this in a simple parser for a SQLite full-text search query language before (look for the `_terms_from_query()` method in the code for this answer), if you want to see an example.
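A minimal sketch of that idea, merging two touching `=` tokens into `==` using the match positions from `finditer()` (the function name here is illustrative, not from the linked answer):

```python
import re

# The simple two-part tokenizer regex from earlier in the answer.
TOKEN = re.compile(r'[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+')

def tokens_with_merged_eq(text):
    """Tokenize, then merge two adjacent '=' matches into '==' when
    they touch in the source (no whitespace between them)."""
    out = []
    prev_end = -1
    for m in TOKEN.finditer(text):
        if (out and out[-1] == '=' and m.group() == '='
                and prev_end == m.start()):
            out[-1] = '=='      # the two '=' were adjacent: one token
        else:
            out.append(m.group())
        prev_end = m.end()
    return out

print(tokens_with_merged_eq('if (a == b) c = 1;'))
# ['if', '(', 'a', '==', 'b', ')', 'c', '=', '1', ';']
```

Because the merge checks `prev_end == m.start()`, `a = = b` (with a space between the two `=`) still yields two separate `=` tokens.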

Martijn Pieters
  • As you said, I currently have this: `re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)`, which works, but it doesn't recognise multi-character tokens like `==` or `<=`. – Mahdi Talebi Nov 11 '18 at 18:13
  • I tested your code and still had the same problem: it recognised `==` as 2 different tokens. – Mahdi Talebi Nov 11 '18 at 18:15
  • @mahditalebi I didn’t say my approach would recognise multi-character operators. You’d have to add those as separate groups *before* the single-character operator pattern. Serious regex-based tokensizers use a large number of specific patterns. – Martijn Pieters Nov 11 '18 at 18:34