I'm trying to capture tokens from a script written in a pseudo programming language, but the operators (+, -, *, / etc.) are not captured.

I tried this:

[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!

For example, I have this code:

for i = 1 to 10
test_123 = 3.55 + i- -10 * .5
next
msg "this is a ""string"" with quotes in it..."

In this piece of code the regular expression has to match: valid variable names, strings enclosed in quotes, operators like (),+-*/! and numbers like 0.1 123 .5 10.

The result of the regular expression has to be:

'for', 'i', '=', '1', 'to', '10', 'test_123', '=', '3.55', '+' etc....

The problem is that the operators are not matched if I use this regular expression...

Tom
  • That's good. I hope it works for you. Maybe tell us the problem, rather than only that your solution doesn't work. Post the data and what you want. – BambiLongGone Dec 08 '14 at 21:47
  • What are you trying to match with that regular expression? Examples usually help illustrate the problem. – Ansgar Wiechers Dec 08 '14 at 22:34
  • BambiLongGone and Ansgar Wiechers, sorry if my question is a little cryptic... I'm making a tokenizer for a self-invented programming language. My VBScript needs to split the lines of code into strings enclosed in quotes, numbers, valid variable names and operators. The problem is that the operators "+","-","*","/","!" etc. are not matched by the regular expression... – Tom Dec 08 '14 at 23:20
  • Again, an example would help illustrate what you're trying to do. Also, [related](http://stackoverflow.com/a/11906022/1630171). As for edits to your question: reasons for the edits can be seen in the edit history. The URL was probably removed, because it didn't add anything substantial to clarify your question. – Ansgar Wiechers Dec 09 '14 at 10:53
  • Ansgar Wiechers, OK, I'll edit my question with a clear example. – Tom Dec 09 '14 at 16:36

2 Answers

We don't know your requirements, but it seems that in your regex you are capturing only a few non \n, \r etc...

Try something like this, grouping the tokens you want to capture:

'([a-z_]+)|([\.\d]+)|([\+\-\*\/])|(\=)|([\(\)\[\]\{\}])|(['":,;])'

EDIT: With the new information you added to your question, I adjusted the regex and tried it with Python. I don't know VBScript.

import re

test_string = r'''for i = 1 to 10:
test_123 = 3.55 + i- -10 * .5
next
msg "this is a 'string' with quotes in it..."'''

pattern = r'''([\da-z_^\.]+|[\.\d]+|[\+\-\*\/]|\=|[\(\)\[\]\{\}]|[:,;]|".*[^"]"|'.*[^']')'''

print(re.findall(pattern, test_string, re.MULTILINE))

And this is the list with the matches:

['for', 'i', '=', '1', 'to', '10', ':', 'test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', 'next', 'msg', '"this is a \'string\' with quotes in it..."']

I think it captures all you need.

chapelo
  • Thanks chapelo. I wonder if it must be grouped or if that's optional? Also, does the order of the alternatives in a regular expression matter? I was messing around, and if I write the expression like this it works?? \+|\*|\/|\(|\)|&|-|=|,|!|[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d* – Tom Dec 09 '14 at 01:36
  • I think the error is in the part where I search for numbers like: .1 or 123 or 6.66 or 1. – Tom Dec 09 '14 at 03:11
  • @Tom Check the edit, with an adjusted and tested regex. I had to change your string a bit to accept ' between "", or " between ''. – chapelo Dec 09 '14 at 20:05
  • chapelo, it's better but still has a few bugs in it, e.g. "123x" is not a correct number. Quoted strings can't be multiline, at least in my scripting language (but you couldn't know that, so I'm not saying that's your fault). Can you explain this part please? "[\da-z_^\.]" The "^" stands for 'not' in a character class, right? So it says not to allow the "." in variable names? Thanks for the help btw :) – Tom Dec 09 '14 at 22:27
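
On two of the questions in the comments above (does the order matter, and what does the "^" do), here is a small Python sketch. It is not VBScript, and the variable names are only for illustration. It asks each pattern what it matches exactly at the position of the "+". The number part \d*\.?\d* can match nothing at all, so in the original order it succeeds with an empty match before \+ is ever tried. An engine that then steps one character forward after an empty match would skip the operator; I assume that is what happens in VBScript, but I have not verified it. The last lines show that "^" only negates a character class when it is the first character in it.

```python
import re

# Original alternation from the question (inner group made non-capturing).
ORIGINAL = re.compile(
    r'[a-z_]\w*|"(?:[^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!')
# Reordered version from the comment: operators first, numbers last.
REORDERED = re.compile(
    r'\+|\*|\/|\(|\)|&|-|=|,|!|[a-z_]\w*|"(?:[^"\r\n]+|"")*"|\d*\.?\d*')

line = "3.55 + i"
plus_at = line.index("+")

# Alternatives are tried left to right and the first one that succeeds wins.
# \d*\.?\d* succeeds with an empty match here, so \+ is never reached.
original_span = ORIGINAL.match(line, plus_at).span()
# With the operators in front, \+ gets its turn before the number part.
reordered_span = REORDERED.match(line, plus_at).span()
print(original_span, reordered_span)

# The caret negates a class only when it is the first character in it.
caret_literal = re.findall(r'[\da-z_^\.]', 'a^.!')   # "^" is an ordinary member
caret_negated = re.findall(r'[^\da-z_\.]', 'a^.!')   # "^" first: negated class
print(caret_literal, caret_negated)
```

So reordering works, but the more robust fix is probably a number pattern that cannot match the empty string.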
0

This fits my needs, I guess:

"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*

But only white space must be left over, so I have to find a way to capture everything else that is not white space too, if it's not any of the other options in my regular expression.

Characters like "$%µ°" are ignored even when I put "|." after my regular expression :(
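
A likely reason the "|." is never reached: (\d*\.)?\d* can match the empty string, so at a "$" it succeeds with nothing and the alternatives after it are not tried. Below is a minimal Python sketch (again not VBScript; TOKEN and tokenize are names made up for this example) in which every alternative consumes at least one character and a final \S picks up any other non-white-space character. It assumes "1." and ".5" are valid numbers, as mentioned in the comments on the other answer, and it does not reject something like "123x".

```python
import re

# Every alternative consumes at least one character, so no empty matches.
# re.ASCII keeps \w and \d to ASCII, which is an assumption about the language.
TOKEN = re.compile(r'''
    "(?:[^"\r\n]|"")*"         # quoted string on one line, "" is an escaped quote
  | \d+(?:\.\d*)? | \.\d+      # 123  6.66  1.  .5
  | [a-z_]\w*                  # variable name or keyword
  | [-+*/&|!()=,]              # operators and punctuation
  | \S                         # anything else that is not white space
''', re.IGNORECASE | re.VERBOSE | re.ASCII)


def tokenize(line):
    """Return all tokens of one line of script, skipping white space."""
    return [m.group(0) for m in TOKEN.finditer(line)]


print(tokenize('test_123 = 3.55 + i- -10 * .5'))
print(tokenize('msg "this is a ""string"" with quotes in it..."'))
print(tokenize('a $ µ'))
```

The order still matters a little: the \S catch-all has to come last, otherwise it would split every other token into single characters.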

Tom