Is there a way I can use the following (undocumented) re.Scanner to find everything inside of double quotes in order to classify such a match as a string?

    scanner = re.Scanner([
        (r"[-10-9]+", lambda scanner, token: ("INTEGER", int(token))),
        (r"[A-Za-z]+", lambda scanner, token: ("NAME", str(token))),
        (r"[:true::false:]+", lambda scanner, token: ("BOOL", token)),
        (r"[:error:]+", lambda scanner, token: ("ERROR", token)),
        (r'.', lambda scanner, token: None),
    ])
Justin O Barber
user2757849

1 Answer

You can simply add a string regex to the scanner like this:

>>> import re
>>> scanner = re.Scanner([
...     (r"[-10-9]+", lambda scanner, token: ("INTEGER", int(token))),
...     (r"[A-Za-z]+", lambda scanner, token: ("NAME", str(token))),
...     (r"[:true::false:]+", lambda scanner, token: ("BOOL", token)),
...     (r"[:error:]+", lambda scanner, token: ("ERROR", token)),
...     (r'".*?"', lambda scanner, token: ("STRING", token)),  # added STRING regex
...     (r'.', lambda scanner, token: None),
... ])

Now you can test it:

>>> i = '"string"'  # simulated input
>>> t = '"this is a very long string with whitespace"'  # another simulated input
>>> scanner.scan(i)
([('STRING', '"string"')], '')  # ([(token_label, match)], remainder_of_string)
>>> scanner.scan(t)
([('STRING', '"this is a very long string with whitespace"')], '')
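The same idea can be sketched as a standalone script that mixes quoted and unquoted input. This is only an illustration under a few assumptions: the lexicon is trimmed to three token types, the INTEGER pattern is simplified to `[0-9]+`, and the `tag` helper and the sample text are hypothetical names introduced here, not part of the code above.

```python
import re


def tag(label, convert=str):
    """Hypothetical helper: build a Scanner action that labels a token."""
    return lambda scanner, token: (label, convert(token))


# Trimmed-down lexicon (an assumption for this sketch, not the full one above).
scanner = re.Scanner([
    (r"[0-9]+", tag("INTEGER", int)),        # simplified integer pattern
    (r"[A-Za-z]+", tag("NAME")),
    (r'".*?"', tag("STRING")),               # non-greedy: stops at the next quote
    (r".", lambda scanner, token: None),     # skip any other single character
])

text = 'name "this is a very long string with whitespace" 42'
tokens, remainder = scanner.scan(text)
print(tokens)
print(repr(remainder))
```

Because the quoted text starts with `"`, neither the INTEGER nor the NAME rule can match at that position, so the whole quoted span (whitespace included) is handed to the STRING rule as one token, while the unquoted word is still a NAME.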
Justin O Barber
  • Hm... not really. Say I had user input and I typed "string" and then passed it to the scanner: how do I say, OK, everything that has double quotes is a string? – user2757849 Mar 30 '14 at 00:25
  • Yes! One last question: is there a way I can get it to read through whitespace? Right now it will return "string" as a string, but if I had an input of, say, "this is a very long string with whitespace", it takes each individual word and it's not a unified string. If this doesn't make sense, maybe I can clarify a bit better. – user2757849 Mar 30 '14 at 00:41
  • @user2757849 I think I see what you mean. I assume you are talking about the `NAME` regex now. Note the edit above. – Justin O Barber Mar 30 '14 at 00:48
  • Sort of. If I had the string "this is a very long string with whitespace", is there any way I can identify that it's a string even with the whitespace? "this is a very long string" should be evaluated as a string; as of right now, it evaluates the words as names. – user2757849 Mar 30 '14 at 00:52
  • @user2757849 Do you mean to include the double-quotation marks or not? In other words, which input do you want? (1) `"this is a very long string with whitespace"` or (2) `this is a very long string with whitespace`? In my first edit, (1) would have been a `STRING` and (2) would have been a `NAME` (but in a list of individual words). – Justin O Barber Mar 30 '14 at 00:56
  • I would like the quotes to be a part of the string, yes. – user2757849 Mar 30 '14 at 00:57
  • yep it would be string – user2757849 Mar 30 '14 at 01:02
  • A problem with this approach, I think, is that if you feed `"hello", "this is some test"` to the lexer, it will return `hello", "this is some test` as one entire string. Regular expressions are greedy in most cases. – Willem Van Onsem Mar 30 '14 at 02:50
  • @CommuSoft Thanks for the comment. In this case, however, I have already made the regex non-greedy, so `scanner.scan('"hello", "this is some test"')` will in fact return `([('STRING', '"hello"'), ('STRING', '"this is some test"')], '')`. The pattern I used for STRING is `r'".*?"'`. Still, you raise an interesting consideration for the OP, since the OP will then have to deal with two strings instead of one. Thanks again for the comment. – Justin O Barber Mar 30 '14 at 03:22
  • @πόδαςὠκύς hi there, is there any way that I could apply re.scanner to my issue here https://stackoverflow.com/questions/58915263/extract-sentences-in-nested-parentheses-using-python as well...? –  Nov 18 '19 at 14:02
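
On the greedy versus non-greedy point raised in the comments, a small side-by-side comparison may help. This is a minimal sketch: `make_scanner` is a hypothetical helper introduced here, and the lexicon is reduced to a STRING rule plus the catch-all so that only the string pattern varies. The negated character class `"[^"]*"` is included as a common alternative to `".*?"` and is not something the answer itself uses.

```python
import re


def make_scanner(string_pattern):
    """Hypothetical helper: a scanner with one STRING rule and a catch-all."""
    return re.Scanner([
        (string_pattern, lambda scanner, token: ("STRING", token)),
        (r".", lambda scanner, token: None),  # skip commas, spaces, etc.
    ])


text = '"hello", "this is some test"'

# Non-greedy: each match ends at the nearest closing quote.
lazy_tokens, _ = make_scanner(r'".*?"').scan(text)

# Greedy: the match runs on to the last quote in the input.
greedy_tokens, _ = make_scanner(r'".*"').scan(text)

# Negated class: cannot cross a quote, so it also yields two strings.
negated_tokens, _ = make_scanner(r'"[^"]*"').scan(text)

print(lazy_tokens)
print(greedy_tokens)
print(negated_tokens)
```

With the greedy pattern, both quoted pieces and the comma between them come back as a single STRING token. The non-greedy and negated-class patterns each return two separate STRING tokens.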