
If I have a string

'x+13.5*10x-4e1'

how can I split it into the following list of tokens?

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

Currently I'm using the shlex module:

import shlex

def tokenize(expr):
    lexer = shlex.shlex(expr)
    tokenList = []
    for token in lexer:
        tokenList.append(str(token))
    return tokenList

tokenList = tokenize('x+13.5*10x-4e1')

But this returns:

['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']

So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers and somehow splitting them, but I'm not sure how to do this or how to merge them back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.

In an ideal world, e and E would not be recognised as letters in the same way, so

'-4e1'

would become

['-', '4e1']

but

'-4x1'

would become

['-', '4', 'x', '1']

Can anybody help?


3 Answers


Use the regular expression module's split() function to split at

  • '\d+' -- digits (number characters) and
  • '\W+' -- non-word characters:

CODE:

import re

print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

If you don't want the dot split off (so that floating-point numbers in the expression stay in one piece), use this instead:

  • [\d.]+ -- digit or dot characters (although this also accepts malformed numbers such as 13.5.5)

CODE:

print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
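
If you also want scientific-notation numbers such as 4e1 kept in one piece (the "ideal world" case from the question), re.split becomes awkward; here is a sketch using re.findall instead, where the number pattern (optional fraction and optional exponent) is my own addition and not part of the original answer:

CODE:

import re

# one number (with optional fraction and optional exponent), one letter,
# or any other single non-space character
pattern = r'\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|[a-zA-Z]|\S'

print(re.findall(pattern, 'x+13.5*10x-4e1'))

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4e1']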
  • Now `13.5` is torn into separate parts too; this can do with some refining. :-P – Martijn Pieters Aug 19 '13 at 11:24
  • @MartijnPieters but that's exactly what the OP was looking for! – Peter Varo Aug 19 '13 at 11:24
  • I suspect the OP *missed* that the floating point number was split too, actually. – Martijn Pieters Aug 19 '13 at 11:25
  • I was looking for it, but long-term it's actually easier with it stuck together. Being new to programming I was just taking what I had from shlex and using another function to stick the decimals back together. So although that will need changing it'll end up being simpler. Thanks! – Martin Thetford Aug 19 '13 at 13:19

Another alternative not suggested here is to use the nltk.tokenize module.
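
The stock tokenizers (such as wordpunct_tokenize) will not split letters from digits, so a custom RegexpTokenizer is needed; a sketch, with the pattern being my own assumption rather than something specified in this answer:

from nltk.tokenize import RegexpTokenizer

# tokenize as runs of digits, runs of letters, or any other single
# non-space character
tokenizer = RegexpTokenizer(r'\d+|[A-Za-z]+|\S')
print(tokenizer.tokenize('x+13.5*10x-4e1'))
# ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']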


Well, the problem is not quite as simple as it seems. I think a good way to get a robust (though, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to build a full-weight tokenizer. Lex-Yacc is a common practice (not only in Python), so ready-made grammars for a simple arithmetic tokenizer may already exist (like this one), and you only have to adapt them to your specific needs.
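
A minimal sketch of what such a tokenizer might look like with PLY (assuming the ply package is installed; the token names and regular expressions are illustrative choices, not taken from an existing grammar):

import ply.lex as lex

# illustrative token set for the expression in the question
tokens = ('NUMBER', 'NAME', 'PLUS', 'MINUS', 'TIMES', 'DOT')

t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DOT    = r'\.'
t_NUMBER = r'\d+'
t_NAME   = r'[a-zA-Z]+'

def t_error(t):
    # skip characters the grammar does not recognise
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('x+13.5*10x-4e1')
print([tok.value for tok in lexer])
# ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']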
