
If I have a string

'x+13.5*10x-4e1'

how can I split it into the following list of tokens?

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

Currently I'm using the shlex module:

import shlex

def tokenize(expr):
    lexer = shlex.shlex(expr)
    tokenList = []
    for token in lexer:
        tokenList.append(str(token))
    return tokenList

tokenList = tokenize('x+13.5*10x-4e1')

But this returns:

['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']

So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers and somehow splitting them, but I'm not sure how to do this or how to merge them back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.

In an ideal world, e and E would not be recognised as letters in the same way, so

'-4e1'

would become

['-', '4e1']

but

'-4x1'

would become

['-', '4', 'x', '1']

Can anybody help?


3 Answers


Use the regular expression module's split() function to split at

  • '\d+' -- digits (number characters) and
  • '\W+' -- non-word characters:

CODE:

import re

print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

If you don't want the dot split off (so that floating-point numbers in the expression stay in one piece), use this instead:

  • [\d.]+ -- digit or dot characters (although this also accepts malformed numbers such as 13.5.5)

CODE:

print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
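
If you also want scientific-notation numbers such as 4e1 kept in one piece (the "ideal world" case from the question), re.split becomes awkward; here is a sketch using re.findall instead, where the number pattern (optional fraction and optional exponent) is my own addition and not part of the original answer:

CODE:

import re

# one number (with optional fraction and optional exponent), one letter,
# or any other single non-space character
pattern = r'\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|[a-zA-Z]|\S'

print(re.findall(pattern, 'x+13.5*10x-4e1'))

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4e1']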
  • Now `13.5` is torn into separate parts too; this can do with some refining. :-P – Martijn Pieters Aug 19 '13 at 11:24
  • @MartijnPieters but that's exactly what the OP was looking for! – Peter Varo Aug 19 '13 at 11:24
  • I suspect the OP *missed* that the floating point number was split too, actually. – Martijn Pieters Aug 19 '13 at 11:25
  • I was looking for it, but long-term it's actually easier with it stuck together. Being new to programming I was just taking what I had from shlex and using another function to stick the decimals back together. So although that will need changing it'll end up being simpler. Thanks! – Martin Thetford Aug 19 '13 at 13:19

Another alternative not suggested here is to use the nltk.tokenize module.
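
The stock tokenizers (such as wordpunct_tokenize) will not split letters from digits, so a custom RegexpTokenizer is needed; a sketch, with the pattern being my own assumption rather than something specified in this answer:

from nltk.tokenize import RegexpTokenizer

# tokenize as runs of digits, runs of letters, or any other single
# non-space character
tokenizer = RegexpTokenizer(r'\d+|[A-Za-z]+|\S')
print(tokenizer.tokenize('x+13.5*10x-4e1'))
# ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']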


Well, the problem is not quite as simple as it seems. I think a good way to get a robust (though, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to build a full-weight tokenizer. Lex-Yacc is a common practice (not only in Python), so ready-made grammars for a simple arithmetic tokenizer may already exist (like this one), and you only have to adapt them to your specific needs.
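
A minimal sketch of what such a tokenizer might look like with PLY (assuming the ply package is installed; the token names and regular expressions are illustrative choices, not taken from an existing grammar):

import ply.lex as lex

# illustrative token set for the expression in the question
tokens = ('NUMBER', 'NAME', 'PLUS', 'MINUS', 'TIMES', 'DOT')

t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DOT    = r'\.'
t_NUMBER = r'\d+'
t_NAME   = r'[a-zA-Z]+'

def t_error(t):
    # skip characters the grammar does not recognise
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('x+13.5*10x-4e1')
print([tok.value for tok in lexer])
# ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']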
