
I am looking for a Pythonic way to split a sentence into words and also store the index information of each word in the sentence, e.g.:

a = "This is a sentence"
b = a.split() # ["This", "is", "a", "sentence"]

Now I also want to store the start and end index of each word:

c = a.splitWithIndices() #[(0,3), (5,6), (8,8), (10,17)]

What is the best way to implement splitWithIndices()? Does Python have a library method that I can use for that? Any method that helps me calculate the indices of the words would be great.

user462455

2 Answers


Here is a method using regular expressions:

>>> import re
>>> a = "This is a sentence"
>>> matches = [(m.group(0), (m.start(), m.end()-1)) for m in re.finditer(r'\S+', a)]
>>> matches
[('This', (0, 3)), ('is', (5, 6)), ('a', (8, 8)), ('sentence', (10, 17))]
>>> b, c = zip(*matches)
>>> b
('This', 'is', 'a', 'sentence')
>>> c
((0, 3), (5, 6), (8, 8), (10, 17))

As a one-liner:

b, c = zip(*[(m.group(0), (m.start(), m.end()-1)) for m in re.finditer(r'\S+', a)])

If you just want the indices:

c = [(m.start(), m.end()-1) for m in re.finditer(r'\S+', a)]
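
One caveat with the one-liner: if the string is empty or contains only whitespace, `zip(*[])` yields nothing and the two-name unpacking raises `ValueError`. A minimal sketch of a wrapper that guards against that case, assuming the same inclusive end index as above (the name `split_with_indices` is hypothetical, not a library function):

```python
import re

def split_with_indices(text):
    """Return (words, spans); each span is an inclusive (start, end) pair."""
    matches = [(m.group(0), (m.start(), m.end() - 1))
               for m in re.finditer(r'\S+', text)]
    if not matches:
        # zip(*[]) produces no tuples, so unpacking into two names would fail
        return [], []
    words, spans = zip(*matches)
    return list(words), list(spans)

words, spans = split_with_indices("This is a sentence")
print(words)  # ['This', 'is', 'a', 'sentence']
print(spans)  # [(0, 3), (5, 6), (8, 8), (10, 17)]
```
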
Andrew Clark
  • @f-j What does '*match' mean here? Thanks. – zfz Dec 06 '12 at 03:37
  • That is called [unpacking argument lists](http://docs.python.org/2/tutorial/controlflow.html#unpacking-argument-lists), or the splat operator. Basically `foo(*[a, b])` will be equivalent to `foo(a, b)`. – Andrew Clark Dec 06 '12 at 07:52
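
To illustrate the unpacking described in the comment above, here is a small sketch using a shortened version of the `matches` list from the answer (the variable name `pairs` is only for illustration):

```python
pairs = [('This', (0, 3)), ('is', (5, 6))]

# zip(*pairs) passes each element of pairs as a separate argument,
# so it is the same call as zip(pairs[0], pairs[1]).
unpacked = list(zip(*pairs))
explicit = list(zip(pairs[0], pairs[1]))

# The result regroups the data: all words first, then all index pairs.
words, spans = zip(*pairs)
print(words)  # ('This', 'is')
print(spans)  # ((0, 3), (5, 6))
```
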

I think it's more natural to return the start and end of the corresponding slices, e.g. (0, 4) instead of (0, 3):

>>> from itertools import groupby
>>> def splitWithIndices(s, c=' '):
...     p = 0  # start index of the current run
...     for k, g in groupby(s, lambda x: x == c):
...         q = p + sum(1 for _ in g)  # exclusive end index of the run
...         if not k:  # a run of non-separator characters, i.e. a word
...             yield p, q  # or p, q-1 if you are really sure you want that
...         p = q
...
>>> a = "This is a sentence"
>>> list(splitWithIndices(a))
[(0, 4), (5, 7), (8, 9), (10, 18)]

>>> a[0:4]
'This'
>>> a[5:7]
'is'
>>> a[8:9]
'a'
>>> a[10:18]
'sentence'
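
The same half-open convention can also be had from the regex approach, since `Match.span()` returns `(start, end)` with an exclusive end. A minimal sketch, assuming words are separated by any whitespace rather than only the single character `c` (the name `split_with_spans` is hypothetical):

```python
import re

def split_with_spans(text):
    """Return half-open (start, end) spans, usable directly as slice bounds."""
    return [m.span() for m in re.finditer(r'\S+', text)]

a = "This is a sentence"
spans = split_with_spans(a)
# Slicing with each span recovers the original words.
words = [a[start:end] for start, end in spans]
print(spans)  # [(0, 4), (5, 7), (8, 9), (10, 18)]
print(words)  # ['This', 'is', 'a', 'sentence']
```
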
John La Rooy