1

Given the zero-based index of a word in a string ("index" is word two in this sentence), where a word is anything delimited by whitespace, I need to find the index of the first character of that word.

My whitespace regex pattern is "( +|\t+)+", which matches any run of spaces and tabs (newline characters are deliberately excluded). I used split() to separate the string into words and then summed the lengths of those words. However, more than one whitespace character may appear between words, so I can't simply add the number of words minus one to that sum and expect it to be accurate every time.
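To make the problem concrete, here is a minimal sketch of the naive approach described above. The name `naive_word_index` is hypothetical, and the pattern `[ \t]+` is assumed to match the same runs of spaces and tabs as "( +|\t+)+", written without a capturing group so that `re.split` does not return the separators.

```python
import re

def naive_word_index(s, idx):
    # Split on runs of spaces and tabs; the separators themselves are discarded.
    words = re.split(r"[ \t]+", s)
    # Assume exactly one whitespace character before each preceding word boundary.
    return sum(len(w) for w in words[:idx]) + idx

single = "This is an example sentence"
double = "This  is an example sentence"  # two spaces after "This"

print(naive_word_index(single, 2))  # 8, correct
print(naive_word_index(double, 2))  # 8, but "an" actually starts at 9
```

The second call is off by one because the split throws away the information about how wide each gap was.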

Example:

>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
Wiktor Stribiżew
Steele Farnsworth

2 Answers

2

Change your regular expression to include the whitespace around each word so that it isn't lost. The expression \s*\S+\s* first consumes leading whitespace, then the word itself, then trailing whitespace, so only the first item in the resulting list can have leading whitespace (and only if the string itself starts with whitespace). Every other item is a word potentially followed by whitespace. Once you have that list, sum the lengths of all the items before the one you want, and add the leading whitespace of the target item to cover the case where the string starts with whitespace.

import re

def get_word_index(s, idx):
    # Each match is one word plus its surrounding whitespace, so no characters are dropped.
    words = re.findall(r'\s*\S+\s*', s)
    return sum(map(len, words[:idx])) + len(words[idx]) - len(words[idx].lstrip())

Testing:

>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
>>> example2 = ' ' + example
>>> get_word_index(example2, 2)
9
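For comparison, a different technique (not the one used in this answer) is to let the regex engine report positions directly with `re.finditer` and `Match.start()`. This is only a sketch: `word_start` is a hypothetical name, and `\S+` treats newlines as separators too, so swap in `[^ \t]+` if only spaces and tabs should delimit words.

```python
import re

def word_start(s, idx):
    # Walk the non-whitespace runs in order and return the offset of the idx-th one.
    for i, match in enumerate(re.finditer(r"\S+", s)):
        if i == idx:
            return match.start()
    raise IndexError("word index out of range")

print(word_start("This is an example sentence", 2))   # 8
print(word_start(" This is an example sentence", 2))  # 9
print(word_start("This \t is  an", 2))                # 11
```

This avoids the length arithmetic entirely, at the cost of raising an explicit error when the string has fewer words than requested.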
TigerhawkT3
  • While I awaited your response, I came up with my own solution that didn't pass my unit tests. However, your solution failed my unit tests the exact same way. I'm therefore going to assume that you're right and that my unit tests are not. Thank you! – Steele Farnsworth Mar 05 '19 at 02:28
0

Maybe you could try with:

your_string.index(your_word)

documentation
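One thing to keep in mind with this approach: `str.index` returns the position of the first occurrence of a substring, not of a whole word, and it takes the word's text rather than its position in the sentence. A small sketch using the question's example string (the variable name `sentence` is just for illustration):

```python
sentence = "This is an example sentence"

# "an" does not occur earlier in the string, so this matches the word itself.
print(sentence.index("an"))  # 8

# "is" first occurs inside "This", so the result is not the start of the word "is".
print(sentence.index("is"))  # 2, while the word "is" starts at 5
```

So this works when the word's text is known and does not appear earlier in the string, but it may not fit the case where only the word's position is given.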

bojan