Hello, I'm new to Python and I'm writing a module that takes a string as input and returns a list of each word, number or symbol, with no whitespace, e.g. (' 10 sweet apples') --> ['10', 'sweet', 'apples']. To do so I have a start value, which marks the current index, and an end value, which is incremented as long as the next character in the string is a letter or digit. So far I've successfully added words, numbers, symbols etc. to a list, which is to be returned at the end of the for loop.

My problem occurs when I reach the last index. I have this code:

def tokenize (lines):
    tokenizedList = []
    for line in lines:
        endValue = 0
        startValue = 0
        while startValue < len(line):   

            if line[endValue].isalpha():
                while line[endValue].isalpha():
                    endValue = endValue + 1
                word = line[startValue : endValue]
                tokenizedList.append(word)
                startValue = endValue
                
            elif line[endValue].isdigit():
                while line[endValue].isdigit():
                    endValue = endValue + 1
                word = line[startValue : endValue]
                tokenizedList.append(word)
                startValue = endValue
            
            elif line[endValue].isspace():
                while line[endValue].isspace():
                    startValue += 1
                    endValue = startValue
            
            else:
                endValue += 1
                word = line[startValue : endValue]
                tokenizedList.append(word)
                startValue = endValue
    
    return tokenizedList

Since the while loops inside the if-statements increment endValue, it will eventually go out of range of the string. I can't figure out how to stop this error from occurring, or how the while loop should be altered so that it knows when to stop after the last letter. Any ideas?

jrd1

1 Answer

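You could simply use the built-in split method. Call it without an argument so that it splits on any run of whitespace and ignores the leading space (with split(' ') the result would start with an empty string):

tokenizedList = ' my 3 words'.split()

returns ['my', '3', 'words']

Note that this only splits on whitespace, so a symbol attached to a word stays part of that token.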
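To see how the separator argument affects the result, here is a quick comparison. The regex variant is only a sketch of one way to also separate symbols; the helper name `regex_tokens` and its ASCII-only pattern are my own assumptions, not something from the question.

```python
import re

text = ' my 3 words'

# With an explicit ' ' separator, the leading space produces an empty string.
with_sep = text.split(' ')

# Without an argument, split() splits on runs of whitespace and drops empties.
no_sep = text.split()


def regex_tokens(s):
    # Hypothetical helper: runs of ASCII letters, runs of digits,
    # or any other single non-whitespace character.
    return re.findall(r'[A-Za-z]+|\d+|\S', s)


print(with_sep)
print(no_sep)
print(regex_tokens('10 sweet apples, please!'))
```

The regex version behaves differently from `str.isalpha()` for non-ASCII letters, so it may need adjusting depending on the input.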

However, if you want to stick to your code, you could add a bounds check to the loop condition. Put it before the character test, so that line[endValue] is never evaluated once endValue has reached len(line):

if line[endValue].isalpha():
    while endValue < len(line) and line[endValue].isalpha():
        endValue += 1
    word = line[startValue : endValue]

Don't forget to change the loops for the digits and the whitespace accordingly.
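Put together, the whole function might look roughly like the sketch below. It keeps the names from the question and checks the bounds in every inner loop. The `return` sits after the `for` loop so that every line is processed, which I am assuming is the intent.

```python
def tokenize(lines):
    """Split each line into runs of letters, runs of digits and single symbols."""
    tokenizedList = []
    for line in lines:
        startValue = 0
        while startValue < len(line):
            endValue = startValue

            if line[startValue].isalpha():
                # Check the bound first so line[endValue] is never out of range.
                while endValue < len(line) and line[endValue].isalpha():
                    endValue += 1
                tokenizedList.append(line[startValue:endValue])

            elif line[startValue].isdigit():
                while endValue < len(line) and line[endValue].isdigit():
                    endValue += 1
                tokenizedList.append(line[startValue:endValue])

            elif line[startValue].isspace():
                # Skip the whitespace without adding anything to the list.
                while endValue < len(line) and line[endValue].isspace():
                    endValue += 1

            else:
                # Any other character is a one-character token.
                endValue += 1
                tokenizedList.append(line[startValue:endValue])

            startValue = endValue

    return tokenizedList


print(tokenize([' 10 sweet apples']))
```

Because Python evaluates `and` from left to right and stops at the first false operand, the index is only read while it is still valid.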

Lukr
  • Thank you! But do any of these solutions include the final word in my list, besides the one using the split method? – Johanna Alm Jul 15 '20 at 15:47
  • Aah, I see you append to the list right after increasing the endValue. Sorry, I missed that line; I will edit my answer in a minute. – Lukr Jul 15 '20 at 15:57
  • Or I misunderstood: I thought a "break" would exit the method completely. Thank you for your help!! It works now :) – Johanna Alm Jul 15 '20 at 15:57
  • break just 'breaks' the inner loop. Another note: you could also set endValue to -1 when it becomes higher than len(line). `line[-1]` will return the last element of `line` (-2 the second last, and so on). – Lukr Jul 15 '20 at 16:02
  • I just edited the answer again to make it more obvious which lines have to be changed, and simplified the logic at that point. – Lukr Jul 15 '20 at 16:09
  • I haven't been able to make this work in such a way that the code includes the last letter of the string. If a string is 8 characters long, the incrementing will stop once endValue reaches 7 (since 7 < 7 is false), cutting off the last letter when slicing the string. And using <= doesn't work either, because string[8] is out of range... pretty stuck atm – Johanna Alm Aug 03 '20 at 12:28
  • If your string has 8 characters, then the highest possible index is 7. Once endValue has reached 7, the `<7` condition will prevent it from increasing further, that's correct. 7 is the last possible index of a string with 8 characters, because the indices in Python (and many other languages) start at 0. So the first character is `string[0]` and the last (for an 8-character string) is `string[7]`. There is no `string[8]` because it would be the ninth character in the word, which does not exist. – Lukr Aug 05 '20 at 08:39
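The confusion in the last two comments seems to come from the difference between indexing and slicing. Indexing at `len(string)` raises an IndexError, but a slice excludes its end index, so the end value has to be allowed to reach `len(string)` for the last character to be included. A small sketch (the sample word and the helper name `scan_letters` are made up for illustration):

```python
word = 'tokenize'  # 8 characters, valid indices 0..7

print(word[7])     # last character
print(word[0:7])   # the slice stops before index 7, so the last letter is lost
print(word[0:8])   # an end value of len(word) is fine in a slice

try:
    word[8]        # indexing, unlike slicing, must stay below len(word)
except IndexError as error:
    print('IndexError:', error)


def scan_letters(s, start):
    """Hypothetical helper: return the index just past the run of letters at start."""
    end = start
    # The bound is tested first, so s[end] is only read while end is valid.
    while end < len(s) and s[end].isalpha():
        end += 1
    return end


end = scan_letters(word, 0)
print(end, word[0:end])
```

So the loop condition can compare against `len(line)` rather than `len(line) - 1`, as long as the bound is checked before the character is read.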