How to split string into words, even if words have punctuation, in Python

Question

I'm trying to make a French translator using one long dictionary. I want to split a string into words, even if the words have punctuation.

I've tried adding entries to the dictionary with the punctuation attached, e.g. {"Hello!": "Bonjour!"}, but that would take quite a long time, and there may be a more compact and simple way to do it.

Code:

frtext = "__"
FRTEXT = []


french = {

    "hello": "bonjour",
    "Hello": "Bonjour",
    "What": "Qu'est-ce que"
}



text = input("Enter text: ")
TEXT = text.split()

for x in range(len(TEXT)):

    if TEXT[x] in french:
        frtext = french[TEXT[x]]

    FRTEXT.append(frtext)

Expected Output:

 ["Hello!"]
 ["Bonjour!"]

Actual Output:

 ["Hello!"]
 ["__"]

Is there a way to do this, and if there is, how do you do it? Any answers will be greatly appreciated.

`re.split(r'\s+', text)`, [shlex](https://docs.python.org/3/library/shlex.html) might be helpful, but please clarify your question a bit further as to what the input is (remove the call to `input()` because it's not clear what was typed in and it's not relevant to the question). — ggorlen, Jun 10 '19 at 15:42
Just a nitpick: `for x in range(len(something)): something[x]` is a code smell. Prefer `for word in text.split(): if word in french: frtext = french[word]` — Adam Smith, Jun 10 '19 at 15:43

Vitor Falcão · Answer 1 · 2019-06-10T16:01:52.780

Converting each word to lowercase lets you ignore the case of the letters. For the punctuation, you could just remove it: anything that is not a letter in the range a-z or A-Z (or a space) gets removed from the text.

A small change so that, if there's no valid translation, the original word is appended anyway:

for word in TEXT:
    word = word.lower()
    if word in french:
        frtext = french[word]
    else:
        frtext = word

    FRTEXT.append(frtext)

An improvement to your code:

frtext = []


translator = {
    'hello': 'bonjour',
    'what': 'qu\'est-ce que'
}

text = input('Enter text: ')

for word in text.split():
    word = word.lower()
    word = translator.get(word, word)
    frtext.append(word)

print(' '.join(frtext))

Removing punctuation would be simple:

import string

final_text = ''
# Keep only ASCII letters and spaces; punctuation, digits and
# accented letters are all dropped.
letters = string.ascii_letters + ' '
for letter in text:
    if letter in letters:
        final_text += letter

Then you process final_text.
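
Since the question expects "Bonjour!" rather than "Bonjour", an alternative to deleting punctuation is to peel it off each word, translate the bare word, and put the punctuation back. The following is only a minimal sketch of that idea: the names `WORD_PARTS`, `translate_word` and `translate` are hypothetical, the two-entry dictionary is illustrative, and re-capitalising the translation is an assumption about the desired behaviour.

```python
import re

# Illustrative mini-dictionary; keys are lowercase on purpose.
translator = {
    "hello": "bonjour",
    "what": "qu'est-ce que",
}

# Three groups: leading punctuation, the word itself, trailing punctuation.
WORD_PARTS = re.compile(r"^(\W*)(.*?)(\W*)$")


def translate_word(word):
    """Translate one whitespace-delimited word, keeping its punctuation."""
    lead, core, trail = WORD_PARTS.match(word).groups()
    # Fall back to the original word when there is no translation.
    translated = translator.get(core.lower(), core)
    # Assumption: a capitalised input word should stay capitalised.
    if core[:1].isupper():
        translated = translated[:1].upper() + translated[1:]
    return lead + translated + trail


def translate(text):
    return " ".join(translate_word(word) for word in text.split())


print(translate("Hello! What, hello?"))
# Bonjour! Qu'est-ce que, bonjour?
```

Apostrophes inside a word (as in "don't") stay part of the word, because only punctuation at the very start or end is split off.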

Of course, this only works for something simple. Going further would require more knowledge and other techniques, such as NLP.

score 2 · Accepted Answer · answered Jun 10 '19 at 15:42

For more complex work with text, it is a good idea to use NLTK. It has many good text algorithms that can simplify text processing (note that it is a rather large library):

import nltk

text = 'Hello! Hello hello, Hello and hello! Hello!'

tokenizer = nltk.WordPunctTokenizer()
tokenizer.tokenize(text)

['Hello',
 '!',
 'Hello',
 'hello',
 ',',
 'Hello',
 'and',
 'hello',
 '!',
 'Hello',
 '!']
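
If installing NLTK is more than the task needs, a similar word/punctuation split can be approximated with the standard library's `re` module. This is a sketch, not NLTK's implementation: the function name `tokenize` is hypothetical, and the pattern is simply one that separates runs of word characters from runs of punctuation.

```python
import re


def tokenize(text):
    # Match either a run of word characters or a run of
    # characters that are neither word characters nor whitespace.
    return re.findall(r"\w+|[^\w\s]+", text)


tokens = tokenize("Hello! Hello hello, Hello and hello! Hello!")
print(tokens)
# ['Hello', '!', 'Hello', 'hello', ',', 'Hello', 'and', 'hello', '!', 'Hello', '!']
```

Each word token can then be looked up in the translation dictionary, while punctuation tokens are passed through unchanged.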

score 1 · Answer 3 · answered Jun 10 '19 at 15:46

Following strictly your code:

for x in range(len(TEXT)):

    if TEXT[x] in french:
        frtext = french[TEXT[x]]

    FRTEXT.append(frtext)

Your append call is made outside of the if block, so it runs for every word. Words that match a dictionary key are appended as translations, but when TEXT[x] doesn't match any key, whatever frtext currently holds is appended instead: the initial "__" string, or the translation left over from an earlier match.
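
One way to avoid the leftover value is to look up each word independently and supply an explicit fallback. The sketch below reuses the question's `french` dictionary; the function name `translate_words` and the `placeholder` parameter are hypothetical.

```python
french = {
    "hello": "bonjour",
    "Hello": "Bonjour",
    "What": "Qu'est-ce que",
}


def translate_words(words, placeholder="__"):
    # dict.get returns the placeholder for unknown words, so no value
    # from a previous iteration can leak into the result.
    return [french.get(word, placeholder) for word in words]


print(translate_words("Hello there hello".split()))
# ['Bonjour', '__', 'bonjour']
```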
