Separating/tokenizing dots from words but not from digits in Python

Question

I'm trying to separate dots in German sentences from words but not from digits, e.g.:

"Der 17. Januar war ein toller Tag. Heute ist es auch schön."

should become

"Der 17. Januar war ein toller Tag . Heute ist es auch schön . "

But I can't find a solution for this. I tried to use the re module in Python without success.

line = re.sub(r'[^0-9]+\.', ' . ', line)

just results in

"Der 17. Januar war ein toller Ta . Heute ist es auch schö . "

This may be an XY problem. If this is supposed to be part of an NLP pipeline, you should use a proper tokenizer, like the ones in [NLTK](https://stackoverflow.com/a/15057966/1346276) or [spacy](https://spacy.io/docs/usage/processing-text) (I know that spacy comes with a German model built in; not sure about NLTK). – phipsgabler, Oct 24 '17 at 15:36

Ajax1234 · Answer 1 · 2017-10-24T15:39:01.510

score 2

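You can use a positive lookbehind in your regex. A lookbehind checks the character before the dot without consuming it, so the last letter of the word is not replaced:

import re

s = "Der 17. Januar war ein toller Tag. Heute ist es auch schön."
# Match a dot only if it directly follows an ASCII letter (lookbehind) and is
# followed by whitespace or the end of the string; "17." is left untouched.
final_string = re.sub(r"(?<=[a-zA-Z])\.(\s|$)", " . ", s)
print(final_string)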

Output:

Der 17. Januar war ein toller Tag . Heute ist es auch schön .
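
One caveat for German text: `[a-zA-Z]` does not cover ä, ö, ü or ß, so a sentence ending in "Spaß." would be left unchanged. A minimal sketch of a Unicode-aware variant follows, assuming Python 3 `str` input; the names `LETTER_DOT` and `separate_dots` are made up for this example and are not part of the original answer.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which in Python 3 matches any Unicode letter, including ä, ö, ü and ß.
LETTER_DOT = re.compile(r"(?<=[^\W\d_])\.(\s|$)")


def separate_dots(text):
    """Put spaces around a dot that follows a letter; leave '17.' alone."""
    return LETTER_DOT.sub(" . ", text)


print(separate_dots("Am 17. Januar hatten wir Spaß. Heute auch."))
```

As with the pattern above, the whitespace after the dot is part of the match, so the result ends with a trailing space when the input ends with a dot.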

edited Oct 24 '17 at 15:39

answered Oct 24 '17 at 15:32

Thanks! The regex did work very well and in addition I could easily modify it for a few similar problems :) – ke_let Oct 26 '17 at 07:40

score 1 · Answer 2 · answered Oct 24 '17 at 16:13

In case you don't want to use a regex, here is an alternative.

def tokenize_using_dot(s_input):
    s_list = s_input.split()

    for idx, token in enumerate(s_list):
        # Detach only the trailing dot, unless the rest of the token is a number ("17.").
        if token.endswith('.') and not token[:-1].isdigit():
            s_list[idx] = token[:-1] + ' .'
    return ' '.join(s_list)


s = "Der 17. Januar war ein toller Tag. Heute ist es auch schön."
print(tokenize_using_dot(s))

Output:

Der 17. Januar war ein toller Tag . Heute ist es auch schön .

As @phg commented, it would be a good idea to use a proper tokenizer, such as one from the NLTK suite, for this type of task.
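
If installing NLTK or spaCy is not an option, the idea of producing a token list can be roughly approximated with the standard library alone. The sketch below is only an illustration of that idea, not how those libraries work; `TOKEN` and `simple_tokenize` are hypothetical names, and the pattern assumes that a number directly followed by a dot should always stay together.

```python
import re

# Alternatives are tried left to right: a number with its dot ("17."),
# then a run of word characters, then any single punctuation character.
TOKEN = re.compile(r"\d+\.|\w+|[^\w\s]")


def simple_tokenize(text):
    """Return a list of tokens; hypothetical helper, not an NLTK or spaCy API."""
    return TOKEN.findall(text)


tokens = simple_tokenize("Der 17. Januar war ein toller Tag. Heute ist es auch schön.")
print(tokens)
print(" ".join(tokens))
```

Like the two approaches above, this keeps the dot attached when a sentence ends in a number (for example "im Jahr 2017."), which is one reason a trained tokenizer is usually the better choice for real text.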
