To split text that has no spaces between words, one can use wordninja; see "How to split text without spaces into list of words". Here is the code:

sent = "Test12  to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."

import wordninja
print(' '.join(wordninja.split(sent)))

output: Test 12 to separate merged words but keep rest as it is say 1 2 2021 or 1 2 2021

wordninja looks great and works well for splitting merged text. My question is: how can I split text without spaces but keep the dates (and punctuation) as they are? An ideal output would be:

Test 12 to separate merged words but keep rest as it is, say 1/2/2021 or 1.2.2021

Your help is much appreciated!

Sam S.
  • You can't really do this without some kind of lexicon/dictionary to know what an actual word is. – Tim Biegeleisen Dec 02 '21 at 04:54
  • Why not do a basic split with [`sent.split()`](https://docs.python.org/3/library/stdtypes.html#str.split) or [`re.split(r"[\W,/]+", sent)`](https://docs.python.org/3/library/re.html#re.split) and take it from there? These are Python builtins. – Jens Dec 02 '21 at 04:56
  • Just spitballing, but you could maybe find the locations of dates in the original string with regex, use wordninja on every part of the string that isn't a date, then combine the different segments? – 0x263A Dec 02 '21 at 05:58
  • @Jens I believe the idea here is that the words OP is trying to split could be arbitrarily combined so splitting them with builtins would be... painful – 0x263A Dec 02 '21 at 06:01

2 Answers

The idea is to split the string into a list at every date, then iterate over that list, keeping the items that match the date pattern as they are and calling wordninja.split() on everything else. Finally, recombine the list with join.

import re
def foo(s):
    return 'ninja'

string = 'Test12  to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021.'
pattern = re.compile(r'([0-9]{1,2}[/.][0-9]{1,2}[/.][0-9]{1,4})')

# Split the string up by things matching our pattern, preserve rest of string.
string_isolated_dates = re.split(pattern, string)

# Apply wordninja to everything that doesn't match our date pattern, join it all together. OP should replace foo in the next line with wordninja.split()
wordninja_applied = ' '.join([el if pattern.match(el) else foo(el) for el in string_isolated_dates])

print(wordninja_applied)

Output:

ninja 1/2/2021 ninja 1.2.2021 ninja

Note: I replaced wordninja.split() with foo() because I didn't feel like downloading yet another NLP library, but the code still demonstrates modifying the original string while preserving the dates.
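
With the placeholder swapped for a real splitter, the same idea can be wrapped in a small helper. The following is a minimal sketch rather than a tested wordninja integration: split_preserving_dates and its splitter parameter are hypothetical names introduced here, and the example passes a toy splitter so that it runs without any third-party package. Passing wordninja.split as splitter should give the intended behaviour, although punctuation outside the dates would still be dropped by wordninja, as the output in the question shows.

```python
import re

# Same date pattern as above; the capturing group makes re.split keep the dates.
DATE_PATTERN = re.compile(r'([0-9]{1,2}[/.][0-9]{1,2}[/.][0-9]{1,4})')


def split_preserving_dates(text, splitter):
    """Apply `splitter` to every non-date segment and keep dates untouched.

    `splitter` is any callable that maps a string to a list of words,
    for example wordninja.split (assumed, not imported here).
    """
    pieces = []
    for segment in DATE_PATTERN.split(text):
        if DATE_PATTERN.fullmatch(segment):
            pieces.append(segment)            # a date: keep verbatim
        else:
            pieces.extend(splitter(segment))  # everything else: split into words
    return ' '.join(pieces)


def toy_splitter(segment):
    # Hypothetical stand-in for wordninja.split: only separates on whitespace.
    return segment.split()


print(split_preserving_dates('say 1/2/2021 or 1.2.2021.', toy_splitter))
```

Using fullmatch instead of match means a segment is treated as a date only when the whole segment matches the pattern, which is what re.split with a capturing group produces for the date pieces.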

0x263A

I finally arrived at the following code, based on the comments under my post (thanks for the comments):

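import re
import wordninja

sent = "Test12  to separate mergedwords butkeeprest asitis, say 1/2/2021 or 1.2.2021."
# Replace commas with spaces so that tokens such as "asitis," become purely alphanumeric.
sent = re.sub(",", " ", sent)
# Split only alphanumeric tokens; dates contain "/" or "." and are kept as they are.
corrected = ' '.join([' '.join(wordninja.split(w)) if w.isalnum() else w for w in sent.split(" ")])
print(corrected)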

output: Test 12  to separate merged words but keep rest as it is say 1/2/2021 or 1.2.2021.

It is not the most straightforward solution, but it works.
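
A variant of this token-by-token approach can keep the comma as well, which gets closer to the ideal output in the question. This is a minimal sketch under assumptions: split_tokens and toy_splitter are hypothetical names, and the toy splitter stands in for wordninja.split so the example runs without installing anything. Each whitespace-separated token has its trailing punctuation peeled off, only a purely alphanumeric core is handed to the splitter, and the punctuation is reattached afterwards.

```python
import re

# A token is a core followed by optional trailing punctuation, e.g. "asitis" + ",".
TOKEN_RE = re.compile(r'^(.*?)([.,;:!?]*)$')


def split_tokens(text, splitter):
    """Split merged words token by token, leaving dates and punctuation alone.

    `splitter` maps a string to a list of words, e.g. wordninja.split
    (assumed here, not imported).
    """
    out = []
    for token in text.split():  # split() with no argument also collapses double spaces
        core, trailing = TOKEN_RE.match(token).groups()
        if core.isalnum():
            core = ' '.join(splitter(core))  # merged words such as "asitis"
        # Non-alphanumeric cores such as "1/2/2021" pass through unchanged.
        out.append(core + trailing)
    return ' '.join(out)


def toy_splitter(word):
    # Hypothetical stand-in for wordninja.split with a tiny hard-coded lexicon.
    known = {'asitis': ['as', 'it', 'is'], 'Test12': ['Test', '12']}
    return known.get(word, [word])


print(split_tokens('Test12  to asitis, say 1/2/2021 or 1.2.2021.', toy_splitter))
```

With the toy lexicon this prints "Test 12 to as it is, say 1/2/2021 or 1.2.2021.", so the comma, both dates, and the final period all survive.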

Sam S.
  • Oh, that’s not quite what I meant by my comment. If I get a chance, would it help if I wrote up an answer? – 0x263A Dec 05 '21 at 00:27
  • Thanks, 0x263A. Yes, it would be great if you could write up your answer. – Sam S. Dec 05 '21 at 23:43