
I have a document in which some lines have spaced-out letters, and I want to remove those extra spaces.

The problem is that the strings don't all follow the same rules: some use just a single space, both between the letters and between the words, while others use two or three spaces between the words.

Examples:

"H e l l o g u y s"
"H e l l o  g u y s"
"H e l l o   g u y s"

All of the above should be converted to --> "Hello guys"

"T h i s i s P a g e 1"  -->  "This is Page 1"

I wrote a script that removes every second space, but not if the next letter is numeric or a capital. It works almost OK, since the processed text is German and the words begin with capital letters almost every time... almost. Anyway, I'm not satisfied with it, so I'm asking whether there is a neat function for my problem.

import re

text = text.strip()                     # remove spaces from start and end
out = text
# Heuristic: only treat the line as spaced-out if roughly every second
# character is a space.
if text.count(' ') >= (len(text) / 2) - 1:
    out = ''
    for idx, c in enumerate(text):
        # Keep a space only if the next character is a digit, another space,
        # or a capital (incl. German umlauts), or if a hyphen precedes it.
        # text[idx + 1] is safe: after strip() the last character is no space.
        if c != ' ' or re.match(r'[\s0-9A-ZÄÜÖ§€]', text[idx + 1]) or (idx > 0 and text[idx - 1] == '-'):
            out += c
text = out
Karl Adler
  • 17
    You picked yourself an impossibly hard task. What spaces should be removed from `"E x p e r t s e x c h a n g e"`? – Martijn Pieters Dec 08 '14 at 16:58
  • 1
    yes, that's of course bit tricky... maybe I need to use some kind of dictionairy... I don't need a 100% solution, but it should work for most of strings. – Karl Adler Dec 08 '14 at 17:00
  • Yes... some data will help you. You don't have any hint to tell the code how to join the letters. – João Paulo Dec 08 '14 at 17:02
  • For the easier examples I'd recommend finding the longest sequence of spaces and using that as a separator, then use Ali Gajani's answer for each word to remove the remaining spaces. – MrFancypants Dec 08 '14 at 17:03
  • Slightly off topic but where are these strings coming from? – Holloway Dec 08 '14 at 17:03
  • Oh well, I just got it... E x p e r t - s e x - c h a n g e and E x p e r t s - e x c h a n g e, good example ;) I'm also satisfied if I get a bullet-proof answer that this is not possible. But I can't believe that I'm the first one with this problem, yet I also have no clue what terms to search for. – Karl Adler Dec 08 '14 at 17:03
  • You can use a dictionary like [this](http://wordlist.aspell.net/) – Ali Gajani Dec 08 '14 at 17:05
  • Maybe you can give preference to bigger words, because small words you can find anywhere... – João Paulo Dec 08 '14 at 17:06
  • 2
    You might have to go into NLP and see n-gram probabilities for great accuracy. – Ali Gajani Dec 08 '14 at 17:08

5 Answers


You can check whether a word is an English word and then split the words accordingly. You could use a dedicated spellchecking library like PyEnchant.

For example:

import enchant

d = enchant.Dict("en_US")
d.check("Hello")  # -> True

This should be a good starting point. But the "Expertsexchange" problem remains.
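For instance, the dictionary check can be used to validate candidate words produced by some splitting step (a minimal sketch; the candidate list is made up):

import enchant

d = enchant.Dict("en_US")

# Validate candidate words after some splitting step.
for word in ["Hello", "guys", "Helloguys"]:
    print(word, d.check(word))
# Hello True
# guys True
# Helloguys False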

lakshmen

Converting "H e l l o g u y s" might be very hard, or beyond the scope of this site. But if you want to convert strings like "H e l l o  g u y s", where the number of spaces between words differs from the number of spaces between letters, you can use the following code:

>>> import re
>>> s1="H e l l o  g u y s"
>>> s2="H e l l o   g u y s"
>>> ' '.join([''.join(i.split()) for i in re.split(r' {2,}',s2)])
'Hello guys'
>>> ' '.join([''.join(i.split()) for i in re.split(r' {2,}',s1)])
'Hello guys'

This code uses the regular expression ' {2,}' to split the string into words wherever two or more consecutive spaces occur, and then removes the remaining single spaces inside each word.
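Wrapped into a small helper, the same idea might look like this (a sketch assuming words are always separated by at least two spaces while letters are separated by exactly one):

import re

def unspace(line):
    # Runs of two or more spaces are word boundaries; single spaces
    # inside each resulting chunk are removed.
    return ' '.join(''.join(word.split()) for word in re.split(r' {2,}', line))

print(unspace("H e l l o  g u y s"))   # 'Hello guys'
print(unspace("H e l l o   g u y s"))  # 'Hello guys'

Note that the single-space case "H e l l o g u y s" would still come out as 'Helloguys', as described above.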

Mazdak

Not the most original answer, but I've seen that your problem almost matches this one. I have taken unutbu's answer and slightly modified it to solve your problem with enchant. If you have any other dictionary, you can use that instead.

import enchant
d = enchant.Dict("en_US") # or de_DE

def find_words(instring, prefix = ''):
    if not instring:
        return []

    if (not prefix) and (d.check(instring)):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if d.check(prefix):
        try:
            solutions.append([prefix] + find_words(suffix, ''))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix))
    except ValueError:
        pass
    if solutions:
        # Prefer the segmentation whose words are longest.
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

inp = "H e l l o   g u y s T h i s i s P a g e 1" 
newInp = inp.replace(" ", "")

print(find_words(newInp))

This outputs:

['Hello', 'guys', 'This', 'is', 'Page', '1']

The linked page certainly is a good starting point for some pragmatic solutions. However, I think a proper solution should use n-grams. This solution could be modified to make use of multiple whitespaces as well, since they might indicate the presence of a word boundary.
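For example, the multiple-whitespace hint could be combined with find_words from above like this (a sketch; find_words_with_hints is a made-up name):

import re

def find_words_with_hints(line):
    # Runs of two or more spaces are treated as reliable word boundaries;
    # find_words only has to segment each chunk in between.
    words = []
    for chunk in re.split(r' {2,}', line.strip()):
        words.extend(find_words(chunk.replace(" ", "")))
    return words

print(find_words_with_hints("H e l l o   g u y s"))  # ['Hello', 'guys']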

Edit: You can also have a look at Generic Human's solution using a dictionary with relative word frequencies.

runDOSrun
  • This is an interesting solution and an interesting link. But just removing all spaces makes the problem in some cases harder to solve, because if we have two or three spaces to separate the words, important information gets lost. Maybe a combination of yours and @Kasra's answer will do it... – Karl Adler Dec 08 '14 at 19:18
  • 1
    Yes, as said. You should use that additional information as well as n-grams for a (near-)optimal solution. – runDOSrun Dec 08 '14 at 19:20
  • I tried that and it works really well, but I noticed that there are many names in the string, which are of course not in the dictionary, so it tries to split them up into a lot of short words. E.g. `Joachim Gauck` turns into `Joachim G au c k`, so an additional custom word list may solve this (see the sketch after these comments). – Karl Adler Dec 08 '14 at 20:45
  • You can create your own dictionary by writing a script that takes the most frequent words from websites and run it on wikipedia.de or some German news site etc. Something like this would work: http://stackoverflow.com/questions/17910060/retrun-most-common-words-in-a-website-such-that-word-count-5 – runDOSrun Dec 08 '14 at 21:07
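Regarding the names problem from the comments: PyEnchant can combine a standard dictionary with a personal word list, which might look like this (a sketch; mywords.txt is a hypothetical file with one name per line, e.g. Joachim and Gauck):

import enchant

# de_DE dictionary plus a personal word list for names etc.
d = enchant.DictWithPWL("de_DE", "mywords.txt")
print(d.check("Gauck"))  # True, because it is in the personal word list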

Demo

This is an algorithm that could do it. Not battle-tested, but just an idea.

d = ['this', 'is', 'page', 'hello', 'guys']
m = ["H e l l o g u y s", "T h i s i s P a g e 1", "H e l l o   g u y s", "H e l l o  g u y s"]
j = ''.join(m[0].split()).lower()  # drop all spaces and lowercase

temp = []
fix = []

for i in j:
    temp.append(i)
    s = ''.join(temp)

    if s in d:           # the accumulated letters form a known word
        fix.append(s)
        del temp[:]

    if i.isdigit():      # digits are taken over as standalone tokens
        fix.append(i)
        del temp[:]      # reset so the digit does not pollute the next word

print(' '.join(fix))

Prints the following:

hello guys for m[0]; feeding it m[1] instead gives this is page 1, covering your supplied test inputs.

Extending

You can use this dictionary, which has one word per line; convert it to a list and play around from there.
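Loading such a one-word-per-line file could look like this (a sketch; the file name is assumed):

# Build a lookup structure from a one-word-per-line dictionary file.
with open("words.txt", encoding="utf-8") as f:
    d = set(line.strip().lower() for line in f)

A set is a better fit than a list here, since the loop above does a membership test for every character.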

Issues

As Martijn suggested, what would you do when you encounter "E x p e r t s e x c h a n g e"? Well, in such scenarios, using n-gram probabilities would be an appropriate solution. For this you would have to look into NLP (Natural Language Processing), but I assume you don't want to go that far.
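To give a flavour of the n-gram idea in its simplest (unigram) form, candidate segmentations can be scored by word frequency (a toy sketch with made-up counts):

import math

# Made-up frequency counts, purely for illustration.
freq = {"experts": 400, "exchange": 900, "expert": 700, "sex": 500, "change": 800}
total = sum(freq.values())

def score(words):
    # Log-probability of a segmentation; unseen words get a small penalty.
    return sum(math.log(freq.get(w, 0.5) / total) for w in words)

candidates = [["experts", "exchange"], ["expert", "sex", "change"]]
print(max(candidates, key=score))  # ['experts', 'exchange']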

Ali Gajani
  • The algorithm seems to take the shortest matching word from the dictionary and ignores everything that's not available in the dictionary. That's too much loss of information for me. – Karl Adler Dec 08 '14 at 19:46
  • Yes, I anticipate that if you use a real dictionary, you will run into bigger challenges, like he, hell, hello. This is a hard problem; you need NLP at some point. – Ali Gajani Dec 08 '14 at 19:47

You cannot do this: the situation where valid word boundaries are represented the same way as the spaces that should be removed is theoretically the same situation as having no spaces at all in the text.

So you can "reduce" your problem to the problem of re-inserting word-boundary spaces into a text with no spaces at all, which is just as impossible, because even with a dictionary containing every valid word (which you do not have), you can either go for a greedy match and insert too few spaces, or go for a non-greedy match and insert too many.
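To illustrate the greedy half of that claim with a toy example (made-up word set; a real dictionary only makes this worse):

# Longest-prefix-first (greedy) segmentation over a toy word set.
words = {"the", "them", "e", "men"}

def greedy_segment(s):
    out = []
    while s:
        for n in range(len(s), 0, -1):  # try the longest prefix first
            if s[:n] in words:
                out.append(s[:n])
                s = s[n:]
                break
        else:
            return None  # dead end: no prefix is a word
    return out

print(greedy_segment("themen"))  # None, although 'the men' is a valid split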

ch3ka
  • maybe - but you'll never have an exhaustive dictionary, nor can the greediness be adjusted well enough to be practical - especially in German, where there are infinitely many valid words. – ch3ka Dec 08 '14 at 17:11
  • 3
    "You cannot do this" is a bit too much. It can be done, just not easily and with 100% confidence. There will be a remaining uncertainty about the detected word boundaries, but this task is definitely solvable (= giving satisfying solutions for most runs) by a number of statistic approaches. – runDOSrun Dec 08 '14 at 17:34
  • For every algorithm imaginable, there exists unlimited input which produces a wrong result. That's - at least to me - identical with "unsolvable". – ch3ka Dec 08 '14 at 17:40
  • 1) A brute-force algorithm can solve it, thus it's solvable: the input is finite, the dictionary is finite, the number of possible word boundaries is finite (expand from 1 to n-1 iteratively), thus the number of potential matches is finite. 2) Even though the problem is solvable, it might be 'very hard'. However, even the Traveling Salesman Problem is solvable by statistical approaches in a number of scenarios. In some scenarios you KNOW you don't have the worst-case input, so you can rely on the average case, or a simple approximation is sufficient. – runDOSrun Dec 08 '14 at 17:48
  • 1
    a brute force algorithm can emit the correct solution, but it cannot tell it apart from the oh-so-many incorrect ones. – ch3ka Dec 08 '14 at 17:50
  • Not 100% correctly but with a certain likelihood! E x p e r t s e x c h a n g e -> whether this is 2 or 3 words can be inferred by n-grams. If exchange and experts are words encountered more often than sex in the specific context of words, it's more likely. – runDOSrun Dec 08 '14 at 17:58