Remove spaces from words and generate exact words

Question

I am using Python and I am looking for a way to rearrange these characters into a meaningful sentence and improve readability. A sample input is:

H o w  d o  s m a l l  h o l d e r  f a r m e r s  f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d  p r o d u c t i o n

Output
How do small holder farmers fit into the big picture of world food production

One way is to remove every single space and, where the line has two spaces, keep one of them.

Can anyone suggest more ways?

Edit

See this text line

Inn ovative  b usines s  m odels  and  financi ng  m e chanisms  for  pv  de ploym ent  in  em ergi ng  regio ns

This is my problem, so I can't simply remove the spaces. One idea is to match every set of characters against a dictionary and find the right words, maybe.

So the original text already has two spaces where one is needed, and one space where none is needed? I don't understand the question — Adelin, Jan 03 '18 at 07:13
In the example you posted each character is separated by one whitespace. If that's not the case, I suggest you edit your question and make sure the example is correct. — Savir, Jan 03 '18 at 07:14
@BorrajaX There are also two spaces between words in the original. There is a formatting problem. Maybe the most straightforward way to visualize it would be to replace spaces with dots. — joaquin, Jan 03 '18 at 07:16
Also, you seem to mention one way (you're asking for suggestions on **more** ways). Care to post what that way looks like? — Savir, Jan 03 '18 at 07:17
@joaquin Ah... I was trying to select with the mouse, and it was selecting only one whitespace. It's better now that is enclosed as code (@BearBrown 's edit) Now the problem is that I don't see a whitespace in the *sm* in `sm a l l h o l d e r` or that there's only one whitespace between `f o o d p r o d u c t i o n` **:-D** — Savir, Jan 03 '18 at 07:21
Instead of updating your questions with more specific requirements, take the time to explain *exactly* what you're after **and** what you've tried. — Sayse, Jan 03 '18 at 07:45

score 7 · Answer 1 · answered Jan 03 '18 at 07:17

import re 

a = 'H o w   d o   sm a l l h o l d e r   f a r m e r s  f i t   i n t o   t h e   b i g   p i c t u r e   o f   w o r l d   f o o d p r o d u c t i o n'

s = re.sub(r'(.) ',r'\1',a)

print(s)

How do smallholder farmers fit into the big picture of world foodproduction
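
As a rough illustration of why the substitution works, here is a minimal sketch. The function name `remove_single_spaces` and the sample strings are made up for this example; it assumes every character is followed by one space and that word gaps are two or three spaces wide.

```python
import re


def remove_single_spaces(spaced):
    """Drop the one space that follows each character.

    In a word gap the captured character is itself a space, so one
    space survives there and the words stay separated.
    """
    return re.sub(r'(.) ', r'\1', spaced)


# Word gaps made of two spaces, as in the question.
print(remove_single_spaces("H o w  d o  s m a l l"))

# Word gaps made of three spaces, which is what you get when a space
# is inserted between every pair of characters of a normal sentence.
print(remove_single_spaces(' '.join("How do small")))
```

Both calls should print `How do small`. With wider gaps some extra spaces are left behind, so this is only suitable for uniformly spaced input.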

answered Jan 03 '18 at 07:17

Exprator


Clever! For those that don't know how it works but are curious, `(.) ` matches any character that is followed by a space, `( )` captures that character, and this captured character is accessed later on as `\1`. – Adelin Jan 03 '18 at 07:26
or `re.sub(r'\b \b|\B ', r'', a)`, `re.sub(r'\b \b| \B', r'', a)` – Avinash Raj Jan 03 '18 at 07:36
And note that your regex won't handle more than two spaces. – Avinash Raj Jan 03 '18 at 07:40
Out of curiosity, how does it compare with `' '.join(''.join(word.split()) for word in a.split('  '))`? – alvas Jan 03 '18 at 08:36

score 1 · Answer 2 · answered Jan 03 '18 at 07:24

You can take the string two characters at a time and strip each chunk. A chunk that is entirely whitespace strips to an empty string and marks a word break, so it is replaced by a single space:

>>> ''.join([string[i:i+2].strip() or ' ' for i in range(0, len(string), 2)])
'How do smallholder farmers fit into the big picture of world foodproduction'
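
The chunking idea can be sketched as a small function. The name `collapse_pairs` and the test strings are hypothetical; the sketch assumes the input was produced by putting one space after every character, so that each real character sits at an even index.

```python
def collapse_pairs(spaced):
    """Read the string two characters at a time.

    A chunk such as 'H ' strips to 'H'. A chunk made only of spaces
    strips to '' and is replaced by a single space (a word break).
    """
    return ''.join(
        spaced[i:i + 2].strip() or ' '
        for i in range(0, len(spaced), 2)
    )


# Aligned input: a space was inserted between every pair of characters.
aligned = ' '.join("How do smallholder farmers fit")
print(collapse_pairs(aligned))

# Misaligned input: the word gap is only two spaces wide, so the
# chunk boundaries drift and the word break is lost.
print(collapse_pairs("H o w  d o"))
```

The first call should restore the sentence, while the second prints `Howdo`. The approach depends on strict alignment, unlike the regex above, which tolerates two-space gaps.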

IMCoins · Answer 3 · 2018-01-03T07:59:37.653

Edit_2: The question has changed and is a bit trickier. I am leaving my answer to the previous problem below, but it does not address the current one.

CURRENT PROBLEM

Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns

I advise using a dictionary of real words; there is an SO thread on this.

You would then take your sentence (here Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns) and split it on spaces (seemingly the only separator the fragments have in common).

Here is the pseudo-code solution:

iterating through the string list:
    keeping the currWord index
    while realWord not found:
        checking currWord in dictionary.
        if realWord is not found:
            join the nextWord to the currWord
        else:
            join currWord to the final sentence

Doing this, and keeping track of the currWord index, you can log where a problem occurs and add new rules for your word splitting. You can tell there is a problem when a certain threshold is reached (for instance, a word 30 characters long).
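
A minimal runnable sketch of the pseudo-code above follows. It makes some assumptions of its own: `merge_fragments`, `max_fragments` and the tiny `vocab` set are hypothetical names and data, and a real dictionary would be far larger. It also deviates from the pseudo-code in one respect: it tries the longest run of fragments first, because a fragment such as "Inn" is already a real word and would otherwise be accepted too early.

```python
def merge_fragments(text, vocab, max_fragments=4):
    """Join neighbouring fragments until they form a known word.

    Returns the rebuilt sentence and the indices of fragments for
    which no known word was found (the places worth logging).
    """
    fragments = text.split()
    words = []
    problems = []
    i = 0
    while i < len(fragments):
        # Try the longest run of fragments first, then shorter ones.
        for n in range(min(max_fragments, len(fragments) - i), 0, -1):
            candidate = ''.join(fragments[i:i + n])
            if candidate.lower() in vocab:
                words.append(candidate)
                i += n
                break
        else:
            # Threshold reached without a match: keep the fragment
            # unchanged and remember where the problem occurred.
            problems.append(i)
            words.append(fragments[i])
            i += 1
    return ' '.join(words), problems


# Toy vocabulary, for illustration only.
vocab = {"inn", "innovative", "business", "models", "and", "financing",
         "mechanisms", "for", "pv", "deployment", "in", "emerging",
         "regions"}
text = ("Inn ovative b usines s m odels and financi ng m e chanisms "
        "for pv de ploym ent in em ergi ng regio ns")
sentence, problems = merge_fragments(text, vocab)
print(sentence)
print(problems)
```

With this toy vocabulary the sentence is rebuilt in full and the list of problem indices is empty. Removing a word from `vocab` makes the matching fragments show up in `problems`.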

LAST PROBLEM

Edit : You're right @Adelin, I should have commented.

If I may, here is a simpler program in which you can follow what is going on, in case you dislike using regex for simple uniform cases:

def raw_char_to_sentence(seq):
    """ Splits the "seq" parameter on single spaces. As words are separated by two
        spaces, the split yields an empty string at each word boundary, and
        "raw_char_to_sentence" uses those markers to rebuild the full sentence.
    """
    char_list = seq.split(' ')

    sentence = ''
    word = ''
    for c in char_list:
        # Adding single character to current word.
        word += c
        if c == '':
            # If word is over, add it to sentence, and reset the current word.
            sentence += (word + ' ')
            word = ''

    # The last word is not followed by a double space, so add it explicitly.
    sentence += word

    # Remove a trailing space, if any.
    return sentence.rstrip()

temp = "H o w  d o  s m a l l h o l d e r  f a r m e r s f i t  i n t o  t h e  b i g  p i c t u r e  o f  w o r l d  f o o d p r o d u c t i o n"
print(raw_char_to_sentence(temp))
# outputs : How do smallholder farmersfit into the big picture of world foodproduction

edited Jan 03 '18 at 07:59

answered Jan 03 '18 at 07:26

IMCoins


I don't see how this is simpler at all. – Adelin Jan 03 '18 at 07:27
When you are learning programming and ask how to split a sentence using **simple spaces**, you need to understand, in my humble opinion, the logic behind some operations, and not only use pre-formatted functions. He might not understand how a regex would work, neither how `.join` or `list comprehensions` (that are used in complement of some logic operator `or`) work. :) – IMCoins Jan 03 '18 at 07:30
I agree with you that regular expressions, list comprehensions, one-liners and so on can be confusing for beginners, but while your solution uses basic string manipulation, it's not clear how it works, even for a beginner, which keeps it from being *simpler*. You could try adding some comments to improve your answer and make it more understandable for beginners. – Adelin Jan 03 '18 at 07:37
You're right @Adelin. I should have followed my opinion more rigorously. – IMCoins Jan 03 '18 at 07:46
@Adelin : By the way, he edited his question that is now more complex. – IMCoins Jan 03 '18 at 08:02
Yes, and even with that string, the regular expression way still works. That's why even though an answer *looks confusing*, a beginner should still be pointed towards the correct approach, not necessarily the simple one. It's up to him to find the motivation to understand other *apparently* more complex solutions – Adelin Jan 03 '18 at 08:07
Maybe I'm wrong, but I tried the sentence `Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns` with his regex on my computer just now, and it didn't work. – IMCoins Jan 03 '18 at 08:08
Regexes are a language per se. Python and other programming languages provide regex capabilities, but regex grammar is 'independent' of the language. A regex is not just a 'pre-formatted function'. When you are learning how to process text programmatically, or when using a text editor or a text search facility, it is a good idea to learn about regexes. My two cents – joaquin Jan 03 '18 at 08:09
Because whoever reformatted the string, forgot to add two spaces as words delimiters – Adelin Jan 03 '18 at 08:09
Yes, but he just got his question wrong with his basic formatting ^^' – IMCoins Jan 03 '18 at 08:10
@joaquin Thanks for the input, and I agree with you. I just wanted to give an alternative code. We're not challenging each others with our answers, just complementing and learning ! :) – IMCoins Jan 03 '18 at 08:12
@IMCoins I liked your suggestion. Thanks – Jaswinder Jan 03 '18 at 08:16

alvas · Accepted Answer · 2018-01-03T08:31:07.097

First get a list of words (aka vocabulary). E.g. nltk.corpus.words:

>>> from nltk.corpus import words
>>> vocab = words.words()

Or

>>> from collections import Counter
>>> from nltk.corpus import brown
>>> vocab_freq = Counter(brown.words())

Convert the input into a space-less string:

>>> text = "H o w d o sm a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
>>> ''.join(text.lower().split())
'howdosmallholderfarmersfitintothebigpictureofworldfoodproduction'

Assumptions:

The longer a word, the more it looks like a word
A string that is not in the vocabulary is not a word

Code:

from collections import Counter

from nltk.corpus import brown

text = "H o w d o s m a l l h o l d e r f a r m e r s f i t i n t o t h e b i g p i c t u r e o f w o r l d f o o d p r o d u c t i o n"
# Alternative input from the edited question:
# text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())

vocab_freq = Counter(brown.words())

max_word_len = 10

words = []
# i is a pointer that moves forward through s.
i = 0
while i < len(s):
    # Try the longest candidate first; stop at length 1 so the slice is never empty.
    for j in range(max_word_len, 0, -1):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No candidate matched: keep the character so the loop always advances.
        words.append(s[i])
        i += 1

print(' '.join(words))

[out]:

how do small holder farmers fit into the big picture of world food production

Assumption 2 is heavily dependent on the corpus/vocabulary you have so you can combine more corpora to get better results:

from collections import Counter

from nltk.corpus import brown, gutenberg, inaugural, treebank

vocab_freq = Counter(brown.words()) + Counter(gutenberg.words()) + Counter(inaugural.words()) + Counter(treebank.words())

text = "Inn ovative b usines s m odels and financi ng m e chanisms for pv de ploym ent in em ergi ng regio ns"
s = ''.join(text.lower().split())

max_word_len = 10

words = []
# i is a pointer that moves forward through s.
i = 0
while i < len(s):
    # Try the longest candidate first; stop at length 1 so the slice is never empty.
    for j in range(max_word_len, 0, -1):
        # Check if word in vocab and frequency is > 0.
        if s[i:i+j] in vocab_freq and vocab_freq[s[i:i+j]] > 0:
            words.append(s[i:i+j])
            i = i+j
            break
    else:
        # No candidate matched: keep the character so the loop always advances.
        words.append(s[i])
        i += 1

print(' '.join(words))

[out]:

innovative business models and financing mechanisms for p v deployment in emerging regions
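
To see how much the result depends on the vocabulary without downloading any corpus, the same greedy longest-match idea can be sketched as a function and run against a hand-written word set. The name `segment` and the toy vocabularies are assumptions made for this illustration; they are not part of NLTK.

```python
def segment(s, vocab, max_word_len=10):
    """Greedy longest-match segmentation of a space-less string."""
    words = []
    i = 0
    while i < len(s):
        # Longest candidate first, never longer than what is left.
        for j in range(min(max_word_len, len(s) - i), 0, -1):
            if s[i:i + j] in vocab:
                words.append(s[i:i + j])
                i += j
                break
        else:
            # Unknown character: emit it alone and move on.
            words.append(s[i])
            i += 1
    return words


s = "forpvdeploymentin"

# Without "pv" in the vocabulary the abbreviation falls apart.
small_vocab = {"for", "deployment", "in"}
print(' '.join(segment(s, small_vocab)))

# Adding the missing entry fixes the split.
print(' '.join(segment(s, small_vocab | {"pv"})))
```

The first call should print `for p v deployment in` and the second `for pv deployment in`, which mirrors the `p v` in the output above. Note that `max_word_len` also caps the longest word that can be recognised.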
