
Let's say I have a DataFrame like this:

ID    Name       Description
0     Manny      V e  r y calm
1     Joey       Keen and a n a l y t i c a l
2     Lisa       R a s h and careless
3     Ash        Always joyful

I want to remove the spaces between the individual letters in the Description column while keeping the necessary spaces between words.

Is there a simple way to do this in Pandas?

The Dodo
  • Are the spaced out words always followed or preceded by a word with no spaces between the letters? – duncster94 Nov 21 '18 at 20:38
  • No. It varies. Sometimes it may and sometimes it may not. @duncster94 – The Dodo Nov 21 '18 at 20:42
  • Do you have a vocabulary you can use? Or can these words be effectively anything? – duncster94 Nov 21 '18 at 20:48
  • They can be anything. No patterns at all. Each description is unique and independent from all the other descriptions. – The Dodo Nov 21 '18 at 20:50
  • I don't see how this can be done. For example, the string 'v e r y c a l m' can't be distinguished as two words (not with Pandas anyway). – duncster94 Nov 21 '18 at 20:52
  • You can use the "enchant" module to check whether a word is an English word. But it will fail in cases like `a n a l y t i c s`, because the substring `an` is also an English word. – Sanchit Kumar Nov 21 '18 at 21:08

1 Answer


This is a tricky problem, but one approach that may get you most of the way there is to use negative and positive lookbehinds/lookaheads to encode a few basic rules.

The following example would likely work well enough given what you've described. It will incorrectly merge consecutive "real" words when both have been exploded into separated characters (for example, 'v e r y c a l m' becomes 'verycalm'), but if that's rare this will probably be fine. You could add additional rules to cover more edge cases.

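import re
import pandas as pd

s = pd.Series(['V e  r y calm', 'Keen and a n a l y t i c a l',
               'R a s h and careless', 'Always joyful'])

# Raw string so the pattern is passed to the regex engine unchanged
regex = re.compile(r'(?<![a-zA-Z]{2})(?<=[a-zA-Z]{1}) +(?=[a-zA-Z] |.$)')
# regex=True is required for a compiled pattern in recent pandas versions
s.str.replace(regex, '', regex=True)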

0              Very calm
1    Keen and analytical
2      Rash and careless
3          Always joyful
dtype: object
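
Since the question starts from a DataFrame rather than a Series, the same replacement can be applied to the Description column directly. This is a minimal sketch that assumes the column names and values shown in the question; `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import re
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Name": ["Manny", "Joey", "Lisa", "Ash"],
    "Description": ["V e  r y calm", "Keen and a n a l y t i c a l",
                    "R a s h and careless", "Always joyful"],
})

# Same pattern as above, written as a raw string
pattern = re.compile(r"(?<![a-zA-Z]{2})(?<=[a-zA-Z]) +(?=[a-zA-Z] |.$)")

# Overwrite the column with the cleaned text
df["Description"] = df["Description"].str.replace(pattern, "", regex=True)
print(df)
```

The other columns are left untouched, so this can be dropped into an existing pipeline as a single assignment.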

This regex effectively says:

Match a run of one or more spaces, but only when it is preceded by a single letter. If two letters precede it, the run follows a normal word of two or more letters, so leave it alone. In addition, only remove the run if what follows is a single letter and then another space, or a single character that ends the string.
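
To make those rules easier to read and extend, the pattern can also be written with `re.VERBOSE`, one rule per line. This is only a sketch of the same regex: the names `SPACED_LETTERS` and `collapse` are made up for illustration, and literal spaces are written as `[ ]` because verbose mode ignores bare whitespace in the pattern.

```python
import re

SPACED_LETTERS = re.compile(r"""
    (?<![a-zA-Z]{2})   # the two characters before are not both letters
    (?<=[a-zA-Z])      # the character just before is a letter (a lone letter)
    [ ]+               # the run of spaces to delete
    (?=[a-zA-Z][ ]     # followed by a single letter and then a space
      |.$)             # or by one last character that ends the string
""", re.VERBOSE)

def collapse(text):
    """Remove the spaces inside words that were split into single letters."""
    return SPACED_LETTERS.sub("", text)

print(collapse("Keen and a n a l y t i c a l"))  # Keen and analytical
print(collapse("v e r y c a l m"))               # verycalm
```

The second call shows the limitation mentioned earlier: when two adjacent words are both split into letters, nothing marks the word boundary, so they are merged.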

Nick Becker