1

I have a bunch of text in which each character is separated by a space, like so:

D i g i t a l  a u d i o  &  v i d e o

How do I fix it so that it becomes:

Digital audio & video

The words are not necessarily in title case, so it's not possible to separate the words by that alone.

Furthermore, some parts of the text are 'merged', with letters from adjacent words sitting right beside one another, like

'h o wm a n y w o r ds'

so this one should be

how many words

I think it might need some language processing, but I'm not sure where to start.

DmcZx
  • Are you sure it's `'D i g i t a l  a u d i o  &  v i d e o'` and not `'D i g i t a l   a u d i o   &   v i d e o'`? If it was the latter you could just remove every other character: `txt1 = 'D i g i t a l   a u d i o   &   v i d e o'; txt2 = txt1[::2]; print(txt2)` – Stef Mar 26 '23 at 16:13
  • How did you get this string? Could you `print(repr(that_string))`? I can see getting a string like this from reading UTF-16-encoded text incorrectly, and the "spaces" are really null characters, e.g.: `print('Digital'.encode('utf-16le').decode('latin1'))` -> `D i g i t a l` – Mark Tolonen Mar 26 '23 at 16:24
  • @Stef Most of them are spaced out evenly but some 'words' are sort of merged together like 'h o wm a n y w o r ds' – DmcZx Mar 28 '23 at 05:42
  • @MarkTolonen I got it from reading a PDF file – DmcZx Mar 28 '23 at 05:42
  • @DmcZx In that case you might be interested in [How to split text without spaces into list of words?](https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words). The best answers explain strategies to find words that have been merged together. – Stef Mar 28 '23 at 08:27
  • @Stef That looks like the proper solution. Thank you! – DmcZx Apr 13 '23 at 07:24
  • As shared by Stef, the solution that helped was [How to split text without spaces into list of words?](https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words) – DmcZx Apr 13 '23 at 07:25
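
For the merged case, the idea behind the linked question can be sketched roughly as follows. This is a simplified, dictionary-based variant and not the code from that page: it strips all whitespace and then uses dynamic programming to find a split into known words. `segment` and `VOCAB` are hypothetical names, and the toy vocabulary is an assumption; a real run would need a full word list, and ideally word frequencies to rank ambiguous splits.

```python
def segment(text, vocabulary):
    """Split text into vocabulary words; return None if no full split exists."""
    chars = "".join(text.split())  # drop every whitespace character
    max_len = max(map(len, vocabulary))
    # best[i] holds a list of words covering chars[:i], or None if unreachable
    best = [None] * (len(chars) + 1)
    best[0] = []
    for end in range(1, len(chars) + 1):
        for start in range(max(0, end - max_len), end):
            piece = chars[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                # Among valid splits, prefer the one with the fewest words
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(chars)]


# Hypothetical toy vocabulary (assumption); note the overlapping entries
# "a", "an", "man", which the search has to rule out.
VOCAB = {"how", "many", "words", "a", "an", "man"}

print(segment("h o wm a n y w o r ds", VOCAB))
```

With this vocabulary the call finds the split `how`, `many`, `words`. Text containing a word that is missing from the vocabulary returns `None`, which is why the quality of the word list matters more than the search itself.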

2 Answers

1

Replace two spaces with some unique placeholder string, remove all spaces, then replace the placeholder with one space.

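message = 'D i g i t a l  a u d i o  &  v i d e o'
# Protect the double spaces that mark word boundaries
message = message.replace('  ', 'PLACEHOLDER')
# Remove the single spaces between letters
message = message.replace(' ', '')
# Restore each word boundary as a single space
message = message.replace('PLACEHOLDER', ' ')
print(message)  # Digital audio & video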
John Gordon
  • Fortunately you can just do ``message.strip(" ", "")`` since you have double spaces between words – Rykari Mar 26 '23 at 15:15
  • @Rykari What? That doesn't work. Did you actually try it? – John Gordon Mar 26 '23 at 15:18
  • @JohnGordon This would work for the example, but I have some text that is merged, like 'h o wm a n y w o r ds', so this would partially fix the text for me – DmcZx Mar 28 '23 at 05:52
0

This is what you should do:

my_string = "D i g i t a l  a u d i o  &  v i d e o"
new_string = ' '.join([''.join(word.split()) for word in my_string.split('  ')])

This script breaks the sentence into words by splitting on double spaces. It then takes each word (with its letters separated), splits it into letters, and joins them back together with no spaces. Finally, the words are joined again with single spaces.
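
To make the three steps visible, the one-liner can be unrolled as in the sketch below. The intermediate names `chunks`, `words`, and `merged` are illustrative only and do not appear in the answer; the last lines show what the same steps do to the merged example from the question.

```python
my_string = "D i g i t a l  a u d i o  &  v i d e o"

# Step 1: split on double spaces to get one spaced-out chunk per word
chunks = my_string.split("  ")
# Step 2: inside each chunk, split on whitespace and glue the letters together
words = ["".join(chunk.split()) for chunk in chunks]
# Step 3: rejoin the words with single spaces
new_string = " ".join(words)

print(chunks)
print(words)
print(new_string)

# The merged example has no double spaces, so it is treated as a single word
merged = " ".join("".join(chunk.split()) for chunk in "h o wm a n y w o r ds".split("  "))
print(merged)
```

The approach relies entirely on the double space as a word boundary, so input where that boundary is missing collapses into one run of letters and needs a word-splitting step afterwards.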

Amir Sher
  • I used something like this to remove the spaces that are consistent. However, there are words that are broken and also merged in a way that looks like 'h o wm a n y w o r ds' – DmcZx Mar 28 '23 at 05:47