Replace spaces with non-breaking spaces according to a specific criterion

Question

I want to clean up files that contain bad formatting. More precisely, I want to replace "normal" spaces with non-breaking spaces according to a given criterion.

For example:

If in a sentence, I have:

"You need to walk 5 km."

I need to replace the space between 5 and km with a non-breaking space.

So far, I have managed to do this:

import os

unites = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']

# iterate and read all files in the directory
for file in os.listdir():
    # check if the file is a file
    if os.path.isfile(file):
        # open the file
        with open(file, 'r', encoding='utf-8') as f:
            # read the file
            content = f.read()
            # search for each unit in the file
            for i in unites:
                if i in content:
                    # find the next character after the unit
                    next_char = content[content.find(i) + len(i)]
                    # check if the next character is a space
                    if next_char == ' ':
                        # replace the space with a non-breaking space
                        content = content.replace(i + ' ', i + '\u00A0')

But this replaces every space that follows a unit string anywhere in the document, not just the ones I want. Can you help me?

EDIT

After UlfR's answer, which was very useful and relevant, I would like to push my criteria further and make my search/replace more complex.

Now I would like to look at the characters before/after a word in order to replace spaces with non-breaking spaces. For example:

Given the phrase "Can the search be hypothetical?", I would like the space between hypothetical and ? to be replaced by a non-breaking space.
Likewise, given "In the search it is necessary to refer to the {figure 1.12}", I would like the spaces between {, figure and } to be non-breaking spaces, and also the space between figure and 1.12 (so all spaces in this case).

I've tried this:

units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
units_before_after = ['{']

nbsp = '\u00A0'

rgx = re.sub(r'(\b\d+)(%s) (%s)\b'%(units, units_before_after),r'\1%s\2'%nbsp,text))

print(rgx)

But I'm having some trouble. Do you have any ideas to share?

Your algorithm is incorrect for general text, e.g. it confuses 'kmeans' with 'km'. Based on your question, you also need to check the character before the unit. — Mohammad Shokouhi Gol, Oct 10 '22 at 13:19

score 1 · Accepted Answer · answered Oct 10 '22 at 14:06

You should use the re module to do the replacement, like so:

import re

text = "You need to walk 5 km or 500000 cm."
units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
nbsp = '\u00A0'

print(re.sub(r'(\b\d+) (%s)\b'%'|'.join(units),r'\1%s\2'%nbsp,text))

Both the search and replace patterns are built dynamically, but basically you have a pattern that matches:

A word boundary \b (so the digits start a new token)
One or more digits \d+
One space
One of the units km|m|cm|...
A word boundary \b (so 'km' does not match inside 'kmeans')

Then we replace all of that with the two captured groups, with the nbsp string between them.
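The pattern described above can be wrapped in a small function and applied to files, roughly as sketched here. The names add_nbsp_before_units and fix_file are hypothetical helpers, not part of re, and the use of re.escape is an extra precaution in case a unit ever contains a regex metacharacter.

```python
import re
from pathlib import Path

NBSP = '\u00A0'
UNITS = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']

# Same pattern as above: boundary, digits, one space, a unit, boundary.
# re.escape is an added precaution for units with special characters.
UNIT_PATTERN = re.compile(r'(\b\d+) (%s)\b' % '|'.join(map(re.escape, UNITS)))


def add_nbsp_before_units(text):
    """Replace the space between a number and a unit with a non-breaking space."""
    return UNIT_PATTERN.sub(r'\1' + NBSP + r'\2', text)


def fix_file(path):
    """Hypothetical helper: rewrite one UTF-8 text file in place."""
    path = Path(path)
    content = path.read_text(encoding='utf-8')
    path.write_text(add_nbsp_before_units(content), encoding='utf-8')


if __name__ == '__main__':
    print(add_nbsp_before_units("You need to walk 5 km or 500000 cm."))
```

Because of the trailing \b, a string such as "5 kmeans" is left untouched. A unit that is also an ordinary word (for example 'in' in "5 in the morning") would still be matched, so the unit list may need trimming for prose-heavy files.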

See the re module documentation for more info on how to use regular expressions in Python. It's well worth the time invested to learn the basics, since it's a very powerful and useful tool!
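For the follow-up cases in the question's edit, one possible approach is sketched below. It assumes the input really contains a space before the punctuation mark and spaces inside the braces. The function names and the default set of marks are hypothetical choices for illustration.

```python
import re

NBSP = '\u00A0'


def nbsp_before_punctuation(text, marks='?!:;'):
    """Replace a single space that directly precedes one of the given marks."""
    # Lookbehind: a non-space character before the space.
    # Lookahead: one of the marks right after the space.
    pattern = re.compile(r'(?<=\S) (?=[%s])' % re.escape(marks))
    return pattern.sub(NBSP, text)


def nbsp_inside_braces(text):
    """Replace every space between a '{' and the next '}' (no nesting)."""
    return re.sub(r'\{[^{}]*\}',
                  lambda m: m.group(0).replace(' ', NBSP),
                  text)


if __name__ == '__main__':
    print(nbsp_before_punctuation("Can the search be hypothetical ?"))
    print(nbsp_inside_braces("refer to the { figure 1.12 } here"))
```

Using a function as the replacement argument of re.sub keeps the brace case simple: the regex only locates each {...} span, and the plain str.replace handles the spaces inside it.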

Have fun :)

Thank you, regex in Python is great! :) Such a powerful tool. I'm trying to make another pattern but I'm having some trouble. I will update the question, in case you could help me... — Satanas, Oct 10 '22 at 15:36
