I want to clean up files that contain bad formatting. More precisely, I want to replace "normal" spaces with non-breaking spaces wherever a given criterion is met.
For example, if a sentence contains:
"You need to walk 5 km."
I need to replace the space between 5 and km with a non-breaking space.
So far, I have managed to do this:
import os
unites = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
# iterate over and read all files in the directory
for file in os.listdir():
    # check that the entry is a regular file
    if os.path.isfile(file):
        # open the file
        with open(file, 'r', encoding='utf-8') as f:
            # read the file
            content = f.read()
            # search for each unit in the file
            for i in unites:
                if i in content:
                    # find the next character after the first occurrence of the unit
                    next_char = content[content.find(i) + len(i)]
                    # check if the next character is a space
                    if next_char == ' ':
                        # replace the space with a non-breaking space
                        content = content.replace(i + ' ', i + '\u00A0')
But this replaces spaces throughout the document, not only the ones I want. Can you help me?
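For reference, here is a minimal sketch of one possible approach, not a definitive implementation. It assumes the criterion is "a digit, one regular space, then one of the units as a whole word". The names `UNITS`, `NBSP`, `UNIT_PATTERN` and `protect_units` are hypothetical and only used for illustration.

```python
import re

# Same units as in the question; the upper-case names are illustrative only.
UNITS = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
NBSP = '\u00A0'

# Escape each unit and try longer units first (e.g. 'km' before 'm').
_alternatives = '|'.join(sorted(map(re.escape, UNITS), key=len, reverse=True))

# A digit, one regular space, then a unit that ends at a word boundary.
# The lookahead keeps the unit itself out of the match.
UNIT_PATTERN = re.compile(r'(\d) (?=(?:%s)\b)' % _alternatives)


def protect_units(text):
    """Replace the space between a number and a unit with a non-breaking space."""
    return UNIT_PATTERN.sub(r'\1' + NBSP, text)


print(repr(protect_units("You need to walk 5 km.")))
```

Because the pattern requires a digit before the space and a word boundary after the unit, text such as "5 minutes" or "them in" is left alone. The result still has to be written back to the file, since the loop above only reads it.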
EDIT
After UlfR's answer, which was very useful and relevant, I would like to push my criteria further and make my search/replace more complex.
Now I would like to look at the characters before or after a word to decide which spaces to replace with non-breaking spaces. For example:
- In the phrase "Can the search be hypothetical?", I would like the space between "hypothetical" and "?" to be replaced by a non-breaking space.
- In the phrase "In the search it is necessary to refer to the {figure 1.12}", I would like the spaces between "{", "figure" and "}" to be non-breaking spaces, and also the space between "figure" and "1.12" (so all spaces inside the braces in this case).
I tried this:
import re

units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
units_before_after = ['{']
nbsp = '\u00A0'
# text is assumed to hold the contents of the file being processed
rgx = re.sub(r'(\b\d+)(%s) (%s)\b' % (units, units_before_after), r'\1%s\2' % nbsp, text)
print(rgx)
But I'm having some trouble. Do you have any ideas to share?
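To make the two new criteria concrete, here is a rough sketch of how they might be expressed. It rests on assumptions: the source text contains a regular space before the "?", braces are not nested, and every space inside a `{...}` group should become non-breaking. The names `PUNCTUATION`, `BEFORE_PUNCT`, `BRACED` and `protect_extra` are hypothetical.

```python
import re

NBSP = '\u00A0'

# Hypothetical list of characters that must not be separated from the word
# before them; only '?' comes from the examples above.
PUNCTUATION = ['?']

# One regular space immediately followed by one of the listed characters.
_punct = '|'.join(map(re.escape, PUNCTUATION))
BEFORE_PUNCT = re.compile(r' (?=(?:%s))' % _punct)

# A whole {...} group without nested braces (assumption).
BRACED = re.compile(r'\{[^{}]*\}')


def protect_extra(text):
    """Apply both criteria: space before punctuation, and spaces inside braces."""
    text = BEFORE_PUNCT.sub(NBSP, text)
    # Replace every regular space inside each braced group.
    return BRACED.sub(lambda m: m.group(0).replace(' ', NBSP), text)


print(repr(protect_extra("refer to the {figure 1.12}")))
```

Building each pattern from a list with `'|'.join(map(re.escape, ...))` avoids interpolating the Python list itself into the regular expression, which is what `'%s' % units` does in the attempt above.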