I have a simple problem. I have a number of text files where words have been divided (hyphenated) at the end of lines. Something like this:
toward an emotionless evalu-
ation of objectively gained
I would like to get rid of the hyphenation and join the words again. This could be done simply and quickly with the str.replace() method. However, in some cases there are a few extra line breaks after the hyphen, like this:
end up as a first rate con-
tribution, but that was not
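Handling both cases with replace() means one call per possible number of line breaks, longest pattern first, along these lines:

# Sketch: longest pattern first, otherwise '-\n\n' is only
# half-replaced and leaves a stray '\n' behind
text = text.replace('-\n\n', '')
text = text.replace('-\n', '')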
Rather than pile up several calls to replace(), I've just switched to regular expressions and used re.sub(r'-\n+', '', text):
import re

def replace_hyphens(text):
    return re.sub(r'-\n+', '', text)
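For example, with a doubled line break as in the second sample:

>>> replace_hyphens('end up as a first rate con-\n\ntribution, but that was not')
'end up as a first rate contribution, but that was not'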
This works quite well, but I was wondering how I could achieve the same result with a function coded directly in Python. This is what I came up with:
def join_hyphens(text):
    processed = ''
    i = 0
    n = len(text)
    while i < n:
        # Strip a hyphen only when at least one newline follows it,
        # mirroring the regex r'-\n+'
        if text[i] == '-' and i + 1 < n and text[i + 1] == '\n':
            i += 1  # skip the hyphen
            while i < n and text[i] == '\n':
                i += 1  # skip the newline(s)
            continue
        processed += text[i]
        i += 1
    return processed
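It gives the same output on the sample:

>>> join_hyphens('end up as a first rate con-\n\ntribution, but that was not')
'end up as a first rate contribution, but that was not'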
But of course the performance is abysmal compared to regular expressions. If I time both over 100 iterations on a rather long string, here are the results:
join_hyphens done in 2.398ms
replace_hyphens done in 0.021ms
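A harness along these lines reproduces the comparison (the sample string here is a short stand-in for the real one, and the per-call averaging is an assumption):

import timeit

# Hypothetical stand-in for the "rather long string"
sample = 'end up as a first rate con-\n\ntribution, but that was not\n' * 1000

for func in (join_hyphens, replace_hyphens):
    elapsed = timeit.timeit(lambda: func(sample), number=100)
    print(f'{func.__name__} done in {elapsed / 100 * 1000:.3f}ms')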
What would be the best way to improve performance while still using native Python code?
Edit: Switching to a list, as suggested, significantly improves performance but still fares poorly compared to regular expressions:
def join_hyphens(text):
    processed = []
    i = 0
    n = len(text)
    while i < n:
        # Same logic as above, but appending to a list instead of
        # concatenating immutable strings
        if text[i] == '-' and i + 1 < n and text[i + 1] == '\n':
            i += 1  # skip the hyphen
            while i < n and text[i] == '\n':
                i += 1  # skip the newline(s)
            continue
        processed.append(text[i])
        i += 1
    return ''.join(processed)
Gives:
join_hyphens done in 1.769ms
replace_hyphens done in 0.020ms