In the following input, I am trying to replace the numbers and \n with '' and ' ' respectively.
THE SONNETS\n\n 1\n\nFrom fairest creatures we desire increase,\nThat thereby beauty’s rose might never die,\nBut as the riper should by time decease,\nHis
she hies, 1189\nAnd yokes her silver doves; by whose swift aid\nTheir mistress mounted through the empty skies,\nIn her light chariot quickly is convey’d; 1192\n Holding their course to Paphos, where their queen\n Means to immure herself and not be seen.\n'
input_var is read from a file that has the above content:
file_name = 'sample.txt'
file = open(folder+file_name, mode='r', encoding='utf8')
input_var = file.read()
file.close()
A screenshot of the file is attached.
The data in the file is:
THE SONNETS
1
From fairest creatures we desire increase,
That thereby beauty’s rose might never die,
But as the riper should by time decease,
His
she hies, 1189
And yokes her silver doves; by whose swift aid
Their mistress mounted through the empty skies,
In her light chariot quickly is convey’d; 1192
Holding their course to Paphos, where their queen
Means to immure herself and not be seen.
To identify the numbers I have used the regex [\s]{3,}\d{1,}\\n (there have to be at least 3 spaces before the number; see this link for testing of the regex).
I am using the following code, which I got from a few answers on Stack Overflow, to replace both the regular expression and \n.
Code 1 -
# Remove the numbers in sonnets and at the end of lines
pattern = {r'[\s]{3,}\d{1,}\\n' : '',
r'\\n' : ' '
}
regex = re.compile('|'.join(map(re.escape, pattern.keys( ))))
output_var = regex.sub(lambda match: pattern[match.group(0)], input_var)
Code 2 -
rep = dict((re.escape(k), v) for k, v in pattern.items())
pattern_test = re.compile("|".join(rep.keys()))
output_var = pattern_test.sub(lambda m: rep[re.escape(m.group(0))], input_var)
Code 3 -
for i, j in pattern.items():
    output_var = input_var.replace(i, j)
where input_var holds the text shown above. None of the three replaces anything.
I have also tried
pattern = {r'[\s]{3,}\d{1,}\n' : '',
r'\n' : ' '
}
but it does not replace anything.
pattern = {'[\s]{3,}\d{1,}\n' : '',
'\n' : ' '
}
This version, without raw strings, replaces only \n, and the output looks like:
THE SONNETS 1 From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decease, His
The regular expression in the dictionary is not recognized as one; I think it is being treated as a literal string rather than a regular expression. How can I specify a regular expression in the dictionary? The answers I have found on Stack Overflow, like the one provided for this question, use plain strings rather than regular expressions.
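The literal-string suspicion can be checked with a minimal sketch. It assumes a made-up sample string containing a real newline character (not read from the file) and reuses the raw-string dictionary from above. re.escape backslash-escapes the regex metacharacters, so the joined pattern can only match the dictionary keys as literal text:

```python
import re

# Same raw-string dictionary as in the question
pattern = {r'[\s]{3,}\d{1,}\n': '',
           r'\n': ' '}

# Hypothetical sample shaped like the file contents (real newline, 4 spaces)
sample = "she hies,    1189\nAnd yokes"

# re.escape turns every metacharacter into a literal, e.g. r'\n' becomes r'\\n',
# which matches a backslash followed by the letter n, not a newline
escaped = re.compile('|'.join(map(re.escape, pattern.keys())))

print(escaped.pattern)          # the escaped alternation that is actually compiled
print(escaped.findall(sample))  # empty list: nothing in the sample matches literally
```

This would also explain why the version without raw strings replaces only \n: there the key '\n' is a real newline character, which still matches itself after escaping, while the number pattern is still matched only as literal text.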
The expected outcome is
THE SONNETS From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decease, His
she hies,And yokes her silver doves; by whose swift aid Their mistress mounted through the empty skies, In her light chariot quickly is convey’d; Holding their course to Paphos, where their queen Means to immure herself and not be seen. '
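One possible way to get this outcome is sketched below, not as a definitive fix. It assumes the patterns are compiled directly (no re.escape) and applied one after another. The names rules and apply_rules and the sample strings are hypothetical, and the sample assumes at least 3 whitespace characters before each number, as described above:

```python
import re

# Ordered (compiled regex, replacement) pairs; the number rule must run
# before the newline rule, otherwise the trailing \n is already gone
rules = [
    (re.compile(r'\s{3,}\d+\n'), ''),   # whitespace run + number + newline
    (re.compile(r'\n'), ' '),           # any remaining newline
]

def apply_rules(text, rules):
    """Apply each substitution to the result of the previous one."""
    for regex, repl in rules:
        text = regex.sub(repl, text)
    return text

# Hypothetical sample shaped like the file contents
sample = "she hies,    1189\nAnd yokes her silver doves; by whose swift aid\nTheir mistress"
print(apply_rules(sample, rules))
```

Unlike Code 3, each substitution here works on the output of the previous one, so the earlier replacements are not discarded. A list of pairs is used instead of a dictionary only to make the order of application explicit.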