I'm using Python 3.3 and want to do the following to a string:
- replace accented letters such as e acute (é) and o circumflex (ô) with their base character (ô to o, for example)
- remove every character that is not alphanumeric or a space between alphanumeric characters
- convert to lowercase
This is what I have so far:
import re

# Lowercase first so that uppercase accented letters (É, Ô) are replaced as well
mystring_modified = mystring.lower().replace('\u00E9', 'e').replace('\u00F4', 'o')
alphnumspace = re.compile(r"[^a-zA-Z\d\s]")
mystring_modified = alphnumspace.sub('', mystring_modified)
How can I improve this? Efficiency is a big concern, especially since I am currently performing the operations inside a loop:
# Pseudocode
for mystring in myfile:
    mystring_modified = ...  # operations described above
    mylist.append(mystring_modified)
The files in question are about 200,000 characters each.
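For reference, here is a minimal sketch of the general behaviour I am after, assuming Unicode NFKD decomposition from the standard-library unicodedata module is acceptable for stripping accents. The names simplify and NOT_ALNUM_SPACE are made up for illustration, and the io.StringIO object only stands in for a real file. I have not measured whether this is actually faster than chained replace() calls.

```python
import io
import re
import unicodedata

# Compiled once, outside the loop; the text is already lowercased when it is applied.
NOT_ALNUM_SPACE = re.compile(r"[^a-z\d\s]")

def simplify(text):
    """Lowercase, strip accents, then drop everything except letters, digits and whitespace."""
    # NFKD splits a letter such as 'é' into 'e' plus a combining accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Discard the combining marks so that only the base characters remain.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return NOT_ALNUM_SPACE.sub("", base)

# Stand-in for an open file (assumption); any iterable of lines works the same way.
myfile = io.StringIO("Caf\u00e9 \u00d4tel, 42!\nna\u00efve r\u00e9sum\u00e9\n")
mylist = [simplify(mystring) for mystring in myfile]
print(mylist)
```

Because \s also matches newlines, the line endings are kept, just as in my current regex. Note that this sketch does not trim leading or trailing spaces.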