If you need to do this kind of characters "normalisation" you may consider implementing a codec for the Codec registry.
The implementation is similar as the one proposed by @RomanPerekhrest with a table of substitution characters.
Implementing a codec
Import the codecs
module, give a name to your codec (avoid existing names).
Create the encoding table (the one you'll use when you do u"something".encode(...)
:
import codecs
NAME = "normalize"
_ENCODING_TABLE = {
u'\u2002': u' ',
u'\u2003': u' ',
u'\u2004': u' ',
u'\u2005': u' ',
u'\u2006': u' ',
u'\u2010': u'-',
u'\u2011': u'-',
u'\u2012': u'-',
u'\u2013': u'-',
u'\u2014': u'-',
u'\u2015': u'-',
u'\u2018': u"'",
u'\u2019': u"'",
u'\u201a': u"'",
u'\u201b': u"'",
u'\u201c': u'"',
u'\u201d': u'"',
u'\u201e': u'"',
u'\u201f': u'"',
}
The table above can "normalize" spaces, hyphens, quotation marks.
This is where normalisation rules go…
Then, implement the function used to normalize your string:
def normalize_encode(input, errors='strict'):
output = u''
for char in input:
output += _ENCODING_TABLE.get(char, char)
return output, len(input)
You can also implement the decoding, but you need to reverse the _ENCODING_TABLE
,
the best practice is to prepare the reversed table and fill the missing characters later.
_DECODING_TABLE = {v: k for k, v in _ENCODING_TABLE.items()}
# missing characters...
def normalize_decode(input, errors='strict'):
output = u''
for char in input:
output += _DECODING_TABLE.get(char, char)
return output, len(input)
Now, everything is ready, you can implements the codec protocol:
class Codec(codecs.Codec):
def encode(self, input, errors='strict'):
return normalize_encode(input, errors)
def decode(self, input, errors='strict'):
return normalize_decode(input, errors)
class IncrementalEncoder(codecs.IncrementalEncoder):
def encode(self, input, final=False):
assert self.errors == 'strict'
return normalize_encode(input, self.errors)[0]
class IncrementalDecoder(codecs.IncrementalDecoder):
def decode(self, input, final=False):
assert self.errors == 'strict'
return normalize_decode(input, self.errors)[0]
class StreamWriter(Codec, codecs.StreamWriter):
pass
class StreamReader(Codec, codecs.StreamReader):
pass
def getregentry():
return codecs.CodecInfo(name=NAME,
encode=normalize_encode,
decode=normalize_decode,
incrementalencoder=IncrementalEncoder,
incrementaldecoder=IncrementalDecoder,
streamreader=StreamReader,
streamwriter=StreamWriter)
How to register the newly created codec?
If you have several normalisation codecs, the best practice is to gather
them in the __init__.py
file of a dedicated package
(for instance: my_app.encodings
.
# -*- coding: utf-8 -*-
import codecs
import normalize
def search_function(encoding):
if encoding == normalize.NAME:
return normalize.getregentry()
return None
# Register the search_function in the Python codec registry
codecs.register(search_function)
Whenever you need your codec, you write:
import my_app.encodings
normalize = my_app.encodings.normalize.NAME
def my_function():
normalized = my_string.encode(normalize)