7

I want to delete every "^L" character I find when I read a file. I tried calling this function on each line I read:

def cleanString(self, s):
    if isinstance(s, str):
        s = unicode(s, "iso-8859-1", "replace")
        s = unicodedata.normalize('NFD', s)
        return s.encode('ascii', 'ignore')

But it doesn't remove the character. Does anyone know how to do this?

I also tried the replace function, but that doesn't work either:

s = line.replace("\^L","")

Thanks for your answers.

Kyle Falconer
Kvasir

3 Answers

4

You probably don't have the literal characters ^ and L, but a single character that is displayed as ^L.

This would be the form feed character.

So do s = line.replace('\x0C', '').
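As a minimal sketch of this fix, assuming Python 3 (the question's code is Python 2), the form feed can be stripped from each line as it is read. The helper name `strip_form_feeds` and the sample data are hypothetical, chosen only for illustration. Note that '\x0c' and '\f' are two spellings of the same character.

```python
import io

FORM_FEED = "\x0c"  # same character as "\f"; many tools display it as ^L


def strip_form_feeds(lines):
    """Yield each line with every form feed character removed."""
    for line in lines:
        yield line.replace(FORM_FEED, "")


# Hypothetical sample data standing in for a real file object.
sample = io.StringIO("page one\n\x0cpage two\n")
cleaned = list(strip_form_feeds(sample))
print(cleaned)
```

With a real file, the same generator could be fed the open file object instead of the `io.StringIO` stand-in.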

glglgl
  • Oh, I can't believe it was that simple. Thank you, I had been stuck on this problem since this morning ;) – Kvasir Jun 18 '14 at 15:02
2

^L (codepoint 0C) is an ASCII character, so it won't be affected by an encoding to ASCII. You could filter out all control characters using a small regex (and, while you're at it, filter out everything non-ASCII as well):

import re
import unicodedata

def cleanString(self, s):
    if isinstance(s, str):
        s = unicode(s, "iso-8859-1", "replace")
        s = unicodedata.normalize('NFD', s)
        s = re.sub(r"[^\x20-\x7e]+", "", s)  # remove non-ASCII/nonprintables (including DEL, \x7f)
        return str(s)                        # No encoding necessary
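For readers on Python 3, the same idea might look roughly like the sketch below. It is an adaptation under assumptions, not the answer's original code: the name `clean_string` is hypothetical, byte input is assumed to be ISO-8859-1 as in the question, and the character range stops at \x7e so that DEL is removed too. Be aware that this filter also strips tabs and newlines, since they are control characters.

```python
import re
import unicodedata

# Matches anything outside printable ASCII (space through tilde).
NON_PRINTABLE = re.compile(r"[^\x20-\x7e]+")


def clean_string(s):
    """Return s reduced to printable ASCII; accents are dropped, base letters kept."""
    if isinstance(s, bytes):
        # Assumption: byte input is ISO-8859-1, as in the question.
        s = s.decode("iso-8859-1", "replace")
    s = unicodedata.normalize("NFD", s)  # split "é" into "e" + combining accent
    return NON_PRINTABLE.sub("", s)


print(clean_string("caf\u00e9\x0c menu"))
```

The NFD step is what lets "é" survive as a plain "e": the combining accent becomes a separate code point, and the regex then removes it along with the form feed.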
Tim Pietzcker
2

You almost had it right; you just need a different representation for ^L.

s = line.replace("\x0c", "")

Here's a function that returns the control character for any letter used in the ^X notation.

def cc(ch):
    return chr(ord(ch) & 0x1f)

>>> cc('L')
'\x0c'

Some control characters have alternate representations, the common ones being '\r' for ^M and '\n' for ^J; ^L itself can also be written '\f'. These escapes are listed in a chart in the documentation for string literals, and their names come from the ASCII control code chart.
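To illustrate the mapping between caret notation and Python's escape sequences, here is a small sketch, assuming Python 3. It reuses the `cc` function from above; the list of letter/escape pairs is only an illustrative selection.

```python
def cc(ch):
    """Return the control character written as ^ch in caret notation."""
    return chr(ord(ch) & 0x1f)


# Caret-notation letters paired with the escape Python provides for the same character.
pairs = [("L", "\f"), ("M", "\r"), ("J", "\n"), ("I", "\t")]
for letter, escape in pairs:
    print("^" + letter, repr(cc(letter)), cc(letter) == escape)
```

Masking with 0x1f keeps only the low five bits of the letter's code, which is how the ASCII control codes line up with the letters @, A through Z, and a few punctuation marks.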

Mark Ransom