I have this string extracted from a file:

my_string = '\x01\x00\x0e\x00\xff\xff\xffPepe A\x00\xc4\x93\x00\x00100000\x00\xff\xff\xffNu\xf1ez Jim\xe9nez\x00\xf41\x00'

I need to clean that string by removing every character that is not alphanumeric or a blank, so that I end up with this list:

['Pepe A','100000','Nuñez Jiménez','1']

So far I have tried the following code:

split_string = re.split(r'[\x00-\x0f]', my_string)
result_list = filter(None, split_string)

But I do not get the result I need. Could someone give me some idea? I'm using Python.

Dayana
  • Possible duplicate of [Stripping everything but alphanumeric chars from a string in Python](https://stackoverflow.com/questions/1276764/stripping-everything-but-alphanumeric-chars-from-a-string-in-python) – Sohaib Farooqi Mar 07 '18 at 14:45
  • The problem is you have decided that you want to see some characters in the range \x7f to \xff (for example, you want to have \xe9 interpreted as é) but not others (for example, you don't want to have \xf4 interpreted as ô or \xff as ÿ). You're going to have to decide which characters in the ISO 8859-1 encoding are ones you want to see, and which you want to regard as garbage. That's something that can't be done automatically. – BoarGules Mar 07 '18 at 15:07

1 Answer

Something like this will get you close:

Code:

re.split(r'ÿÿÿ|AÄ|ô', ''.join(ch for ch in my_string if ch.isalnum() or ch == ' '))

Test Code:

import re

my_string = '\x01\x00\x0e\x00\xff\xff\xffPepe A\x00\xc4\x93\x00\x00100000' \
            '\x00\xff\xff\xffNu\xf1ez Jim\xe9nez\x00\xf41\x00'

print(re.split(r'ÿÿÿ|AÄ|ô', ''.join(ch for ch in my_string
                                    if ch.isalnum() or ch == ' ')))

Results:

['', 'Pepe ', '100000', 'Nuñez Jiménez', '1']
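
Note that the split above consumes the trailing 'A' of 'Pepe A' and leaves an empty first element. A possible variation, sketched here under the assumption that ÿ, Ä and ô are always noise in this file (the `GARBAGE` set and the `keep` helper are hypothetical names, not part of the original code), is to mark every unwanted character as a separator and then split on that separator:

```python
my_string = ('\x01\x00\x0e\x00\xff\xff\xffPepe A\x00\xc4\x93\x00\x00100000'
             '\x00\xff\xff\xffNu\xf1ez Jim\xe9nez\x00\xf41\x00')

# Assumption: these letters are garbage in this particular file.
# They pass str.isalnum(), so they have to be excluded explicitly.
GARBAGE = set('\xff\xc4\xf4')  # ÿ, Ä, ô


def keep(ch):
    """Return True for characters that belong in a field."""
    return (ch.isalnum() or ch == ' ') and ch not in GARBAGE


# Replace every unwanted character with NUL, then split on NUL
# and drop the empty strings left by runs of separators.
marked = ''.join(ch if keep(ch) else '\x00' for ch in my_string)
fields = [field for field in marked.split('\x00') if field]

print(fields)  # ['Pepe A', '100000', 'Nuñez Jiménez', '1']
```

Which characters go into `GARBAGE` still has to be decided by hand for the data at hand; a name that legitimately contains ÿ, Ä or ô would be cut at that letter.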
Stephen Rauch