0

I am trying to clean up this string and strip out the non-ASCII characters:

import re
text = '<phone_number><![CDATA[0145236243 <0x0C><0x05><0x4> ]]>'
clean = re.sub('[^\x00-\x7f]', '', text)

This does not seem to do the job properly. Does anyone have a proper solution? I have also uploaded a file in case Stack Overflow has reformatted the non-ASCII characters.

  • what is the expected output? – mad_ Sep 27 '18 at 13:45
  • something like this text = '<![CDATA[07744454]]>' – Jonas Amara Sep 27 '18 at 13:46
  • Possible duplicate of [How can I remove non-ASCII characters but leave periods and spaces using Python?](https://stackoverflow.com/questions/8689795/how-can-i-remove-non-ascii-characters-but-leave-periods-and-spaces-using-python) – mad_ Sep 27 '18 at 13:48
  • All the characters in your example are [ASCII](http://www.asciitable.com/) characters. – Gsk Sep 27 '18 at 13:51
  • You don't have any non-ASCII characters in your text; you just have letters and digits. Also, your expected output contains contact_number where it should be phone_number, but I assume that is a typo. – mad_ Sep 27 '18 at 13:52
  • This is what I meant; Stack Overflow stripped out the non-ASCII characters: text = '<![CDATA[0145236243 <0x0C><0x05><0x4> ] ]>' – Jonas Amara Sep 27 '18 at 13:53
  • Ok did you have a look at the link I provided? Did it help? – mad_ Sep 27 '18 at 13:55
  • Yes, I did, but it seems that it wasn't the same problem. I have tried the solution, but it is still not working. – Jonas Amara Sep 27 '18 at 13:58

2 Answers

1

This is not a very generic solution, but it might work for you:

''.join(i for i in text.split() if '<0x' not in i)  # '<phone_number><![CDATA[0145236243]]></phone_number>'

Using regex

re.sub(r'(<0x\w*>)|\s', '', text)  # '<phone_number><![CDATA[0145236243]]></phone_number>'
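Both snippets above assume the `<0x..>` markers are literal text in the string. If they instead stand for real control characters (form feed 0x0C and so on, which is only an assumption about the original data), note that those are still ASCII, so `[^\x00-\x7f]` never matches them. A minimal sketch that keeps only printable ASCII and then drops whitespace; the sample string and the name `strip_control` are made up for illustration:

```python
import re

def strip_control(text):
    """Remove everything outside printable ASCII, then remove whitespace."""
    # \x20-\x7e is the printable ASCII range; control characters such as
    # \x0c, \x05 and \x04 fall outside it, as do all non-ASCII characters.
    printable = re.sub(r'[^\x20-\x7e]', '', text)
    # Drop the remaining spaces, as in the regex above.
    return re.sub(r'\s', '', printable)

# Hypothetical input: the markers replaced by the actual control characters.
sample = '<phone_number><![CDATA[0145236243 \x0c\x05\x04 ]]></phone_number>'
print(strip_control(sample))  # <phone_number><![CDATA[0145236243]]></phone_number>
```

If the spaces need to be kept, the second `re.sub` can simply be left out.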
mad_
0

The question "Regular expression that finds and replaces non-ascii characters with Python" also has a similar solution for non-ASCII characters.

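You can also try str.encode() and bytes.decode() for this purpose: encoding to ASCII with errors='ignore' drops the characters that cannot be encoded, while errors='replace' substitutes a placeholder that you can then replace.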
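As a rough sketch of the str.encode() / decode() idea, assuming the goal is to drop or mark anything outside ASCII (the function names and the sample string are hypothetical, chosen only for illustration):

```python
def drop_non_ascii(text):
    # errors='ignore' silently skips characters that ASCII cannot encode.
    return text.encode('ascii', errors='ignore').decode('ascii')

def mark_non_ascii(text):
    # errors='replace' substitutes '?' for each unencodable character.
    return text.encode('ascii', errors='replace').decode('ascii')

sample = 'caf\u00e9 0145236243\u2022'
print(drop_non_ascii(sample))  # caf 0145236243
print(mark_non_ascii(sample))  # caf? 0145236243?
```

Note that this only affects characters above 0x7F. ASCII control characters such as \x0c pass through unchanged, so they would still need a separate step.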