I am aware that there are a number of answers to this already, but none of them work for me, and I'm not sure why.
Quick outline of the problem:
- I'm getting the data from BigQuery and streaming it into a template to create a 1.1GB XML data feed (rough sketch below).
- The data contains what look like non-UTF-8 characters in some fields. They stream through fine, but the resulting XML cannot be parsed.
- I tried using an XML writer for the feed instead, which skips the offending lines, but the XML DOM it builds is huge and takes up too much RAM.
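Roughly what the pipeline looks like; the table, field, and file names are placeholders, and the real template engine and XML escaping are left out:

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT sku, description FROM `project.dataset.items`")

with open("feed.xml", "w", encoding="utf-8") as out:
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n<items>\n')
    for row in rows:  # rows are streamed, so the 1.1GB feed never sits in memory
        # real code escapes the values; omitted here for brevity
        out.write(f"  <item><sku>{row.sku}</sku><desc>{row.description}</desc></item>\n")
    out.write("</items>\n")
```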
Some approaches I tried:
```python
my_string.encode('ascii', 'ignore').decode('utf-8', 'ignore')
```
This still leaves character codes like \x0b and \x00 in the output.
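A minimal repro (shortened sample; the control characters are the ones from my data):

```python
s = "White coated paint finish RAL 9010.\x0bAll units\x00"
cleaned = s.encode('ascii', 'ignore').decode('utf-8', 'ignore')
print(repr(cleaned))
# 'White coated paint finish RAL 9010.\x0bAll units\x00'
# \x0b and \x00 are valid ASCII (code points 11 and 0), so 'ignore'
# never drops them; it only strips code points above 127
```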
```python
import string
''.join(x for x in my_string if x in string.printable)
```
Same result: the \x0b survives this one too.
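Digging into why: string.printable itself contains the vertical tab, so this filter can never remove it:

```python
import string

print('\x0b' in string.printable)  # True  -- printable ends with the whitespace set
                                   #          ' \t\n\r\x0b\x0c', vertical tab included
print('\x00' in string.printable)  # False -- NUL at least would be filtered out here
```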
```python
import re
re.sub(r'[^\x00-\x7F]+', ' ', my_string)
```
This removes the non-ASCII characters, but \x0b and \x00 sit inside the \x00-\x7F range the pattern keeps, so they survive.
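Verified on the same sample:

```python
import re

s = "White coated paint finish RAL 9010.\x0bAll units\x00"
print(repr(re.sub(r'[^\x00-\x7F]+', ' ', s)))
# 'White coated paint finish RAL 9010.\x0bAll units\x00'
# the class [^\x00-\x7F] only matches code points above 127,
# so it cannot touch either control character
```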
I do not care about losing five or six characters in a feed of this size. So how can I remove them?
Here is a sample string: "White coated paint finish RAL 9010.\x0bAll units manufactured with reinforced doors." - note the \x0b between the sentences! These seem to be control characters: they encode fine in UTF-8, but XML 1.0 does not allow them.
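The direction I'm leaning towards, in case it helps frame an answer: filtering each field against the character ranges the XML 1.0 spec allows, before the value hits the template. A sketch (function name is mine; I have not run this over the full feed yet):

```python
import re

# Characters XML 1.0 allows: tab, newline, carriage return, and the ranges below;
# everything else (including \x00 and \x0b) gets dropped.
INVALID_XML_CHARS = re.compile(
    '[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def strip_invalid_xml_chars(text: str) -> str:
    return INVALID_XML_CHARS.sub('', text)

print(repr(strip_invalid_xml_chars("White coated paint finish RAL 9010.\x0bAll units\x00")))
# 'White coated paint finish RAL 9010.All units'
```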