
I am aware that there are a number of answers to this already, but none of them work for me. I'm not sure why.

Quick outline of the problem:

  1. I'm getting the data from BigQuery and streaming it into a template to create a 1.1 GB XML data feed.
  2. The data contains non-UTF-8 characters in some fields. They stream fine, but they mean that the XML cannot be parsed.
  3. I tried using an XML writer for the feed, which means the affected lines get skipped. The problem is that the XML DOM is huge and takes up too much RAM.

Some approaches I tried:

my_string.encode('ascii', 'ignore').decode('utf-8', 'ignore')

This leaves character codes like \x0bA and \x00.

import string
''.join(x for x in my_string if x in string.printable)

Same as above.

import re
re.sub(r'[^\x00-\x7F]+',' ', my_string)

Removes every character.
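
A quick check, sketched under the assumption that the offending characters are the `\x0b` and `\x00` mentioned above, suggests why the first two attempts leave `\x0b` in place: it is an ASCII character, and Python counts it as whitespace in `string.printable`. The `sample` string is a made-up stand-in, not the real feed data.

```python
import string

# Hypothetical stand-in for a field from the feed (not the real data).
sample = "RAL 9010.\x0bAll units\x00"

# Both control characters are ASCII, so an ASCII round-trip keeps them.
ascii_round_trip = sample.encode("ascii", "ignore").decode("utf-8", "ignore")

# string.printable includes string.whitespace, which contains "\x0b",
# so the vertical tab survives this filter; "\x00" does not.
printable_only = "".join(ch for ch in sample if ch in string.printable)

print(repr(ascii_round_trip))
print(repr(printable_only))
print("\x0b" in string.printable)
```

Neither filter targets control characters as such, so a character like `\x0b` passes through both.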

I do not care about losing 5-6 characters in a data feed of 1.2 GB. So how can I remove them?

Here is a sample string: "White coated paint finish RAL 9010.All units manufactured with reinforced doors." The offending character, a note symbol, sits between "9010." and "All" but is not displayed here. These seem to be characters that cannot be represented in UTF-8.

dengar81
  • What exactly *are* "non-UTF-8 chars"? Byte sequences which do not decode when interpreted as UTF-8? Are they malformed byte sequences? Or perhaps simply not meant to be UTF-8 in the first place? – deceze Feb 01 '21 at 12:59
  • Can you also share with us the string you have and your desired string? – Moinuddin Quadri Feb 01 '21 at 13:01
  • I tried, but the character isn't displayed here. It seems Stack Overflow is parsing them out and discarding them, as I'd like to do. There is a NOTE symbol between the sentences. This is expressed as `\x0b` in the command line. – dengar81 Feb 01 '21 at 13:05
  • U+000B is a Line Tabulation/Vertical Tab. It is a character that is part of ASCII as well (which is why your transformations didn't get rid of it). It just happens to be one of the rare control characters that almost no one uses. – Joachim Sauer Feb 01 '21 at 13:39
  • Why does XML have an issue with it @JoachimSauer? – dengar81 Feb 01 '21 at 13:49
  • Please share a [mcve]. At least `my_string.encode('ascii','backslashreplace')`. – JosefZ Feb 01 '21 at 14:53
  • @dengar81: these control characters are explicitly not allowed in XML content. See [this question](https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0). – Joachim Sauer Feb 01 '21 at 15:32
  • Any idea how I can get rid of them? I don't like them, I don't care for them, I don't want them! AFAIC, the world would be better without them! – dengar81 Feb 02 '21 at 14:58
  • The problem with giving this example is that it's been stripped out by Stackoverflow, @JosefZ. You can see a similar issue here: https://www.w3schools.com/python/trypython.asp?filename=demo_ref_string_encode – dengar81 Feb 02 '21 at 15:00
  • I don't understand what is _stripped out_. Try my [`encodeuni` function](https://stackoverflow.com/a/65968666/3439404). – JosefZ Feb 02 '21 at 15:11
  • What I meant is that Stackoverflow removed the offending character from my example. I cannot post an example on Stack, as the character is U+266A – dengar81 Feb 02 '21 at 16:16
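
Following the comments above about control characters not being allowed in XML content, one possible approach is to strip every character that falls outside the XML 1.0 `Char` production before the value reaches the template. This is a minimal sketch, not a tested fix for this feed: the function name `strip_invalid_xml_chars` and the sample string are hypothetical, and it assumes each field is already a Python `str`.

```python
import re

# Anything outside the XML 1.0 "Char" production is matched:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str, replacement: str = "") -> str:
    """Return text with characters that XML 1.0 does not allow replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# Hypothetical field value containing a vertical tab and a NUL byte.
sample = "RAL 9010.\x0bAll units manufactured with reinforced doors.\x00"
cleaned = strip_invalid_xml_chars(sample)
print(repr(cleaned))
```

Because the filter works on one string at a time, it could be applied per field while streaming, without building the whole document in memory. Tab, newline, carriage return and ordinary non-ASCII characters such as U+266A are left untouched.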
