
I'm using Python 3.7. How do I remove all non-UTF-8 characters from a string? I tried using `lambda x: x.decode('utf-8','ignore').encode("utf-8")` in the code below

coop_types = map(
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
    filter(None, set(d['type'] for d in input_file))
)

but this is resulting in the error ...

Traceback (most recent call last):
  File "scripts/parse_coop_csv.py", line 30, in <module>
    for coop_type in coop_types:
  File "scripts/parse_coop_csv.py", line 25, in <lambda>
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
AttributeError: 'str' object has no attribute 'decode'

If you have a generic way to remove all non-UTF8 chars from a string, that's all I'm looking for.

Dave
  • You need to first *encode* `x`, *then* decode it. `str.encode` takes a Unicode string and produces a UTF-8 encoding of it. `bytes.decode` takes a bytes object and attempts to interpret it as encoded text to produce a `str` object. – chepner Jan 28 '20 at 16:19
  • Can you give an example of what would be a non-UTF-8 character in an instance of `str`? Do you mean surrogate code points? – lenz Jan 28 '20 at 17:28

1 Answer


You're starting with a string. You can't decode a `str` (it's already decoded text; you can only encode it to binary data again). UTF-8 can encode almost any valid Unicode text (which is what `str` stores), so this shouldn't come up much, but if you're encountering surrogate characters in your input, you can just reverse the direction, changing:

x.decode('utf-8','ignore').encode("utf-8")

to:

x.encode('utf-8','ignore').decode("utf-8")

where you encode any UTF-8 encodable thing, discarding the unencodable stuff, then decode the now clean UTF-8 bytes.
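
As a minimal sketch (assuming, as in the question, that `input_file` is an iterable of dicts with a `'type'` key), the fixed pipeline would look like:

coop_types = map(
    # encode with errors='ignore' to drop anything UTF-8 can't represent
    # (e.g. lone surrogates), then decode the clean bytes back to str
    lambda x: x.encode('utf-8', 'ignore').decode('utf-8'),
    filter(None, set(d['type'] for d in input_file))
)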

ShadowRanger
  • Side-note: If the problem is surrogates, you may not want to discard them; you may just need to [accept them properly (e.g. via `json.loads` or the like)](https://stackoverflow.com/q/38147259/364696) in the first place, so you never actually see them, you just see the single Unicode character they represent. – ShadowRanger Jan 28 '20 at 18:45
  • So long as you're familiar with your input data and the outcome of losing chars beyond byte 127, this is a great choice - perhaps one of the simplest I've found on this topic. Good job, @ShadowRanger – nate May 03 '21 at 22:25
  • @NathanBenton: To be clear, this doesn't lose all characters beyond byte 127 (if you used `'ascii'` as the encoding instead of `'utf-8'` it would). UTF-8 handles all normal Unicode ordinals, just not [high-low surrogates](https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates) (a UTF-16 thing that doesn't apply to UTF-8); there's a short demo of the difference after these comments. – ShadowRanger May 03 '21 at 22:55
  • got it - thank you for the feedback and correction, @ShadowRanger – nate May 03 '21 at 22:57
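
To illustrate the `'ascii'` vs `'utf-8'` distinction from the comments above, here's a quick demo; the sample string (a lone surrogate `'\udcff'` plus some accented characters) is purely illustrative:

# Hypothetical sample: accented characters plus a lone surrogate
s = 'café \udcff naïve'

# UTF-8 with errors='ignore' drops only the lone surrogate
print(s.encode('utf-8', 'ignore').decode('utf-8'))   # café  naïve

# ASCII with errors='ignore' also drops every character above byte 127
print(s.encode('ascii', 'ignore').decode('ascii'))   # caf  nave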