Is it possible to get ASCII equivalent of UTF-8 characters?

Asked Nov 02 '22 at 05:11

Active Nov 02 '22 at 05:11

Viewed 29 times

I have extracted text from a PDF using PyMuPDF, and certain characters come out as non-ASCII Unicode code points, which is annoying. For example, I get U+201D (right double quotation mark) instead of U+0022, or U+2019 (right single quotation mark) instead of U+0027. I want the ASCII characters instead, and one approach I have thought of is regex substitution or something similar. Is there a way to convert these to their ASCII counterparts?
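For a small, known set of characters, a translation table may be simpler than regex substitution. The following is only a sketch: U+201D and U+2019 come from the examples above, while U+201C and U+2018 are assumed here to be their likely companions, and the names `QUOTE_MAP` and `straighten_quotes` are made up for illustration.

```python
# Map typographic quotes to ASCII look-alikes.
# Only U+201D and U+2019 are named in the question; U+201C and U+2018
# are an assumption about which other characters need handling.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with the mapped quote characters replaced."""
    return text.translate(QUOTE_MAP)


print(straighten_quotes("\u201cIt\u2019s fine\u201d"))
```

Any character not listed in the table passes through unchanged, so the mapping can be extended one entry at a time as new characters turn up in the extracted text.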

asked Nov 02 '22 at 05:11

mhay10

So you want to convert Unicode to ASCII??? – Flow Nov 02 '22 at 05:16

Why is it annoying? You certainly can do `s.replace("\u201d",'"')` – Tim Roberts Nov 02 '22 at 05:16
@TimRoberts Sure, that handles a single character value. I think the OP is looking for a general way of translating Unicode text to ASCII. Any such translation is going to lose information (if the result is expected to be legible). – Keith Thompson Nov 03 '22 at 04:24
He needs exactly 4 translations. I didn't include them all, because I was pretty sure he could provide the others. There is no generic "Unicode text to ASCII" translation. This is a specific need for 4 specific characters. – Tim Roberts Nov 03 '22 at 06:23
@TimRoberts He said those were *examples*. If those are the only 4 characters he wants to translate, then of course it's trivial. If they really are just examples, the problem can be arbitrarily difficult to solve. – Keith Thompson Nov 04 '22 at 04:13
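To illustrate the information-loss point raised in the comments, here is a rough sketch of one common general-purpose fallback: decompose the text with `unicodedata.normalize("NFKD", ...)` and drop whatever still cannot be encoded as ASCII. The function name `to_ascii_lossy` is hypothetical, and this is not presented as a complete solution.

```python
import unicodedata


def to_ascii_lossy(text: str) -> str:
    """Best-effort ASCII conversion; characters with no ASCII form are dropped."""
    # NFKD splits accented letters into a base letter plus combining marks
    # and expands compatibility characters such as ligatures.
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (combining marks, curly quotes, ...) is discarded.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_lossy("caf\u00e9"))        # accent is stripped
print(repr(to_ascii_lossy("\u201d")))     # curly quote has no decomposition, so it vanishes
```

Note that the curly quotes from the question have no decomposition, so this approach silently deletes them rather than turning them into `"` or `'`. An explicit mapping for such characters would still be needed before (or instead of) the lossy step.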

0 Answers