I have extracted text from a PDF using PyMuPDF, and certain characters come out as non-ASCII Unicode punctuation instead of their ASCII equivalents, which is annoying. For example, I get U+201D instead of U+0022, or U+2019 instead of U+0027. I want the ASCII characters, and one approach I have thought of is regex substitution or something similar. Is there a way to change these characters to their ASCII counterparts?

More visual examples: [image]

mhay10
  • So you want to convert Unicode to ASCII??? – Flow Nov 02 '22 at 05:16
  • Why is it annoying? You certainly can do `s.replace("\u201d",'"')` – Tim Roberts Nov 02 '22 at 05:16
  • @TimRoberts Sure, that handles a single character value. I think the OP is looking for a general way of translating Unicode text to ASCII. Any such translation is going to lose information (if the result is expected to be legible). – Keith Thompson Nov 03 '22 at 04:24
  • He needs exactly 4 translations. I didn't include them all, because I was pretty sure he could provide the others. There is no generic "Unicode text to ASCII" translation. This is a specific need for 4 specific characters. – Tim Roberts Nov 03 '22 at 06:23
  • @TimRoberts He said those were *examples*. If those are the only 4 characters he wants to translate, then of course it's trivial. If they really are just examples, the problem can be arbitrarily difficult to solve. – Keith Thompson Nov 04 '22 at 04:13
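
Both readings in the comments above can be sketched in a few lines. This is a minimal sketch, not a definitive solution. It assumes the four characters in question are the curly single and double quotes (U+2018, U+2019, U+201C, U+201D); the thread never lists them, so that mapping is a guess. The names `QUOTE_MAP`, `to_ascii_quotes`, and `to_ascii_lossy` are made up for illustration. The first function handles the specific case with `str.translate`, which avoids chaining several `replace` calls. The second is a best-effort general fallback that silently drops anything it cannot map, which is the information loss mentioned above.

```python
import unicodedata

# Assumed mapping: the four "smart quote" characters. Extend it with
# whatever other punctuation actually shows up in the extracted text.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def to_ascii_quotes(text: str) -> str:
    """Replace curly quotes with their ASCII counterparts in one pass."""
    return text.translate(QUOTE_MAP)


def to_ascii_lossy(text: str) -> str:
    """Best-effort general fallback (lossy).

    Maps the known quotes first, then decomposes accented letters with
    NFKD and drops every character that is still not ASCII.
    """
    text = to_ascii_quotes(text)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    sample = "\u201cIt\u2019s fine,\u201d she said."
    print(to_ascii_quotes(sample))
```

Note that NFKD normalization alone does not turn curly quotes into straight ones, because they have no compatibility decomposition. That is why the explicit table runs first; without it, the lossy fallback would simply delete them.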
