I found a solution to a similar question in another thread, but unfortunately it doesn't work for me. Here is my problem:

I'm building a DataFrame from Unicode surrogate-pair escapes that I'd like to search for in another file (examples: "\uD83C\uDFF3", "\u26F9", "\uD83C\uDDE6\uD83C\uDDE8"):

    with open("unicodes.csv", "rt") as csvfile:                                        
        emoticons = pd.read_csv(csvfile, names=["xy"])
        emoticons = pd.DataFrame(emoticons)
        emoticons = emoticons.astype(str)

Next, I read my text file, in which some lines contain surrogate-pair escapes:

    for chunk in pd.read_csv(path, names=["xy"], encoding="utf-8", chunksize=chunksize):            
        spam = pd.DataFrame(chunk)
        spam = spam.astype(str)

In this for loop I check whether a line contains a surrogate-pair escape and, if it does, I'd like to print that escape as an emoji. That's why I encode and decode the "i" value, which is a str (solution from: How to work with surrogate pairs in Python?):

        for i in emoticons.xy:
            if spam["xy"].str.contains(i, regex=False).any():                                 
                print(i.encode('utf-16', 'surrogatepass').decode('utf-16'))

               #printing:
               #\uD83C\uDFF3 
               #\u26F9
               #\uD83C\uDDE6\uD83C\uDDE8

So when I run the program, it still prints the surrogate-pair escapes as plain strings, not as emoji. But when I pass a surrogate pair to the print function as a literal myself, it works:

    print("\uD83C\uDFF3".encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass"))

    #printing:
    # 🏳

What am I doing wrong? I tried converting "i" to a string and several other solutions, but it still doesn't work.

EDIT:

    hexdump -C file.csv
    00004b70  5c 75 44 38 33 44 5c 75  44 45 45 39 0a 5c 75 44  |\uD83D\uDEE9.\uD|
    00004b80  38 33 44 5c 75 44 45 45  42 0a 5c 75 44 38 33 44  |83D\uDEEB.\uD83D|
    00004b90  5c 75 44 45 45 43 0a 5c  75 44 38 33 44 5c 75 44  |\uDEEC.\uD83D\uD|
    00004ba0  43 42 41 0a 5c 75 44 38  33 44 5c 75 44 45 38 31  |CBA.\uD83D\uDE81|

EDIT2: I've found something that partly works but still needs improvement: https://stackoverflow.com/a/54918256/4789281

The text from my other file, which I want to convert, looks like this:

"O żółtku zapomniałaś \uD83D\uDE02"
"Piękny outfit \uD83D\uDE0D"

When I do what was recommended in the other thread:

    print(codecs.decode(i,encoding='unicode_escape',errors='surrogateescape').encode('utf-16', 'surrogatepass').decode('utf-16'))

I get something like this:

O żóÅtku zapomniaÅaÅ 
PiÄkny outfit 

So my surrogate pairs are replaced, but my Polish characters are replaced with something strange.

maliniaki
  • Surrogates aren't allowed in UTF-8. Some implementations support them, but Python's does not; it sticks to the standard. Also, it's not usually useful to have surrogates in strings, so `"\uD83C\uDFF3"` is correctly represented as `'\U0001f3f3'`. However, if you actually have data serialised as (invalid) UTF-8 with surrogates, and you can't fix the source, then please state so clearly (e.g. show a hexdump snippet of such a region in an input document). – lenz Nov 20 '19 at 09:48
  • @lenz, so I've made something like this while printing: print(hexdump.hexdump(i.encode())) #printing: 00000000: 5C 75 44 38 33 43 5C 75 44 44 45 41 5C 75 44 38 \uD83C\uDDEA\uD8 00000010: 33 43 5C 75 44 44 46 41 3C\uDDFA None – maliniaki Nov 20 '19 at 12:08
  • Can you put this in the question? It's hard to read it like this. Also, it might be more informative to have a hexdump of the raw file (same snippet), rather than preprocessed by pandas. – lenz Nov 20 '19 at 14:59
  • @lenz, I've made hexdump on file. Some of the output is in edit of the question – maliniaki Nov 21 '19 at 08:24
  • Your file "file.csv" contains no surrogate pairs. It's plain ASCII: it contains escape sequences, ie. literally the characters `\` (backslash), `u`, `D`, `8`, `3`, `D` etc. Are there, by any chance, embedded JSON snippets inside the CSV file? Because JSON uses this notation for escaping non-ASCII characters, and it encodes characters above U+FFFF with surrogate escapes. – lenz Nov 21 '19 at 11:07
  • I'm not sure. This file contains comments from a website, and these surrogates were literally emoticons. – maliniaki Nov 21 '19 at 15:44
  • Tricky, but feasible. – jsbueno Nov 22 '19 at 16:01

1 Answer

You are on the right track. What you are trying to do breaks because what you have in your str after reading the file are not surrogate pairs. Instead, they are backslash-escaped code points for your surrogate pairs, stored as text.

That is: the sequence "5c 75 44 38 33 44" in your file is the actual ASCII characters "\uD83D" (6 characters in total), not the surrogate code point 0xD83D (which, when properly decoded along with the next surrogate, "\uDE0D", becomes a single character in your string).
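
To make the difference concrete, here is a small sketch comparing the escape text as it is stored in the file with a string that holds real surrogate code points. The variable names are only illustrative.

```python
# Six ASCII characters per escape, as stored in the CSV file (note the doubled backslashes).
literal = "\\uD83D\\uDE0D"
# Two lone surrogate code points, which is what the Python literal produces.
surrogates = "\uD83D\uDE0D"

print(len(literal))      # 12
print(len(surrogates))   # 2
# The bytes start with 5c 75 44 38 33 44, matching the hexdump in the question.
print(literal.encode("ascii").hex(" "))
print(literal == surrogates)  # False
```

This is why `str.contains` and the utf-16 round trip in the question see nothing to convert: the string holds backslashes and hex digits, not surrogates.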

Where you are on the right track: you really do have to encode this into a bytes sequence and then decode it back. What is wrong is the choice of codecs. You have to encode it using "latin1" (just to try to preserve any other non-ASCII characters in the string; this step fails if the string has code points not representable in latin1) and decode it back with the special "unicode escape" codec. At that point, both surrogates will be in the text as two characters in a Python string:

    In [16]: "\\uD83D\\uDE0D".encode("latin1").decode("unicode escape", "surrogatepass")
    Out[16]: '\ud83d\ude0d'

The bad news is that this is not really a valid str: surrogate characters should not exist by themselves in the internal representation. Instead, they should be combined into the desired final character. So trying to print it will break:

    In [19]: a = "\\uD83D\\uDE0D".encode("utf-8").decode("unicode escape")

    In [20]: print(a)
    ---------------------------------------------------------------------------
    UnicodeEncodeError                        Traceback (most recent call last)
    <ipython-input-20-bca0e2660b9f> in <module>
    ----> 1 print(a)

    UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

Using the "surrogatepass" error policy here will not help: you will get an unprintable byte sequence.
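
As a rough illustration of that point, the sketch below encodes the two lone surrogates to UTF-8 with and without "surrogatepass" and compares the result with the proper encoding of the emoji. It uses the canonical codec spelling "unicode_escape".

```python
a = "\\uD83D\\uDE0D".encode("latin1").decode("unicode_escape")

# Strict UTF-8 refuses lone surrogates.
try:
    a.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)   # strict: surrogates not allowed

# "surrogatepass" encodes each surrogate separately as three bytes.
raw = a.encode("utf-8", "surrogatepass")
print(raw)   # b'\xed\xa0\xbd\xed\xb8\x8d'

# A correct UTF-8 encoding of U+1F60D is four bytes instead.
print("\U0001F60D".encode("utf-8"))  # b'\xf0\x9f\x98\x8d'
```

The six bytes from "surrogatepass" are not valid UTF-8, so a terminal or another program reading them cannot show the emoji.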

Therefore, this has to be encoded and decoded a second time. This time, the characters you have in the text are actual surrogate code points that form valid UTF-16 once encoded. So the path now is to encode this sequence, forcing these characters through with "surrogatepass", and decode them back from utf-16, which will finally understand the surrogate pair as a single character:

    In [30]: a = "\\uD83D\\uDE0D".encode("latin1").decode("unicode escape")

    In [31]: a
    Out[31]: '\ud83d\ude0d'

    In [32]: b = a.encode("utf-16", "surrogatepass").decode("utf-16")

    In [33]: print(b)
    😍
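
The arithmetic the utf-16 decoder performs on a pair can also be checked by hand. The helper below, with the hypothetical name `combine_surrogates`, is only a sketch of the standard UTF-16 formula and is not part of the pipeline itself.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a UTF-16 high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(0xD83D, 0xDE0D)
print(hex(code_point))  # 0x1f60d

# Same result as the utf-16 round trip on the two surrogate code points.
round_trip = "\ud83d\ude0d".encode("utf-16", "surrogatepass").decode("utf-16")
print(chr(code_point) == round_trip)  # True
```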

Summarising:

You read your file as utf-8 text so that other non-ASCII characters are read correctly, encode the result as "latin1", and decode it back with "unicode escape"; this converts the human-readable "\uXXXX" sequences in your file into surrogate code points. Then you encode that to utf-16, telling Python to pass the surrogates through "as is", and decode back from utf-16:

    def decode_surrogate_ascii(txt):
        interm = txt.encode("latin1").decode("unicode escape")
        return interm.encode("utf-16", "surrogatepass").decode("utf-16")
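
One caveat with the latin1 step: it raises `UnicodeEncodeError` for text such as the Polish sample in the question, because "ż", "ł" and "ś" are not in latin1. A possible workaround, sketched here under the assumption that the only escapes in the data have the form `\uXXXX`, is to replace just those escapes with a regular expression and leave every other character untouched. The name `decode_escapes_keep_text` is hypothetical.

```python
import re

# Matches one literal \uXXXX escape (backslash, "u", four hex digits).
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_escapes_keep_text(txt: str) -> str:
    """Hypothetical variant: decode only \\uXXXX escapes, keep other characters."""
    # Step 1: turn each escape into the code point it names (possibly a lone surrogate).
    interm = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), txt)
    # Step 2: the utf-16 round trip merges surrogate pairs into single characters.
    return interm.encode("utf-16", "surrogatepass").decode("utf-16")

sample = "O żółtku zapomniałaś \\uD83D\\uDE02"

# The latin1 route cannot encode this text at all.
try:
    sample.encode("latin1")
except UnicodeEncodeError:
    print("latin1 cannot encode this text")

result = decode_escapes_keep_text(sample)
print(result == "O żółtku zapomniałaś \U0001F602")  # True
```

Note that an unpaired surrogate escape in the input would still make the final utf-16 decode fail, so malformed data needs separate handling.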

All you have to do is apply the above function to the columns of interest in your data frame:

    emoticons = emoticons.apply(lambda col: col.map(lambda item: decode_surrogate_ascii(item) if isinstance(item, str) else item))
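
Once both the emoticon list and the comment text are decoded, the search from the question becomes an ordinary substring test. The sketch below uses plain lists with made-up sample values as stand-ins for the two DataFrame columns, so it runs without pandas; the data is ASCII-only, so the latin1 step is safe here.

```python
def decode_surrogate_ascii(txt):
    # Same logic as the function in the answer above.
    interm = txt.encode("latin1").decode("unicode_escape")
    return interm.encode("utf-16", "surrogatepass").decode("utf-16")

# Hypothetical stand-ins for the "xy" columns of the two CSV files.
emoticon_escapes = ["\\uD83C\\uDFF3", "\\u26F9", "\\uD83D\\uDE0D"]
comment_lines = ["Nice outfit \\uD83D\\uDE0D", "no emoji here"]

emoticons = [decode_surrogate_ascii(e) for e in emoticon_escapes]
comments = [decode_surrogate_ascii(c) for c in comment_lines]

# Keep every emoticon that occurs in at least one comment.
found = [e for e in emoticons if any(e in c for c in comments)]
print(found == ["\U0001F60D"])  # True
```
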
jsbueno