I found a solution to a similar question in another topic, but unfortunately it doesn't work for me. Here is my problem:
I'm building a DataFrame from surrogate-pair Unicode escapes that I'd like to search for in another file (examples: "\uD83C\uDFF3", "\u26F9", "\uD83C\uDDE6\uD83C\uDDE8"):
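import pandas as pd

# Each line of unicodes.csv holds one escape sequence; load them into column "xy".
with open("unicodes.csv", "rt") as csvfile:
    emoticons = pd.read_csv(csvfile, names=["xy"])
emoticons = pd.DataFrame(emoticons)
emoticons = emoticons.astype(str)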
Next I read my text file, where some lines contain surrogate-pair escapes:
for chunk in pd.read_csv(path, names=["xy"], encoding="utf-8", chunksize=chunksize):
    spam = pd.DataFrame(chunk)
    spam = spam.astype(str)
Inside this for loop I check whether any line contains a surrogate-pair escape and, if so, I'd like to print it as an emoji. That's why I encode and decode the value i, which is a str (solution from: How to work with surrogate pairs in Python?):
    for i in emoticons.xy:
        if spam["xy"].str.contains(i, regex=False).any():
            print(i.encode('utf-16', 'surrogatepass').decode('utf-16'))
#printing:
#\uD83C\uDFF3
#\u26F9
#\uD83C\uDDE6\uD83C\uDDE8
So when I run the program, it still prints the surrogate-pair escape as plain text, not as an emoji. But when I pass a surrogate-pair literal to print myself, it works:
print("\uD83C\uDFF3".encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass"))
#printing:
#🏳
What am I doing wrong? I tried converting i to a string first, as well as other solutions, but it still doesn't work.
EDIT:
hexdump -C file.csv
00004b70  5c 75 44 38 33 44 5c 75  44 45 45 39 0a 5c 75 44  |\uD83D\uDEE9.\uD|
00004b80  38 33 44 5c 75 44 45 45  42 0a 5c 75 44 38 33 44  |83D\uDEEB.\uD83D|
00004b90  5c 75 44 45 45 43 0a 5c  75 44 38 33 44 5c 75 44  |\uDEEC.\uD83D\uD|
00004ba0  43 42 41 0a 5c 75 44 38  33 44 5c 75 44 45 38 31  |CBA.\uD83D\uDE81|
EDIT2: I've found something that sort of works but still needs improvement: https://stackoverflow.com/a/54918256/4789281
The text from my other file, which I want to convert, looks like this:
"O żółtku zapomniałaś \uD83D\uDE02"
"Piękny outfit \uD83D\uDE0D"
When I do what was recommended in the other topic:
print(codecs.decode(i,encoding='unicode_escape',errors='surrogateescape').encode('utf-16', 'surrogatepass').decode('utf-16'))
I get something like this:
O żóÅtku zapomniaÅaÅ
PiÄkny outfit
So my surrogate pairs are replaced, but my Polish characters are turned into something strange.
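One possible way around this, as a sketch rather than a definitive fix: as far as I can tell, unicode_escape treats the non-ASCII characters as Latin-1 bytes, which is what mangles the Polish letters. Replacing only the literal \uXXXX sequences with a regex should leave everything else alone. The helper name unescape_surrogates is hypothetical, and the sketch assumes every escape is exactly \u plus four hex digits and that surrogates always come in well-formed pairs:

```python
import re

# Matches one literal escape: backslash, 'u', exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_surrogates(text):
    """Hypothetical helper: convert literal escapes, keep other text as-is."""
    # Turn each escape into the code point it names; Polish letters are untouched.
    replaced = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Merge surrogate pairs into single characters via a UTF-16 round trip.
    # Assumption: pairs are well formed; a lone surrogate would raise here.
    return replaced.encode("utf-16", "surrogatepass").decode("utf-16")


# Same content as one line of my text file (backslashes are literal).
line = "O żółtku zapomniałaś \\uD83D\\uDE02"
print(unescape_surrogates(line))
```

The same function could be applied to each value of emoticons.xy before printing, since those values are stored in the same literal form.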