
I have a CSV file containing Arabic data, and I need to view it in Jupyter using Python and pandas, but I have a problem with the encoding. What should I do? Any ideas? This is my code

And this is the error

  • Welcome to SO. Please refrain from posting images of your code. Instead type it in a code block in the question itself. – Zero May 04 '22 at 08:19
  • Please post the code and error. It could be that the file is using an older style windows encoding or even UTF-16. You could read and post a sample in bytes `print(open("thefile", "rb").read(100))` and that would help us guess. Try "UTF-16", and "cp720". You can look through the list here: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers and put "cp" on the front of them to see what works. – tdelaney May 05 '22 at 23:44
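The trial-and-error approach suggested in the comment above could be sketched roughly as follows, using only the standard library. The helper name `try_encodings` and the candidate list are assumptions made for illustration, and the in-memory `sample` stands in for bytes read from the real file with `open("thefile", "rb").read()`.

```python
def try_encodings(raw, candidates=("utf-8", "utf-8-sig", "utf-16",
                                   "cp1256", "cp720", "iso8859_6")):
    """Return {encoding: decoded_text} for each candidate that decodes cleanly."""
    decoded = {}
    for name in candidates:
        try:
            decoded[name] = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: this Python build does not know the codec name.
            continue
    return decoded


if __name__ == "__main__":
    # Stand-in for the real file contents (assumption: Windows Arabic, cp1256).
    sample = "مرحبا بالعالم".encode("cp1256")
    for name, text in try_encodings(sample).items():
        # Several codecs may "succeed"; only a human can tell which text is right.
        print(f"{name}: {text}")
```

A clean decode does not prove the encoding is correct, since single-byte code pages accept almost any byte sequence, so the output still has to be checked by eye. Once one candidate shows readable Arabic, that name can be passed to pandas as `pd.read_csv(path, encoding=name)`.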

3 Answers


You might need to re-save the original CSV file, as described in this linked question: "CSV file with Arabic characters is displayed as symbols in Excel".

Whereismywall

I have never encountered a problem like this before, but it seems to be a decoding problem. Check this question; it might help.

What you can always do is check the encoding of your file; maybe it is something other than 'utf-8'. The following code will help you do this:

from bs4 import UnicodeDammit

filename="absolute_path_of_your_file"

with open(filename, "rb") as file:
    content = file.read()

suggestion = UnicodeDammit(content)
# Print the detected encoding (a best guess, not a guarantee)
print(suggestion.original_encoding)

The output will be the detector's best guess at the encoding of your file. I hope it helps and that I've correctly understood your problem.
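As a dependency-free complement to a detection library, the first few bytes of the file can be checked for a byte order mark (BOM), which would point to UTF-16, UTF-32, or UTF-8 with a signature. This is only a sketch: the function name `sniff_bom` is hypothetical, and a missing BOM says nothing either way, since many files have none.

```python
import codecs

# Longer signatures come first, because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head):
    """Return a Python codec name if `head` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

It could be called as `sniff_bom(open(filename, "rb").read(4))`; a non-None result is a codec name that can be tried directly when opening the file.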

Paschalis Ag
  • Thank you for your answer but the output is "Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER. utf-8", and still the result is incorrect it cannot read arabic characters – Malek Hedhli May 05 '22 at 13:13

Given that the decoder fails at character 15, I wonder if your file is correctly encoded as UTF-8, but then has some improperly encoded byte(s).

I was just dealing with this myself when a program improperly handled the RIGHT SINGLE QUOTATION MARK and wrote some corrupted UTF-8. I didn't know this till I ran it through a process in Python and got an error similar to yours:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 16: invalid continuation byte

Here's the bad text, which I've copied from VSCode:

Community Member�s monthly

So VSCode and Python cannot make sense of it. But everything else is obviously fine.

Here's a small-ish test you can run on your input file to see if the first 15-or-so characters are properly encoded as UTF-8. It uses an incremental decoder to build up a Unicode character byte-by-byte.

s starts as None; once enough valid bytes have been passed in, the decoder returns the complete character. buffer keeps track of the bytes that have been read but not yet decoded:

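import codecs
import sys

# Incremental decoder: accepts one byte at a time and returns a character
# only once a complete UTF-8 sequence has been supplied.
decoder = codecs.getincrementaldecoder("utf-8")()

with open("bad.txt", "rb") as f:
    s = None
    start = 0  # byte offset where the current character begins
    b = f.read(1)
    buffer = [b]  # bytes read but not yet decoded

    while b:
        try:
            s = decoder.decode(b, final=False)
        except UnicodeDecodeError:
            print(f"{start}-{start+len(buffer)}: error, could not decode byte(s) {buffer}")
            sys.exit(1)

        # A non-empty string means the buffered bytes formed a complete character.
        if s:
            print(f"{start}-{start+len(buffer)}: {s}")
            start += len(buffer)
            buffer = []

        b = f.read(1)
        buffer.append(b)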

When I run that on the sample text I included earlier, I get:

0-1: C
1-2: o
2-3: m
3-4: m
4-5: u
5-6: n
6-7: i
7-8: t
8-9: y
9-10:  
10-11: M
11-12: e
12-13: m
13-14: b
14-15: e
15-16: r
16-18: error, could not decode byte(s) [b'\xd5', b's']

If you do see valid Arabic text before the error, then you'll need to manually pull out the bad characters and try again. Or, find the source and see if you can get it re-encoded properly.
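If there may be more than one bad byte, stopping at the first error gets tedious. One possible way to list every undecodable span in a single pass is sketched below; it relies on the `start` and `end` attributes of `UnicodeDecodeError`. The function name `find_bad_spans` is made up for this example, and the sample bytes are an assumed reconstruction of the corrupted text shown earlier.

```python
def find_bad_spans(raw, encoding="utf-8"):
    """Return a list of (start, end, bad_bytes) for each undecodable span in raw."""
    spans = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice, so shift them by pos.
            start, end = pos + err.start, pos + err.end
            spans.append((start, end, raw[start:end]))
            pos = end  # resume right after the offending bytes
    return spans


if __name__ == "__main__":
    # Assumed reconstruction of the corrupted sample text.
    sample = b"Community Member\xd5s monthly"
    for start, end, bad in find_bad_spans(sample):
        print(f"{start}-{end}: could not decode {bad!r}")
```

With the positions in hand, the bad bytes can be fixed at the source, or the file can be read anyway with `errors="replace"` in `open()` (recent pandas versions accept `encoding_errors="replace"` in `read_csv`), accepting a replacement character where each bad span was. As an aside, the byte 0xd5 is how the `mac_roman` codec encodes RIGHT SINGLE QUOTATION MARK, so one possible explanation for that sample (a guess, not something confirmed here) is text from two encodings being mixed in one file.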

Zach Young