105

I have a CSV file and I want to determine its encoding. Is there a menu option in Microsoft Excel that can help me detect it,

or do I need to use a programming language like C# or PHP to deduce it?

Vipul
  • 2,023
  • 2
  • 15
  • 12
  • Possible duplicate of http://superuser.com/questions/280603/how-to-set-character-encoding-when-opening-excel – Gandhi May 12 '16 at 05:06
  • Very closely related: [How to determine encoding table of a text file](https://stackoverflow.com/q/4255305/4518341) – wjandrea Mar 14 '23 at 19:00

11 Answers

86

You can use Notepad++ to evaluate a file's encoding without needing to write code. The detected encoding of the open file is displayed on the status bar, at the far right. The supported encodings can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the Encoding drop-down.

CamW
  • 3,223
  • 24
  • 34
  • 21
    Huh? Is this using chardet or what? This feels like a bad answer. – mlissner Sep 14 '17 at 00:13
  • 3
    Are you asking if Windows' built-in text editor, notepad, uses a Python (or potentially JS) library? I'm certain it doesn't, since it'll be written in a native compiled language like C++. Also, text encoding is based on facts, not feelings, so if you have knowledge of why this is a bad answer, by all means, raise it, but your feelings are irrelevant to the discussion. – CamW Sep 14 '17 at 08:24
  • 3
    Here's the thing: the CSV format doesn't have a way of identifying its character encoding. The format itself doesn't. So anything that opens a CSV has to guess what encoding was used to create the file. The question is, how does notepad do this? Chardet is a library that's used for this and it's based on the way browsers do it (IIRC, it was originally a C program). If notepad is guessing the encoding (as opposed to assuming all CSVs are `windows-1252` or something), it's using something like chardet. But what? We have no idea. I wouldn't recommend this approach. It's a black box. – mlissner Sep 14 '17 at 16:06
  • I'll add one more thing. Chardet (and other libraries) work by sniffing the contents of the file for binary patterns that would be common in certain encodings. If you do this programmatically, you can check the whole file for these kinds of patterns. If you do it via a GUI, it has to be fast, and you'll almost definitely limit yourself to the first X bytes of the file, which might not contain the pattern. So... I recommend using a library for this purpose, which can read the entire file, not a GUI like notepad, which would need to be performant for the user (see the sketch after this comment thread). – mlissner Sep 14 '17 at 16:09
  • 1
    The OP specifically asked for a solution that doesn't require the use of a library. Yes, it's not a perfect solution given the unknowns involved, but we do know that 1. notepad loads a lot of data, since it slows with larger files, and 2. Notepad is very mature and written by Microsoft, so it most likely does a pretty good job of detecting the encoding. On balance, I feel the solution is good enough and requires the least effort. – CamW Sep 14 '17 at 16:43
  • Loading the data into the GUI may or may not be the same thing as sniffing it for an encoding. The time taken there could be totally different. We just don't *know* how much of the file it's sniffing, or whether it's doing a good job at all. Do we even know whether it's sniffing rather than just assuming a default? The OP also suggested that a programming language might be needed; to do this correctly, it is. – mlissner Sep 14 '17 at 17:17
  • Great suggestion. Thanks! – Josh Apr 21 '20 at 14:26
  • I'm using Mac, which Notepad++ is not compatible with. The OP doesn't mention that they're on Windows, and mentioning Microsoft Excel doesn't mean they're on Windows because Excel is compatible with Mac. It should be mentioned in the answer that the solution is only meant for Windows. – SteveExdia Jun 25 '21 at 14:31
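
For reference, a minimal sketch of the whole-file sniffing approach described in the comments above, assuming the chardet library is installed (the file name is just an example):

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open('blah.csv', 'rb') as f:
    for chunk in f:           # feed the file line by line, as bytes
        detector.feed(chunk)
        if detector.done:     # the detector is confident; stop early
            break
detector.close()
print(detector.result)        # dict with 'encoding', 'confidence', 'language'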
79

On Linux systems, you can use the file command, which will report the file's encoding.

Sample:

file blah.csv

Output:

blah.csv: ISO-8859 text, with very long lines
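
If you only want the character set itself, most builds of file also support a --mime-encoding flag. A small Python wrapper sketch around it, assuming the file utility is on your PATH (Linux, macOS, or WSL):

import subprocess

# Ask file(1) for just the character encoding of the file
result = subprocess.run(['file', '--mime-encoding', 'blah.csv'],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())   # e.g. blah.csv: iso-8859-1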
Jitender Kumar
  • 909
  • 6
  • 5
  • 3
    Caution: I don't think it's very reliable. I saved a plain CSV and checked the hex codes, and it was GB2312 (my system's default encoding). But the file result is `ASCII-text`. However, maybe `ASCII-text` here stands for the **system default**, which is beyond its original meaning. – Rick May 22 '18 at 07:53
  • 1
    @Rick. Many many encodings are supersets of ASCII. From a reading of the Wikipedia article, GB2312/EUC-CN text which contains only ASCII characters is indistinguishable from ASCII. – TRiG Sep 23 '21 at 08:56
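
A quick Python demo of TRiG's point: bytes that contain only ASCII characters decode identically under any ASCII-compatible encoding, so no tool can tell which one the author intended (the sample bytes are made up):

data = b'plain,ascii,csv\n1,2,3\n'   # ASCII-only bytes

# Both comparisons print True: the decoded text is identical
print(data.decode('ascii') == data.decode('gb2312'))
print(data.decode('ascii') == data.decode('latin-1'))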
68

If you use Python, just use the print() function to check the encoding of a CSV file. For example:

with open('file_name.csv') as f:
    print(f)

The output is something like this:

<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf8'>
Alineat
  • 833
  • 6
  • 9
  • 11
    Note: this *may* not be correct 100% of the time. I have a CSV file that can be opened correctly as `utf-16`, but the approach in this answer returns `cp1252`. Trying to open it with Pandas CSV reader as `cp1252` returns a `ParserError`, so, something's amiss somewhere. – Mast Feb 12 '20 at 15:26
  • This did not work for me. The file I had was `ANSI` encoded, but the above steps gave me `cp1252` as encoding, which when I tried opening by mentioning `cp1252` as encoding failed. `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 54852: character maps to ` – Prometheus Jan 07 '21 at 00:39
  • From the python3 docs: `open(file, ..., encoding=None)`. In text mode, if encoding is not specified the encoding used is platform dependent: `locale.getpreferredencoding(False)`. The encoding of a CSV file is determined by the platform/program it was created on. If you don't know the context, `chardet` is a good start, but know that it's more than a decade old and has no support for emoticons etc. Using `encoding='utf-8'` is more robust nowadays. – Stefaan Ghysels Jul 04 '21 at 18:38
  • 1
    This answer seems wrong! It always shows `cp1252`, even though I can see in Notepad++ that the csv file is utf-8. Only when doing `with open('file_name.csv', encoding='utf-8') as f:` it does in fact show utf-8, which is not helpful. – Benji Mar 17 '22 at 13:29
  • 1
    This answer is totally wrong. The encoding Python uses to open the file *does not* necessarily correspond with the encoding of the data in the file; Python just uses **the platform's default encoding**. On Linux and I believe Mac, that's UTF-8, while on Windows, it's CP1252, hence the comments above saying this always returns `cp1252`. For more details, see the [docs on `open`](https://docs.python.org/3/library/functions.html#open). – wjandrea Mar 14 '23 at 18:51
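
To see what this answer actually reports, compare the wrapper's encoding attribute with the platform default (a small sketch on a stock CPython install, reusing the file name from the answer):

import locale

# open() without an explicit encoding falls back to the platform
# default, so the value below describes the platform, not the file
print(locale.getpreferredencoding(False))

with open('file_name.csv') as f:
    print(f.encoding)   # matches the platform default above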
20

You can also use the Python chardet library:

# install the chardet library
!pip install chardet

# import the chardet library
import chardet 

# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open("test.csv", 'rb') as file:
    print(chardet.detect(file.read()))
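
Note that detect() expects raw bytes, which is why the file is opened in 'rb' mode; it returns a dict with 'encoding', 'confidence', and 'language' keys.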
Md Kaish Ansari
  • 251
  • 2
  • 7
10

Use chardet https://github.com/chardet/chardet (documentation is short and easy to read).

Install Python, then run pip install chardet, and finally use the chardetect command-line tool.

I tested it under GB2312 and it's pretty accurate. (Make sure you have at least a few characters; a sample with only one character may easily fail.)

file is not reliable, as you can see:

(screenshot omitted)

Rick
  • 7,007
  • 2
  • 49
  • 79
  • Indeed, chardetect (the CLI tool from the python-chardet package shipped with most distros) is more reliable than file, but it can also take much longer (especially with huge files) to compute a result with a certain degree of confidence – Cyril Chaboisseau Nov 11 '19 at 22:49
  • If this takes too long you can sample part of the file to achieve a decent result, e.g. start by doing "head -n 500 bigfile.csv > smallerfile.csv" to get the top 500 lines to use as a sample. – greggles May 27 '22 at 18:02
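
The same sampling idea translates directly to Python; a minimal sketch, assuming chardet is installed and using an arbitrary 64 KB sample size:

import chardet

# Detect from a leading sample only, trading some accuracy for speed
with open('bigfile.csv', 'rb') as f:
    sample = f.read(64 * 1024)
print(chardet.detect(sample))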
6

Or you can execute this in a Python console or in a Jupyter notebook:

data = open("arch.csv", "r")
data

You will see information about the data object like this:

<_io.TextIOWrapper name='arch.csv' mode='r' encoding='cp1250'>

As you can see, it contains encoding information.

NemrodDev
  • 69
  • 1
  • 1
5

CSV files have no headers indicating the encoding.

You can only guess by looking at:

  • the platform / application the file was created on
  • the bytes in the file

In 2021, emoticons are widely used, but many import tools fail to import them. The chardet library is often recommended in the answers above, but the lib does not handle emoticons well.

icecream = '🍦'

import csv

with open('test.csv', 'w') as f:
    wf = csv.writer(f)
    wf.writerow(['ice cream', icecream])


import chardet
with open('test.csv', 'rb') as f:
    print(chardet.detect(f.read()))

{'encoding': 'Windows-1254', 'confidence': 0.3864823918622268, 'language': 'Turkish'}

This gives a UnicodeDecodeError when trying to read the file with that encoding.

The default encoding on Mac is UTF-8, so specifying it explicitly here wasn't strictly necessary, but on Windows it might be.

with open('test.csv', 'r', encoding='utf-8') as f:
    print(f.read())

ice cream,🍦

The file command also picked this up:

file test.csv
test.csv: UTF-8 Unicode text, with CRLF line terminators

My advice in 2021, if the automatic detection goes wrong: try UTF-8 before resorting to chardet.
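
That advice can be turned into a small helper. A sketch (the function name sniff_encoding is made up): try UTF-8 first, and fall back to chardet only when the bytes don't decode:

import chardet

def sniff_encoding(path):
    """Return 'utf-8' if the whole file decodes cleanly, else chardet's guess."""
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        raw.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return chardet.detect(raw)['encoding']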

Stefaan Ghysels
  • 121
  • 2
  • 7
2

In Python, you can try:

import pandas as pd
from encodings.aliases import aliases

# All known encoding names in the standard library
alias_values = set(aliases.values())

# Brute force: try each encoding until one parses the file
for encoding in alias_values:
    try:
        df = pd.read_csv("test.csv", encoding=encoding)
        print('successful', encoding)
    except Exception:
        pass
Md Kaish Ansari
  • 251
  • 2
  • 7
1

As mentioned by @3724913 (Jitender Kumar), you can use the file command (it also works in WSL on Windows). Plain file blah.csv won't show the encoding info on my system, but using the options documented in man file, I was able to get the encoding information of a CSV file by executing file --exclude encoding blah.csv.

Shobeira
  • 29
  • 5
0
import os

import chardet
import pandas as pd

def read_csv(path: str, size: float = 0.10) -> pd.DataFrame:
    """
    Reads the CSV file located at path and returns it as a Pandas
    DataFrame. The leading fraction of the file given by size is used to
    detect the encoding; the whole file is then read using that encoding.

    Args:
        path (str): The path to the CSV file.
        size (float): The fraction of the file to be used for detecting
            the encoding. Defaults to 0.10.

    Returns:
        pd.DataFrame: The CSV file as a Pandas DataFrame.

    Raises:
        UnicodeError: If the encoding of the file cannot be detected with
            the initial size, the function will retry with a larger size
            (increased by 0.20) until the encoding can be detected or an
            error is raised.
    """
    try:
        # Number of bytes to sample for encoding detection
        byte_size = int(os.path.getsize(path) * size)

        with open(path, "rb") as rawdata:
            result = chardet.detect(rawdata.read(byte_size))

        return pd.read_csv(path, encoding=result["encoding"])

    except UnicodeError:
        # Detection failed; retry with a larger sample of the file
        return read_csv(path=path, size=size + 0.20)

I just added a function that finds the correct encoding and reads the CSV at the given file path. I thought it would be useful.
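
For example (assuming a data.csv exists next to the script; the path is hypothetical):

df = read_csv("data.csv")
print(df.head())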

sri dhurkesh
  • 11
  • 1
  • 1
  • 3
-3

Just add the encoding argument that matches the file you're trying to load.

open('example.csv', encoding='UTF8')
Sven Eberth
  • 3,057
  • 12
  • 24
  • 29
Alani
  • 73
  • 6