When you force the input encoding to Latin-1, you are basically wrecking any input files which are not actually Latin-1. For example, a Russian text file containing the text привет in code page 1251 will silently be translated to ïðèâåò. (The same text in the UTF-8 encoding would map to the similarly bogus but completely different string Ð¿ÑÐ¸Ð²ÐµÑ, where each Ñ is followed by an invisible control character.)
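You can reproduce the damage in a Python session. This is just a small sketch with made-up variable names: encode the string the way the file actually stores it, then decode the bytes with the wrong codec.

```python
# Reproduce the mojibake: encode as the file really is, decode as Latin-1.
text = "привет"

# Bytes as stored in a code page 1251 file, misread as Latin-1.
from_cp1251 = text.encode("cp1251").decode("latin-1")

# Bytes as stored in a UTF-8 file, misread as Latin-1.
from_utf8 = text.encode("utf-8").decode("latin-1")

print(from_cp1251)               # ïðèâåò
print(len(from_utf8))            # 12: each Cyrillic letter became two characters
print(from_cp1251 == from_utf8)  # False
```

Note that neither decode raises an error: Latin-1 assigns a character to every possible byte, which is exactly why the corruption is silent.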
The sustainable solution is to, first, correctly identify the input encoding of each file, and then, second, choose an output encoding which can accommodate all of the input encodings correctly.
I would choose UTF-8 for output, but any Unicode variant will technically work. If you need to pass the result to something more or less braindead (cough Microsoft cough Java) maybe UTF-16 will be more convenient for your use case.
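As a rough sketch of "known input encoding in, UTF-8 out" using only the standard library: the helper name transcode_to_utf8 and the file names below are invented for this example, not anything from your code.

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding):
    """Rewrite a text file as UTF-8, given its real input encoding."""
    # newline="" keeps line endings exactly as they are in the source.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo with a throwaway file; the names are hypothetical.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample_ru.csv")
dst = os.path.join(workdir, "sample_ru.utf8.csv")

# Create a small code page 1251 file to convert.
with open(src, "wb") as f:
    f.write("id,word\n1,привет\n".encode("cp1251"))

transcode_to_utf8(src, dst, "cp1251")

# Read the converted file back as UTF-8.
with open(dst, encoding="utf-8") as f:
    result = f.read()
```

If you stay inside pandas, the same idea should amount to reading each file with the correct encoding and writing the merged frame with to_csv(..., encoding='utf-8').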
    import glob

    import pandas as pd

    data = dict()
    for file in glob.glob("DR_BigData_*.csv"):
        # Pick the encoding from a hint in the file name.
        if 'ru' in file:
            enc = 'cp1251'
        elif 'it' in file:
            enc = 'latin-1'
        # ... add more here
        else:
            raise KeyError("I don't know the encoding for %s" % file)
        data[file] = pd.read_csv(file, encoding=enc)
    # ... merge data[] as previously
The if statement is really just a placeholder for something more useful; without access to your files, I have no idea how your files are named, or which encodings to use for which ones. This simplistically assumes that files in Russian would all have the substring "ru" in their names, and that you want to use a specific encoding for all of those.
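If the file names really do carry a language tag, a lookup table tends to be easier to extend than a growing if chain. This is only a sketch: the naming scheme DR_BigData_<lang>.csv, the regular expression, and the helper encoding_for are all assumptions on my part, since I can't see your actual files.

```python
import re

# Hypothetical mapping from a language tag in the file name to an encoding.
ENCODINGS = {
    "ru": "cp1251",
    "it": "latin-1",
    # ... add more here
}

# Assumes names like DR_BigData_ru.csv; adjust to the real naming scheme.
TAG_PATTERN = re.compile(r"DR_BigData_([a-z]{2})\.csv$")


def encoding_for(filename):
    """Return the encoding for a file name, or raise if it is unknown."""
    match = TAG_PATTERN.search(filename)
    if match is None or match.group(1) not in ENCODINGS:
        raise KeyError("I don't know the encoding for %s" % filename)
    return ENCODINGS[match.group(1)]
```

The loop body would then shrink to something like pd.read_csv(file, encoding=encoding_for(file)). Matching an explicit tag also avoids accidental hits from a bare substring test, where "ru" could appear anywhere in a path.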
If you only have two encodings, and one of them is UTF-8, this is actually quite easy; try to decode as UTF-8, then if that doesn't work, fall back to the other encoding:
    for file in glob.glob("DR_BigData_*.csv"):
        try:
            data[file] = pd.read_csv(file, encoding='utf-8')
        except UnicodeDecodeError:
            data[file] = pd.read_csv(file, encoding='latin-1')
This is likely to work simply because text which is not valid UTF-8 will typically raise a UnicodeDecodeError very quickly. The encoding is designed so that bytes with the 8th bit set have to adhere to a very specific pattern. This is a useful feature, not something you should feel frustrated about. Not getting the correct data from the file is much worse.
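To see that strictness at work without pandas, here is a minimal sketch on raw bytes. The function name decode_with_fallback is mine, purely for illustration.

```python
def decode_with_fallback(raw, fallback="latin-1"):
    """Decode bytes as UTF-8, falling back to another encoding on failure."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the one other encoding we expect.
        return raw.decode(fallback), fallback


utf8_bytes = "perché".encode("utf-8")
latin1_bytes = "perché".encode("latin-1")

print(decode_with_fallback(utf8_bytes))    # ('perché', 'utf-8')
print(decode_with_fallback(latin1_bytes))  # ('perché', 'latin-1')
```

The lone byte 0xE9 in the Latin-1 version looks like the start of a multi-byte UTF-8 sequence with nothing after it, so the UTF-8 decode fails immediately. Keep in mind that the Latin-1 fallback itself can never fail, so a file in some third encoding would still be mangled silently; the trick only holds when there really are just two candidates.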
If you don't know what encodings are, now would be a good time to finally read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
As an aside, your computer already knows which directory it's in; you basically never need to call os.getcwd() unless you need to find out the absolute path of the current directory.
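A tiny illustration of that point, with a hypothetical file name: relative names are resolved against the current directory for you, and the absolute form is there if you ever need to display or log it.

```python
import os

# A relative name is resolved against the current directory automatically,
# so open(), glob.glob() and pd.read_csv() can take it as-is.
relative = "DR_BigData_example.csv"  # hypothetical file name

# Only when the absolute path itself is needed does the cwd matter.
absolute = os.path.abspath(relative)
same = absolute == os.path.join(os.getcwd(), relative)
```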