
I have a lot of CSV files and I want to merge them into one CSV file. The thing is that the CSV files contain data in different languages like Russian, English, Croatian, Spanish, etc. Some of the CSV files even have their data written in multiple languages.

When I open the CSV files, the data looks perfectly fine, with every language rendered properly. I want to read all the CSV files as they are, in their original languages, and write them to one big CSV file.

The code I use is this:

import glob
import os

import pandas as pd

directory_path = os.getcwd()
all_files = glob.glob(os.path.join(directory_path, "DR_BigData_*.csv"))
print(all_files)
merge_file = 'data_5.csv'
df_from_each_file = (pd.read_csv(f, encoding='latin1') for f in all_files)
df_merged = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv(merge_file, index=False)

If I use "encoding='latin1'", it successfully writes all the CSV files into one but as you might guess, the characters are so messed up. Here is a part of the output as an example:

[Screenshot: the merged file opened in Excel, with the non-Latin characters garbled]

I also tried to write them into .xlsx using encoding='latin1', and I still encountered the same issue. In addition, I tried many other encodings, but those gave me decoding errors.

tripleee
3Pac
  • Could you use something like URL encoding of the special characters, then you have them stored correctly, and worry about how you convert them back to a printable character later? – JeffUK Dec 01 '20 at 11:46
  • Possible duplicate of https://stackoverflow.com/questions/10611455/what-is-character-encoding-and-why-should-i-bother-with-it – tripleee Dec 01 '20 at 12:25

2 Answers


When you force the input encoding to Latin-1, you are basically wrecking any input files which are not actually Latin-1. For example, a Russian text file containing the text привет in code page 1251 will silently be translated to ïðèâåò. (The same text in the UTF-8 encoding would map to the similarly bogus but completely different string пÑивеÑ.)
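You can reproduce this round trip in a few lines (the sample string and encodings are the ones from the paragraph above):

```python
# Decoding CP1251 bytes as Latin-1 silently produces mojibake.
text = "привет"
garbled = text.encode("cp1251").decode("latin-1")
print(garbled)  # ïðèâåò

# The same text encoded as UTF-8 and misread as Latin-1 gives
# different, equally bogus output.
print(text.encode("utf-8").decode("latin-1"))
```

No exception is raised in either case, which is exactly why the corruption goes unnoticed until you look at the output.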

The sustainable solution is to, first, correctly identify the input encoding of each file, and then, second, choose an output encoding which can accommodate all of the input encodings correctly.

I would choose UTF-8 for output, but any Unicode variant will technically work. If you need to pass the result to something more or less braindead (cough Microsoft cough Java) maybe UTF-16 will be more convenient for your use case.
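A sketch of the output side, using stand-in data in place of the merged frame from the question; the utf-8-sig variant is my suggestion for files destined for Excel, which otherwise tends to assume a legacy code page:

```python
import pandas as pd

# Stand-in for the merged frame from the question.
df_merged = pd.DataFrame({"word": ["привет", "doppelgänger", "hello"]})

# Plain UTF-8 can represent all of the input languages.
df_merged.to_csv("data_5.csv", index=False, encoding="utf-8")

# utf-8-sig prepends a byte order mark, which makes Excel detect UTF-8.
df_merged.to_csv("data_5_excel.csv", index=False, encoding="utf-8-sig")
```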

data = dict()
for file in glob.glob("DR_BigData_*.csv"):
    if 'ru' in file:
        enc = 'cp1251'
    elif 'it' in file:
        enc = 'latin-1'
    # ... add more here
    else:
        raise KeyError("I don't know the encoding for %s" % file)
    data[file] = pd.read_csv(file, encoding=enc)
# ... merge data[] as previously

The if statement is really just a placeholder for something more useful; without access to your files, I have no idea how your files are named, or which encodings to use for which ones. This simplistically assumes that files in Russian would all have the substring "ru" in their names, and that you want to use a specific encoding for all of those.
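One way to make the placeholder concrete is an explicit file-to-encoding table; the file names and encodings below are invented, so substitute the ones that match your actual data:

```python
import pandas as pd

# Hypothetical mapping from file name to its known encoding;
# replace with your real files.
ENCODINGS = {
    "DR_BigData_russian.csv": "cp1251",
    "DR_BigData_italian.csv": "latin-1",
    "DR_BigData_english.csv": "utf-8",
}

def read_all(encodings):
    """Read each CSV with its declared encoding."""
    return {file: pd.read_csv(file, encoding=enc)
            for file, enc in encodings.items()}
```

A file missing from the table then fails with a KeyError up front instead of being silently misdecoded.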

If you only have two encodings, and one of them is UTF-8, this is actually quite easy; try to decode as UTF-8, then if that doesn't work, fall back to the other encoding:

data = {}
for file in glob.glob("DR_BigData_*.csv"):
    try:
        data[file] = pd.read_csv(file, encoding='utf-8')
    except UnicodeDecodeError:
        data[file] = pd.read_csv(file, encoding='latin-1')

This is likely to work simply because text which is not valid UTF-8 will typically raise a UnicodeDecodeError very quickly. The encoding is designed so that bytes with the 8th bit set have to adhere to a very specific pattern. This is a useful feature, not something you should feel frustrated about. Not getting the correct data from the file is much worse.
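You can see the fast failure directly: the CP1251 bytes for привет start with 0xEF, which UTF-8 treats as the lead byte of a three-byte sequence, so decoding aborts on the very next byte:

```python
raw = "привет".encode("cp1251")  # b'\xef\xf0\xe8\xe2\xe5\xf2'
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # invalid continuation byte, detected immediately

# Latin-1, by contrast, never fails: every byte maps to *some* character,
# which is exactly why it silently produces garbage instead.
print(raw.decode("latin-1"))  # ïðèâåò
```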

If you don't know what encodings are, now would be a good time to finally read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

As an aside, your computer already knows which directory it's in; you basically never need to call os.getcwd() unless you need the absolute path of the current directory.

tripleee
  • For what it's worth, the Russian data in your screen shot looks like it's already in UTF-8. If all your input files are actually UTF-8, simply use that instead of Latin-1; no other code changes needed. (This would also wreck the occasional accented character in the Western text you currently import, like *doppelgänger* for example.) – tripleee Dec 01 '20 at 12:23
  • This partly helped me solve my problem. As you can see, I used pandas to do the merging. With this solution, I gave csv reader a chance. Similar to this one, I used try/except because I didn't know what "ru" means in "if 'ru' in file:". I have utf-8 and latin1 encodings. I got all the csv file names in a list, and in a for loop, if csv reader fails to read a csv file with utf-8 encoding, it will try to read it with latin1, and it actually worked. Thank you so much. PS: I am still curious how it can be solved with pandas. Maybe the method I used with csv reader will work with pandas too. – 3Pac Dec 02 '20 at 09:36
  • Again, "it worked" only in the sense that you didn't get a traceback; if the real encoding was not Latin-1, it means you now have garbage in the output file. – tripleee Dec 02 '20 at 09:38
  • `if 'ru' in file:` is a placeholder for something more useful; without access to your files, I have no idea which encodings to use for which ones, or even how they are named. You have to know for each file which encoding to use. If the files which are in Russian have "ru" in the file name, this would select those files; but probably you want something more specific which is correct for your actual files. I'll update the answer to explain this in more detail. – tripleee Dec 02 '20 at 09:39

If I understood your question correctly, you can easily merge all your csv files (as they are) using the cat command:

cat file1.csv file2.csv file3.csv ... > Merged.csv
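If each file starts with a header line, a common refinement (this uses `tail -q`, a GNU coreutils option) keeps only the first header:

```shell
# Keep the header from the first file, then append every file minus its header.
{ head -n 1 file1.csv; tail -q -n +2 file1.csv file2.csv file3.csv; } > Merged.csv
```

Note this still assumes all files share one encoding, like plain cat does.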
Apex
  • If all the files use the same encoding, this actually works, though if each file has a header line, you will get headers interspersed with the data. – tripleee Dec 01 '20 at 12:26