
I have a huge dataset (about 8.5M records) in a ".csv" file (it uses pipes instead of commas). I have no idea what its encoding is; since I live in Mexico and the data has accents (á, é, ...), I assume it's either latin or iso-8859-1.
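For an unknown encoding, one way to get a first guess (a sketch assuming the third-party chardet package is installed) is to sample the file's first bytes:

import chardet

# Read a sample of raw bytes and let chardet guess the encoding.
with open('file.csv', 'rb') as f:
    sample = f.read(100000)

print(chardet.detect(sample))
# e.g. {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}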

When I try to import the file into a DataFrame using pandas:

import pandas as pd

bmc = pd.read_csv('file.csv', sep='|',
                  error_bad_lines=False, encoding='iso-8859-1')

It reads nothing:

    ÿþF     Unnamed: 1  Unnamed: 2  Unnamed: 3  Unnamed: 4
0   NaN     NaN         NaN         NaN         NaN
1   NaN     NaN         NaN         NaN         NaN
2   NaN     NaN         NaN         NaN         NaN

If I don't specify iso-8859-1 or latin, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
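That 0xff at position 0 is the first half of a byte order mark: the ÿþ shown above in the DataFrame header is the byte pair 0xFF 0xFE rendered as latin-1, which is exactly the UTF-16 little-endian BOM. A quick check with the standard codecs module (a minimal sketch):

import codecs

# Compare the file's first two bytes against the UTF-16 LE BOM
# (b'\xff\xfe', which latin-1 renders as 'ÿþ').
with open('file.csv', 'rb') as f:
    print(f.read(2) == codecs.BOM_UTF16_LE)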

So, to convert the file to utf-8, I open it in Notepad++, which can handle huge files. I manually delete the ÿþ at the start of the file, change the encoding to utf-8, and save it as a new file.

Notepad++ says the file encoding is: UCS-2 LE BOM

The file size goes from 1.8 MB to about 0.9 MB, and now I can open this file with pandas without a problem.

So I think converting to utf-8 should be part of my preprocessing.

I used the solution from How to convert a file to utf-8 in Python? and created a function to convert several files:

import codecs

BLOCKSIZE = 1048576  # read/write in 1 MiB blocks

def convert_utf8(sourceFileName, targetFileName, sourceEncoding='iso-8859-1'):
    # Decode the source in blocks and re-encode each block as UTF-8.
    with codecs.open(sourceFileName, "r", sourceEncoding) as sourceFile:
        with codecs.open(targetFileName, "w", "utf-8") as targetFile:
            while True:
                contents = sourceFile.read(BLOCKSIZE)
                if not contents:
                    break
                targetFile.write(contents)
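I call it like this (the output file name is illustrative):

convert_utf8('file.csv', 'file_utf8.csv')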

Now the problem is that when the file is written, it has a NUL character after every valid character. Here is how it looks in the editor:

(screenshot: the converted file opened in the editor, showing a NUL between every pair of characters)
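The pattern matches what a UCS-2/UTF-16 LE file looks like byte for byte: every ASCII character is followed by a 0x00 byte, and decoding those bytes as iso-8859-1 preserves each 0x00 as a NUL character, which is then written straight into the UTF-8 output. A tiny reproduction of the effect:

# UTF-16 LE stores ASCII text with a 0x00 byte after every character.
data = 'Hola'.encode('utf-16-le')
print(data)  # b'H\x00o\x00l\x00a\x00'

# Decoding those bytes as iso-8859-1 keeps each 0x00 as a NUL
# character, so convert_utf8 copies them into the new file.
print(repr(data.decode('iso-8859-1')))  # 'H\x00o\x00l\x00a\x00'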

This file, of course, doesn't work in pandas. So far I have solved my problem with Notepad++, but there must be a better way, one where I don't have to rely on other tools.

luisfer
    Why convert the file? Just open it with the proper encoding: `bmc=pd.read_csv('file.csv', sep='|', error_bad_lines=False, encoding='utf-16')`. UTF-16 is a superset of UCS-2 which is obsolete. – Mark Ransom Jan 02 '19 at 22:03
  • Why does this work?????? It worked, and extremely fast, but I have no idea what *utf-16* is. I really hate this whole encoding thing. THANKS A LOT! – luisfer Jan 02 '19 at 22:06
    https://en.wikipedia.org/wiki/UTF-16 – Mark Ransom Jan 02 '19 at 22:08
  • Do you want to add it as an answer? Thanks again! – luisfer Jan 02 '19 at 22:10
    P.S. It's a shame Windows was an early adopter of Unicode. If they'd waited they could have standardized on UTF-8 like Linux mostly has. UCS-2 was the first version of Unicode encoding. – Mark Ransom Jan 02 '19 at 22:10
    I'm reluctant to add it as an answer, because while it solves your problem it isn't an actual answer to the question you asked. – Mark Ransom Jan 02 '19 at 22:11
  • The answer could be: read the df, then save it using utf-8 encoding with pandas ;-) – luisfer Jan 02 '19 at 22:32

1 Answer


To convert a file from one encoding to another in Python:

with open('file1.txt', encoding='utf16') as fin:
    with open('file2.txt', 'w', encoding='utf8') as fout:
        fout.write(fin.read())
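Note that fin.read() loads the whole file into memory at once; for a file with 8.5M records, copying in blocks keeps memory use flat. A minimal sketch using the standard library's shutil.copyfileobj:

import shutil

# Stream the re-encoding in chunks instead of reading the whole file.
with open('file1.txt', encoding='utf16') as fin:
    with open('file2.txt', 'w', encoding='utf8') as fout:
        shutil.copyfileobj(fin, fout, 1024 * 1024)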

But in your case, as Mark Ransom pointed out in a comment, just open with the appropriate encoding:

bmc = pd.read_csv('file.csv', sep='|', error_bad_lines=False, encoding='utf16')
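As a one-time preprocessing step, the same idea works with pandas alone, as the asker suggested in a comment: read the DataFrame once with utf16 and save a utf-8 copy for later runs (a sketch; the output file name is illustrative):

import pandas as pd

# Read the pipe-delimited UTF-16 file once...
bmc = pd.read_csv('file.csv', sep='|', error_bad_lines=False, encoding='utf16')

# ...then write a UTF-8 copy to use from now on.
bmc.to_csv('file_utf8.csv', sep='|', index=False, encoding='utf-8')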
Mark Tolonen
  • Of course I tried to do this, but when I did, it produced a file with NUL characters every other character – luisfer Jan 04 '19 at 23:25
  • @luisfer That's what UTF-16 is, a word-based encoding. You have to use an editor that can read UTF-16 encoding if you want to read it correctly. Notepad++ can do that. Pandas can read the file if you use the `utf16` encoding. If you don't want the UTF-16 encoding, convert it using the first part of my answer. – Mark Tolonen Jan 05 '19 at 00:35
  • Ohhhhh, got it. I didn't use utf-16 encoding, so I was getting nothing! That's much clearer now – luisfer Jan 05 '19 at 14:50