I have a text file containing roughly 70 million rows of text, encoded in UTF-16. Its line count and reported encoding are:
70357719 new.file
new.file: text/plain; charset=utf-16le
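For reference, the row count can be cross-checked outside pandas by decoding the file directly; a minimal sketch (using io.open with the same utf-16 codec I pass to read_csv):

import io

# Sanity check independent of pandas: count rows after decoding
# the whole file as UTF-16.
rows = 0
with io.open('new.file', encoding='utf-16') as f:
    for line in f:
        rows += 1
print "Decoded rows:", rows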
When I use pandas read_csv with the encoding set to utf-16, it only imports a fraction of the rows.
Using the following test code:
import pandas as pd

# Read with no explicit encoding, one row per line
data = pd.read_csv('new.file', names=['Text'], sep="\n")
print "Plain:", len(data)

# Read again, this time decoding the file as UTF-16
data = pd.read_csv('new.file', names=['Text'], encoding="utf-16", sep="\n")
print "utf-16", len(data)
This produces the following output:
Plain: 215585254
utf-16 65446415
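If per-batch counts help with diagnosis, the same UTF-16 read can be streamed in chunks; a minimal sketch (the chunk size of 1,000,000 rows is arbitrary):

import pandas as pd

# Same read as above, but streamed in chunks so the running total
# can be inspected as it grows.
total = 0
reader = pd.read_csv('new.file', names=['Text'], encoding='utf-16',
                     sep="\n", chunksize=1000000)
for chunk in reader:
    total += len(chunk)
print "utf-16 chunked total:", total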
I'm using Python 2.7, and I have already checked the file for empty rows (there are none).
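A minimal sketch of the kind of empty-row check I mean, again decoding as UTF-16:

import io

# Count rows that are empty or whitespace-only after UTF-16 decoding.
empty = 0
with io.open('new.file', encoding='utf-16') as f:
    for line in f:
        if not line.strip():
            empty += 1
print "Empty rows:", empty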
Basically, I'm at a loss for what to try next; I need every row of this file to be imported.