
I have a text file containing ~7 million rows of text, encoded in UTF-16.

70357719 new.file

new.file: text/plain; charset=utf-16le

When I set pandas read_csv's encoding to utf-16, it only imports a fraction of the rows.

Using the following test code:

import pandas as pd 
data = pd.read_csv('new.file',names=['Text'],sep="\n")
print "Plain:",len(data)

data = pd.read_csv('new.file',names=['Text'],encoding="utf-16",sep="\n")
print "utf-16",len(data)

I get the following output:

'Plain:', 215585254
'utf-16', 65446415
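One likely reason the two counts disagree (a sketch of the mechanism, not a diagnosis of this exact file): without `encoding=`, the raw UTF-16 bytes get split on every 0x0A byte, and 0x0A can appear both as half of an encoded `'\n'` and inside the code units of unrelated characters, so the "Plain" count is meaningless. A small self-contained demonstration (the file name `tiny.txt` is made up here):

```python
# Sketch (not from the question): why counting b"\n" in the raw bytes of a
# UTF-16 file is misleading. We build a tiny 3-line UTF-16-LE file in which
# one character, u'\u010a', happens to contain an 0x0A byte.
import io

lines = [u"alpha", u"bet\u010aa", u"gamma"]
with io.open("tiny.txt", "w", encoding="utf-16-le") as f:
    f.write(u"\n".join(lines) + u"\n")

# Raw byte view: 3 real newlines plus one phantom 0x0A from u'\u010a'.
with open("tiny.txt", "rb") as f:
    raw = f.read()
raw_newline_bytes = raw.count(b"\n")

# Decoded view: the real line count.
with io.open("tiny.txt", encoding="utf-16-le") as f:
    decoded_line_count = sum(1 for _ in f)

print("raw 0x0A bytes: %d, decoded lines: %d"
      % (raw_newline_bytes, decoded_line_count))  # 4 vs 3
```

The mismatch can cut both ways at scale, which is why neither of the two printed counts above should be trusted until the file is decoded properly.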

I'm using Python 2.7, and I have already checked the file for empty rows (there are none).

Basically, I'm at a loss for what to try next; I need every row of this file to be imported.
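One next step worth trying (a hedged sketch, not a verified fix for this exact file): let Python itself decode the stream and count lines, to get a ground truth that is independent of pandas. The helper name `count_decoded_lines` and the file `demo.txt` below are illustrative, not from the question:

```python
# Sketch: count lines by decoding the UTF-16 stream line by line,
# independently of pandas.
import io

def count_decoded_lines(path, encoding="utf-16"):
    # "utf-16" honours a BOM if present; the question's file is utf-16le.
    with io.open(path, encoding=encoding) as f:
        return sum(1 for _ in f)  # streams lazily; fine for millions of rows

# Stand-in file for the demo; the real file would be checked with
# count_decoded_lines("new.file").
with io.open("demo.txt", "w", encoding="utf-16") as f:
    f.write(u"\n".join(u"row %d" % i for i in range(1000)) + u"\n")

print(count_decoded_lines("demo.txt"))  # 1000
```

If that count matches the expected ~7 million, the file decodes cleanly and the loss is happening on the pandas side; since `pd.read_csv` also accepts an already-open file object, passing it a decoded handle (`io.open('new.file', encoding='utf-16')`) instead of a path plus `encoding=` may then be worth trying.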

F.D
  • Take a look: https://stackoverflow.com/questions/38728366/pandas-cannot-load-data-csv-encoding-mystery and https://stackoverflow.com/questions/55316476/pandas-read-csv-not-reading-all-rows – rafaelc Mar 23 '19 at 17:52
  • Why are you using sep="\n"? – Burrito Mar 23 '19 at 17:53
  • RafaelC, the second link goes back to this question. | Burrito, to separate each line into a row; I'm aware names= would also do this. – F.D Mar 23 '19 at 17:56

0 Answers