The pandas read_table function is skipping some lines in a file I'm trying to read, and I can't figure out why.
import pandas as pd
import numpy as np
filename = "whatever.txt"
df_pd = pd.read_table(filename, usecols=['FirstColumn'], skip_blank_lines=False)
df_np = np.genfromtxt(filename, usecols=0)
# Count the lines in the file by iterating over it one line at a time
def file_len(fname):
    count = 0
    with open(fname) as f:
        for count, _ in enumerate(f, start=1):
            pass
    return count
len_pd = len(df_pd)
len_np = len(df_np)
len_linebyline = file_len(filename)
Unfortunately I can't share my actual data: it's a huge file (30 columns x 58 million rows) and it is protected by licensing. For some reason the numpy and file_len methods give the correct length of ~58 million rows, but the pandas DataFrame only has ~55 million.
Does anyone have any ideas as to what could be causing this or how I could investigate it?
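One way to start investigating, sketched here under an assumption the data above does not confirm: by default pandas treats `"` as a quote character, so a field that opens a quote without closing it on the same line can absorb the following lines into a single row. The helper below (`find_odd_quote_lines` is a hypothetical name, not part of pandas) scans the file with plain Python and reports lines containing an odd number of quote characters, which are candidates for that kind of merge.

```python
def find_odd_quote_lines(fname, quotechar='"', limit=10):
    """Return up to `limit` (line_number, text) pairs for lines that
    contain an odd number of `quotechar` characters."""
    hits = []
    # errors="replace" keeps the scan going if the encoding guess is wrong
    with open(fname, errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if line.count(quotechar) % 2 == 1:
                hits.append((lineno, line.rstrip("\r\n")))
                if len(hits) >= limit:
                    break
    return hits
```

If this returns any lines, it may be worth re-reading the file with `quoting=csv.QUOTE_NONE` passed to `pd.read_table` and checking whether the row count then matches the other two methods. An empty result would suggest the cause lies elsewhere.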