The pandas read_table function is dropping some lines from a file I'm trying to read, and I can't figure out why.

import pandas as pd
import numpy as np
filename = "whatever.txt"

df_pd = pd.read_table(filename, usecols=['FirstColumn'], skip_blank_lines=False)
df_np = np.genfromtxt(filename, usecols=0)

# Count the lines in the file one by one (returns 0 for an empty file)
def file_len(fname):
    with open(fname) as f:
        return sum(1 for _ in f)

len_pd = len(df_pd)
len_np = len(df_np)
len_linebyline = file_len(filename)

Unfortunately I can't share my actual data because it's a huge file (30 columns x 58 million rows) and it is also protected by licensing. For some reason the numpy and file_len methods give the correct length of ~58 million rows, but the pandas method only returns ~55 million.

Does anyone have any ideas as to what could be causing this or how I could investigate it?

jesseWUT
  • Please provide a __reproducible__ sample (use fake data) data set - 3-5 rows would be enough... Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU - stand with Ukraine Feb 07 '17 at 19:42
  • @MaxU the first column which I refer to in this example is just integers which function as an ID. I don't know how I'd provide a reproducible sample because it seems like most of the data is fine, but there is a chunk of it somewhere in the middle causing the problem, and I don't know how that chunk is different or where it is. Thanks for the referral to that other question though – jesseWUT Feb 07 '17 at 19:50
  • The probability that someone will guess your problem not being able to see the reproducible data set is very low... So you would have to analyze the problem, find which data is missing on the Pandas side and after that you will either know the reason or will be able to provide a __reproducible__ data set. Just my $0.02 – MaxU - stand with Ukraine Feb 07 '17 at 19:57
  • @MaxU I appreciate your advice. I guess I'm just frustrated because I don't know how to reproduce the problem and if I could I would likely be able to solve the problem myself. I understand that the question as phrased is difficult to answer though – jesseWUT Feb 07 '17 at 20:01

1 Answer

You can try to find the missing data by comparing the values pandas actually read against the full set of expected values, using np.setdiff1d:

In [31]: df = pd.DataFrame({'col':[0,1,2,3,4,6,7,8]})

In [32]: a = np.arange(10)

In [33]: df
Out[33]:
   col
0    0
1    1
2    2
3    3
4    4
5    6
6    7
7    8

In [34]: a
Out[34]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [35]: np.setdiff1d(a, df.col)
Out[35]: array([5, 9])
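
The same comparison can be run against a file on disk. Below is a minimal sketch, not a confirmed diagnosis: the sample file, its contents and the SecondColumn name are made up for illustration. It assumes the first column is a unique integer ID (as described in the comments) and that the file is tab-separated. It also illustrates one possible cause of missing rows: with its default quoting, pandas treats a double quote as the start of a quoted field, so an opening quote that is only closed several lines later can swallow the rows in between. Passing quoting=csv.QUOTE_NONE turns that behavior off.

```python
import csv
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up sample data: row 2 opens a double quote that is only closed on row 4.
sample = (
    "FirstColumn\tSecondColumn\n"
    "1\ta\n"
    "2\t\"b\n"
    "3\tc\n"
    "4\td\"\n"
    "5\te\n"
)

# Write the sample to a temporary file so the readers work on a real path.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    filename = tmp.name

try:
    # numpy splits every physical line on tabs and ignores quote characters.
    ids_np = np.genfromtxt(
        filename, delimiter="\t", usecols=0, skip_header=1, dtype=int
    )

    # pandas with default quoting: the quoted field spans rows 2 to 4.
    df_default = pd.read_table(filename, usecols=["FirstColumn"])

    # IDs that numpy saw but pandas did not.
    missing_ids = np.setdiff1d(ids_np, df_default["FirstColumn"])

    # Same read with quote handling disabled.
    df_noquote = pd.read_table(
        filename, usecols=["FirstColumn"], quoting=csv.QUOTE_NONE
    )
finally:
    os.remove(filename)

print("rows (numpy):", len(ids_np))
print("rows (pandas, default quoting):", len(df_default))
print("rows (pandas, QUOTE_NONE):", len(df_noquote))
print("IDs missing from pandas:", missing_ids)
```

Once the missing IDs are known, looking at the raw lines just before the first missing ID in the real file should show whether a stray quote, or something else entirely, is responsible.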
MaxU - stand with Ukraine