
I have a 6.6 GB (43 million row) .txt file containing about 20 columns of data.

I have the same data stored in a DB table, and I want to do simple spot-check comparisons between the two, such as row count, null count, and distinct count. I have done this kind of thing before in Pandas, but never with a dataset this large. I am trying to figure out how to read in that .txt file, or whether I even need to read it in entirely to do the above analysis.

Clearly this won't work, as it will just run indefinitely:

data = pd.read_csv('huge_file.txt', sep=" ", header=0)

Any suggestions?

JD2775
    There are a few options depending on what you want to do with the data. You could always read in a portion of the data with pandas using the `nrows` or `skiprows` argument of `read_csv`. – hoffee Jul 13 '18 at 18:58
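
For a quick sample before committing to a full pass, a minimal sketch of the `nrows` approach mentioned in the comment might look like this (the row count here is arbitrary):

import pandas as pd

# read only the first 1,000,000 rows as a quick sample of the file
sample = pd.read_csv('huge_file.txt', sep=" ", header=0, nrows=1_000_000)
print(sample.shape)
print(sample.isnull().sum())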

2 Answers


Use something like this:

import pandas as pd

chunksize = 10 ** 6  # one million rows per chunk
# sep=" " matches the space-separated file from the question
for chunk in pd.read_csv('huge_file.txt', sep=" ", header=0, chunksize=chunksize):
    do_cool_stuff(chunk)

Now it will read your file in chunks. Hope it helps!
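
For the spot checks mentioned in the question, the chunks can be aggregated as they stream by. A minimal sketch, assuming a space-separated file and a hypothetical column of interest called some_column:

import pandas as pd

chunksize = 10 ** 6
total_rows = 0
null_counts = None        # per-column null counts, accumulated across chunks
distinct_values = set()   # distinct values for one column of interest

for chunk in pd.read_csv('huge_file.txt', sep=" ", header=0, chunksize=chunksize):
    total_rows += len(chunk)
    counts = chunk.isnull().sum()
    null_counts = counts if null_counts is None else null_counts.add(counts, fill_value=0)
    # 'some_column' is a placeholder; its set of distinct values must fit in memory
    distinct_values.update(chunk['some_column'].dropna().unique())

print("rows:", total_rows)
print("nulls per column:", null_counts, sep="\n")
print("distinct values in some_column:", len(distinct_values))

Only the running totals and the distinct-value set are held in memory, so this scales to files much larger than RAM.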

Igor S

Not sure if it's helpful in your situation, but you could load that CSV into its own SQLite database and read it through a connection, which is significantly faster.

import pandas as pd
import sqlite3

# open a connection to the SQLite database file
conn = sqlite3.connect("DB_PATH")

# run your comparison query and load the result into a DataFrame
df = pd.read_sql(YOUR_QUERY, conn)
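
To get the text file into SQLite in the first place, it can be streamed in with pandas in chunks and the spot checks then run in SQL. A sketch under assumed names (spotcheck.db, huge_table, and some_column are placeholders):

import pandas as pd
import sqlite3

conn = sqlite3.connect("spotcheck.db")  # hypothetical database file

# stream the space-separated text file into a SQLite table in chunks
for chunk in pd.read_csv('huge_file.txt', sep=" ", header=0, chunksize=10 ** 6):
    chunk.to_sql("huge_table", conn, if_exists="append", index=False)

# run the spot checks in SQL rather than in pandas
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM huge_table", conn)
null_count = pd.read_sql("SELECT COUNT(*) AS n FROM huge_table WHERE some_column IS NULL", conn)
distinct_count = pd.read_sql("SELECT COUNT(DISTINCT some_column) AS n FROM huge_table", conn)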