
I have a 6.6 GB (43 million row) .txt file containing about 20 columns of data.

I have the same data stored in a DB table, and I want to do simple spot-check comparisons between the two, such as row count, null count, and distinct count. I have done this kind of thing before in Pandas, but never with a dataset this large. I am trying to figure out how to read in that .txt file, or whether I even need to read it in entirely to do the above analysis.

Clearly this won't work, as it will just run indefinitely:

data = pd.read_csv('huge_file.txt', sep=" ", header=0)

Any suggestions?

JD2775
    There are a few options depending on what you want to do with the data. You could always read in a portion of the data with pandas using the `nrows` or `skiprows` argument of `read_csv`. – hoffee Jul 13 '18 at 18:58
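
For a quick sample before committing to a full pass, a minimal sketch of the `nrows` approach mentioned in the comment might look like this (the row count here is arbitrary):

import pandas as pd

# read only the first 1,000,000 rows as a quick sample of the file
sample = pd.read_csv('huge_file.txt', sep=" ", header=0, nrows=1_000_000)
print(sample.shape)
print(sample.isnull().sum())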

2 Answers


Use something like this:

import pandas as pd

chunksize = 10 ** 6  # one million rows per chunk
# sep=" " matches the space-separated file from the question
for chunk in pd.read_csv('huge_file.txt', sep=" ", header=0, chunksize=chunksize):
    do_cool_stuff(chunk)

Now it will read your file in chunks. Hope it helps!
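
For the spot checks mentioned in the question, the chunks can be aggregated as they stream by. A minimal sketch, assuming a space-separated file and a hypothetical column of interest called some_column:

import pandas as pd

chunksize = 10 ** 6
total_rows = 0
null_counts = None        # per-column null counts, accumulated across chunks
distinct_values = set()   # distinct values for one column of interest

for chunk in pd.read_csv('huge_file.txt', sep=" ", header=0, chunksize=chunksize):
    total_rows += len(chunk)
    counts = chunk.isnull().sum()
    null_counts = counts if null_counts is None else null_counts.add(counts, fill_value=0)
    # 'some_column' is a placeholder; its set of distinct values must fit in memory
    distinct_values.update(chunk['some_column'].dropna().unique())

print("rows:", total_rows)
print("nulls per column:", null_counts, sep="\n")
print("distinct values in some_column:", len(distinct_values))

Only the running totals and the distinct-value set are held in memory, so this scales to files much larger than RAM.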

Igor S

Not sure if it's helpful in your situation, but you could load that CSV into its own SQLite database and read it through a connection, which is significantly faster.

import pandas as pd
import sqlite3

# open a connection to the SQLite database file
conn = sqlite3.connect("DB_PATH")

# run your comparison query and load the result into a DataFrame
df = pd.read_sql(YOUR_QUERY, conn)
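
To get the text file into SQLite in the first place, it can be streamed in with pandas in chunks and the spot checks then run in SQL. A sketch under assumed names (spotcheck.db, huge_table, and some_column are placeholders):

import pandas as pd
import sqlite3

conn = sqlite3.connect("spotcheck.db")  # hypothetical database file

# stream the space-separated text file into a SQLite table in chunks
for chunk in pd.read_csv('huge_file.txt', sep=" ", header=0, chunksize=10 ** 6):
    chunk.to_sql("huge_table", conn, if_exists="append", index=False)

# run the spot checks in SQL rather than in pandas
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM huge_table", conn)
null_count = pd.read_sql("SELECT COUNT(*) AS n FROM huge_table WHERE some_column IS NULL", conn)
distinct_count = pd.read_sql("SELECT COUNT(DISTINCT some_column) AS n FROM huge_table", conn)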