I was given a "database" (more accurately, a huge, ugly CSV file) that contains the results of a "discovery" process. The rows are very short; they hold licensing information for over 65,000 computers and look like this:
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Publisher MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Outlook MUI (Spanish) 2010;14.0.7015.1000;20160216;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Groove MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Word MUI (Spanish) 2010;14.0.7015.1000;20151119;Microsoft Corporation
As you can see, it is a semicolon-separated file. It has the time the process was run, the PC's id, a blank field (I don't know what it is), the program name, and the program version. There are more fields, but I don't care about them; only those are relevant.
So I turned to Pandas to do some analysis (basically counting) and got around 3M records. The problem is that the file has over 7M records (I checked it with Notepad++ 64-bit). So, how can I use Pandas to analyze a file with this many records?
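For context, this is the kind of counting I want to do. Below is a minimal sketch that reads the file in chunks so memory is never the limiting factor; the chunk size and the use of column 3 as the program name are my own assumptions based on the sample rows above:

import pandas as pd

# Read in chunks and accumulate a per-program count (assumption: column 3
# holds the program name, as in the sample rows).
counts = None
for chunk in pd.read_csv("inventario.csv", delimiter=";", header=None,
                         usecols=[0, 1, 2, 3, 4], encoding="latin_1",
                         chunksize=500000):
    part = chunk[3].value_counts()
    counts = part if counts is None else counts.add(part, fill_value=0)

print(counts.sort_values(ascending=False).head(20))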
I'm using Python 3.5 and Pandas 0.19.2.
Adding info in response to Fabio's comment:
I'm using:
df = pd.read_csv("inventario.csv", delimiter=";",
                 header=None, usecols=[0, 1, 2, 3, 4],
                 encoding="latin_1")
To be very precise: the file has 7,432,175 rows, but Pandas only reads 3,172,197 of them. Curiously, if I load the file into Excel 2017 (using a data query), it loads exactly 3,172,197 rows as well.
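(For reference, the raw line count can be checked independently of Pandas with something like the sketch below; df here is the dataframe from the read_csv call above.)

with open("inventario.csv", encoding="latin_1") as f:
    raw_lines = sum(1 for _ in f)

print("raw lines:", raw_lines)   # about 7.4M
print("rows in df:", len(df))    # about 3.1M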
EDIT: After the comments, I checked the file and found that some lines (around 450) are corrupted. I don't know whether they were meant to signal an end of file; it doesn't look like it. In any case, I removed the malformed lines, and Pandas still reads only around 3M lines.
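A rough sketch of that kind of cleanup (assuming, from the sample rows above, that valid lines have exactly seven ';'-separated fields):

# Keep only the lines with 7 fields (6 separators) and write them to inv.csv
with open("inventario.csv", encoding="latin1") as src, \
     open("inv.csv", "w", encoding="latin1") as dst:
    for line in src:
        if line.count(';') == 6:
            dst.write(line)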
EDIT:
OK, I solved the problem, but please help me understand what I did wrong; this can't be the right way to do it. First, I cleaned the "strange" lines out of the file (there were around 500 of them) and saved the result as inv.csv.
Then I did the following:
f_inventario = open("inv.csv", "r", encoding="latin1")
lines = f_inventario.readlines()   # keep every raw line of the file
f_inventario.close()
df = pd.DataFrame(lines)
df.columns = ['data']
df['fecha'] = df.data.apply(lambda s: s.split(';')[0])
df['equipo'] = df.data.apply(lambda s: s.split(';')[1])
df['software'] = df.data.apply(lambda s: s.split(';')[2])
df['version'] = df.data.apply(lambda s: s.split(';')[3][:-1])   # [:-1] drops the trailing newline
df.drop(['data'], axis=1, inplace=True)
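(As a side note, the four apply() calls can be collapsed into a single vectorized split. Here is a sketch of an equivalent, using the same field indices as the code above; rstrip is used instead of [:-1], which is a slightly different but safer way to drop the newline:)

import pandas as pd

with open("inv.csv", "r", encoding="latin1") as f:
    lines = f.readlines()

df = pd.DataFrame({'data': lines})
parts = df['data'].str.split(';', expand=True)   # split every row once
df['fecha'] = parts[0]
df['equipo'] = parts[1]
df['software'] = parts[2]
df['version'] = parts[3].str.rstrip('\r\n')
df.drop(['data'], axis=1, inplace=True)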
And now I get my dataframe with the 7M rows. If I do df = pd.read_csv('inv.csv' ... ) instead, it only reads about 3M records.
I got my problem solved, but this is terrible; this is not how it should be. As far as I can see, it is not a memory problem. Could there be some global setting that tells read_csv to load only up to some maximum number of rows? I really don't know.
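One guess (and it is only a guess on my part): if the data contains a stray double-quote character, the parser can treat everything up to the next quote as a single field, swallowing many physical lines into one row, which would explain the lower row count. A sketch of a read that turns quote handling off entirely:

import csv
import pandas as pd

# Assumption: stray '"' characters are collapsing rows; QUOTE_NONE makes the
# parser treat quotes as ordinary characters instead of field delimiters.
df = pd.read_csv("inventario.csv", delimiter=";", header=None,
                 usecols=[0, 1, 2, 3, 4], encoding="latin_1",
                 quoting=csv.QUOTE_NONE)
print(len(df))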