I was given a "database" (more accurately, a huge, ugly CSV file) that contains the results of a "discovery" process. The rows are very short; they hold licensing information for over 65,000 computers and look like this:
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Publisher MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Outlook MUI (Spanish) 2010;14.0.7015.1000;20160216;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Groove MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Word MUI (Spanish) 2010;14.0.7015.1000;20151119;Microsoft Corporation
As you can see, it is a semicolon-separated file. It has the time the process was run, the PC's id, a blank field (I don't know what it is), the program name, and the program version. There are more fields, but I don't care about them; only those are relevant.
So I turned to Pandas to do some analysis (basically counting) and got around 3M records. The problem is that the file has over 7M records (I checked it with Notepad++ 64-bit). So, how can I use Pandas to analyze a file with this many records?
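For context, this is the kind of counting I want to do. Below is a minimal sketch that reads the file in chunks so memory is never the limiting factor; the chunk size and the use of column 3 as the program name are my own assumptions based on the sample rows above:

import pandas as pd

# Read in chunks and accumulate a per-program count (assumption: column 3
# holds the program name, as in the sample rows).
counts = None
for chunk in pd.read_csv("inventario.csv", delimiter=";", header=None,
                         usecols=[0, 1, 2, 3, 4], encoding="latin_1",
                         chunksize=500000):
    part = chunk[3].value_counts()
    counts = part if counts is None else counts.add(part, fill_value=0)

print(counts.sort_values(ascending=False).head(20))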
I'm using Python 3.5 and Pandas 0.19.2.
Adding info in response to Fabio's comment:
I'm using:
df = pd.read_csv("inventario.csv", delimiter=";",
                 header=None, usecols=[0, 1, 2, 3, 4],
                 encoding="latin_1")
To be very precise: the file has 7,432,175 rows, but Pandas only reads 3,172,197 of them. Curiously, if I load the file into Excel 2017 (using a data query), it loads exactly 3,172,197 rows as well.
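(For reference, the raw line count can be checked independently of Pandas with something like the sketch below; df here is the dataframe from the read_csv call above.)

with open("inventario.csv", encoding="latin_1") as f:
    raw_lines = sum(1 for _ in f)

print("raw lines:", raw_lines)   # about 7.4M
print("rows in df:", len(df))    # about 3.1M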
EDIT: After the comments, I checked the file and found that some lines (around 450) are corrupted. I don't know whether they were meant to signal an end of file; it doesn't look like it. In any case, I removed the malformed lines, and Pandas still reads only around 3M lines.
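A rough sketch of that kind of cleanup (assuming, from the sample rows above, that valid lines have exactly seven ';'-separated fields):

# Keep only the lines with 7 fields (6 separators) and write them to inv.csv
with open("inventario.csv", encoding="latin1") as src, \
     open("inv.csv", "w", encoding="latin1") as dst:
    for line in src:
        if line.count(';') == 6:
            dst.write(line)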
EDIT:
OK, I solved the problem, but please help me understand what I did wrong; this can't be the right way to do it. First, I cleaned the "strange" lines out of the file (there were around 500 of them) and saved the result as inv.csv.
Then I did the following:
f_inventario = open("inv.csv", "r", encoding="latin1")
lines = f_inventario.readlines()   # keep every raw line of the file
f_inventario.close()
df = pd.DataFrame(lines)
df.columns = ['data']
df['fecha'] = df.data.apply(lambda s: s.split(';')[0])
df['equipo'] = df.data.apply(lambda s: s.split(';')[1])
df['software'] = df.data.apply(lambda s: s.split(';')[2])
df['version'] = df.data.apply(lambda s: s.split(';')[3][:-1])   # [:-1] drops the trailing newline
df.drop(['data'], axis=1, inplace=True)
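(As a side note, the four apply() calls can be collapsed into a single vectorized split. Here is a sketch of an equivalent, using the same field indices as the code above; rstrip is used instead of [:-1], which is a slightly different but safer way to drop the newline:)

import pandas as pd

with open("inv.csv", "r", encoding="latin1") as f:
    lines = f.readlines()

df = pd.DataFrame({'data': lines})
parts = df['data'].str.split(';', expand=True)   # split every row once
df['fecha'] = parts[0]
df['equipo'] = parts[1]
df['software'] = parts[2]
df['version'] = parts[3].str.rstrip('\r\n')
df.drop(['data'], axis=1, inplace=True)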
And now I get my dataframe with the 7M rows. If I do df = pd.read_csv('inv.csv' ... ) instead, it only reads about 3M records.
I got my problem solved, but this is terrible; this is not how it should be. As far as I can see, it is not a memory problem. Could there be some global setting that tells read_csv to load only up to some maximum number of rows? I really don't know.
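One guess (and it is only a guess on my part): if the data contains a stray double-quote character, the parser can treat everything up to the next quote as a single field, swallowing many physical lines into one row, which would explain the lower row count. A sketch of a read that turns quote handling off entirely:

import csv
import pandas as pd

# Assumption: stray '"' characters are collapsing rows; QUOTE_NONE makes the
# parser treat quotes as ordinary characters instead of field delimiters.
df = pd.read_csv("inventario.csv", delimiter=";", header=None,
                 usecols=[0, 1, 2, 3, 4], encoding="latin_1",
                 quoting=csv.QUOTE_NONE)
print(len(df))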