I have looked at other answers but I am still not sure about the right way to do this. I have a number of really large .csv files (each can be a gigabyte or more), and I first want to get their column labels, because they are not all the same, and then, according to user preference, extract some of those columns based on some criteria. Before starting the extraction part I ran a simple test to see which is the fastest way to parse these files, and here is my code:
import csv
import mmap
import time

def mmapUsage():
    start = time.time()
    with open("csvSample.csv", "r+b") as f:
        # memory-map the file; size 0 means the whole file
        mapInput = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        L = list()
        for s in iter(mapInput.readline, ""):
            L.append(s)
        print "List length: ", len(L)
        #print "Sample element: ", L[1]
        mapInput.close()
    end = time.time()
    print "Time for completion", end - start
def fileopenUsage():
    start = time.time()
    fileInput = open("csvSample.csv")
    M = list()
    # plain iteration over the file object, one line per element
    for s in fileInput:
        M.append(s)
    print "List length: ", len(M)
    #print "Sample element: ", M[1]
    fileInput.close()
    end = time.time()
    print "Time for completion", end - start
def readAsCsv():
    X = list()
    start = time.time()
    # csv.reader parses every line into a list of fields
    spamReader = csv.reader(open('csvSample.csv', 'rb'))
    for row in spamReader:
        X.append(row)
    print "List length: ", len(X)
    #print "Sample element: ", X[1]
    end = time.time()
    print "Time for completion", end - start
And my results:
=======================
Populating list from Mmap
List length: 1181220
Time for completion 0.592000007629
=======================
Populating list from Fileopen
List length: 1181220
Time for completion 0.833999872208
=======================
Populating list by csv library
List length: 1181220
Time for completion 5.06700015068
So it seems that the csv library most people use is really a lot slower than the others. Maybe it will prove faster later, once I start actually extracting data from the csv files, but I cannot be sure of that yet. Any suggestions and tips before I start implementing? Thanks a lot!
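For reference, the kind of extraction I have in mind looks roughly like this (the column names and the filter criterion are only placeholders, not my real data):

def extractColumns(filename, wantedColumns, criterion):
    # read the header row first to learn the column labels,
    # then keep only the requested columns for rows matching the criterion
    with open(filename, 'rb') as f:
        reader = csv.reader(f)
        header = reader.next()
        indices = [header.index(c) for c in wantedColumns]
        rows = []
        for row in reader:
            if criterion(row):
                rows.append([row[i] for i in indices])
    return header, rows

# hypothetical usage: keep two columns for rows where the third field is non-empty
# header, rows = extractColumns("csvSample.csv", ["colA", "colB"], lambda r: r[2] != "")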