I want to read a CSV file and perform some operations on it. I wrote a program for my requirement, but I'm not getting output because the file is very large, i.e. ~5GB.
I'm using simple calls such as open and readline, etc. Meanwhile I explored the memory-mapped file support in Python (mmap), but I didn't understand how to implement it.
Can anyone help me implement reading of a large CSV file using mmap, or any other way, so that I can reduce the runtime of my application?
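From the docs, my understanding so far is that a memory-mapped file can be read line by line much like a normal file object (the sample file here is just for illustration):

```python
import mmap

# Tiny sample file so the snippet is self-contained
with open("sample.csv", "w") as f:
    f.write("line_id,time_gap\n1,0.5\n2,1.25\n")

with open("sample.csv", "rb") as f:
    # Map the whole file read-only; length 0 means "entire file"
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # mmap objects support readline(); it returns bytes ending in b"\n"
    lines = [line for line in iter(mm.readline, b"")]
    mm.close()

print(len(lines))  # header + 2 data rows
```

Is this the right way to use it, and would it actually help for a file of this size?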
I'm reading one CSV file and I want to perform one task.
Task-
Read all the line_id values from the CSV file, find the unique line_ids, and for each unique line_id find its maximum time_gap. Once I have every unique line_id and its corresponding maximum time_gap, I want to write these two columns to another file, output.csv.
I previously created a program for this task, and it works for small input files, but it does not work for large files, i.e. ~2GB.
My Stuff-
import csv
import sys, getopt

def csv_dict_reader(file_obj):
    # First pass: collect every line_id in the file
    listOfLineId = []
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        listOfLineId.append(line['line_id'])
    set1 = set(listOfLineId)

    # For each unique line_id, re-read the whole file to find its max time_gap
    new_dict = dict()
    for se in set1:
        f1 = open("latency.csv")
        readerInput = csv.DictReader(f1, delimiter=',')
        for inpt in readerInput:
            if se == inpt['line_id']:
                gap = float(inpt['time_gap'])  # compare numerically, not as strings
                if se not in new_dict or new_dict[se] < gap:
                    new_dict[se] = gap
        f1.close()
    print(new_dict)
    write_dict(new_dict)

def write_dict(new_dict):
    name_list = ['line_id', 'time_gap']
    f = open('finaloutput.csv', 'w', newline='')
    writer = csv.DictWriter(f, delimiter=',', fieldnames=name_list)
    writer.writeheader()
    for key, value in new_dict.items():
        writer.writerow({'line_id': key, 'time_gap': value})
    f.close()
    print("check finaloutput.csv file...")

if __name__ == "__main__":
    argv = sys.argv[1:]
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print('test.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('test.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    with open(inputfile) as f_obj:
        csv_dict_reader(f_obj)
How can I reduce the execution time of my application?
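If I understand my own bottleneck correctly, the code above re-reads the entire file once per unique line_id. One idea I'm considering is a single pass that keeps a running maximum per line_id in a dict. A sketch of that (the function name is mine, and it assumes time_gap parses as a number):

```python
import csv

def max_gap_per_line(in_path, out_path):
    """Single pass: keep the running maximum time_gap for each line_id."""
    max_gap = {}
    with open(in_path, newline='') as f:
        for row in csv.DictReader(f):
            gap = float(row['time_gap'])  # assumes time_gap is numeric
            lid = row['line_id']
            # Update the running maximum for this line_id
            if lid not in max_gap or gap > max_gap[lid]:
                max_gap[lid] = gap

    # Write one row per unique line_id with its maximum time_gap
    with open(out_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['line_id', 'time_gap'])
        writer.writeheader()
        for lid, gap in sorted(max_gap.items()):
            writer.writerow({'line_id': lid, 'time_gap': gap})
```

Since this only ever holds one row plus the dict of maxima in memory, would it scale to a ~5GB file, or is mmap still worth pursuing?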