
I want to read a CSV file and perform some operations on it. I created a program for this requirement, but I'm not getting output because the file is very large, i.e. ~5GB.

I'm using simple calls such as open, readline etc. Meanwhile I explored the memory-mapped file support in Python (mmap), but I didn't understand how to implement it.

Can anyone help me implement reading a large CSV file using mmap, or any other way, so that I can reduce the runtime of my application?

I'm reading one csv file and I want to perform one task.

Task-

I want to read one CSV file, collect all the line_id values from it, and find the unique line_ids. For each unique line_id I want to find the maximum time_gap. After getting every unique line_id and its corresponding maximum time_gap, I want to write these two columns to another file, output.csv.
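For example (hypothetical rows, using the column names from the code below), an input like

```
line_id,time_gap
A,1.5
B,0.2
A,3.0
B,0.9
```

should produce an output.csv of

```
line_id,time_gap
A,3.0
B,0.9
```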

I previously created a program for this task and it works for small input files, but it does not work for large files, i.e. ~2GB.

My Stuff-

import csv
import sys, getopt

def csv_dict_reader(file_obj):

    # First pass: collect every line_id
    listOfLineId = []
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        listOfLineId.append(line['line_id'])

    set1 = set(listOfLineId)
    new_dict = dict()

    # Second pass: the whole file is re-read once per unique line_id,
    # so this is O(rows * unique line_ids) -- the main bottleneck
    for se in set1:
        f1 = open("latency.csv")
        readerInput = csv.DictReader(f1, delimiter=',')
        for inpt in readerInput:
            if se == inpt['line_id']:
                gap = float(inpt['time_gap'])  # compare numerically, not as strings
                if se not in new_dict or new_dict[se] < gap:
                    new_dict[se] = gap
        f1.close()

    print new_dict
    write_dict(new_dict)

def write_dict(new_dict):

    name_list = ['line_id', 'time_gap']
    f = open('finaloutput.csv', 'wb')
    writer = csv.DictWriter(f, delimiter=',', fieldnames=name_list)
    writer.writeheader()
    for key, value in new_dict.iteritems():
        writer.writerow({'line_id': key, 'time_gap': value})
    f.close()
    print "check finaloutput.csv file..."


if __name__ == "__main__":

    argv = sys.argv[1:]
    inputfile = ''
    outputfile = ''
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print 'test.py -i <inputfile> -o <outputfile>'
        sys.exit(2)
    for opt, arg in opts:
       if opt == '-h':
          print 'test.py -i <inputfile> -o <outputfile>'
          sys.exit()
       elif opt in ("-i", "--ifile"):
          inputfile = arg
       elif opt in ("-o", "--ofile"):
          outputfile = arg

    with open(inputfile) as f_obj:
       csv_dict_reader(f_obj)

How can I reduce the execution time of my application?

ketan
  • mmap → BytesIO → csv – Ignacio Vazquez-Abrams Jul 06 '16 at 12:38
  • can you provide any links to refer to? I'm not able to find an implementation – ketan Jul 06 '16 at 12:43
  • no need for mmap, just process the file line by line as shown here: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files. – MKesper Jul 06 '16 at 12:43
  • @MKesper: Are you sure you read the question? – Ignacio Vazquez-Abrams Jul 06 '16 at 12:44
  • https://docs.python.org/3/library/mmap.html https://docs.python.org/3/library/io.html#io.BytesIO https://docs.python.org/3/library/csv.html – Ignacio Vazquez-Abrams Jul 06 '16 at 12:45
  • @kit, please provide some source. I'm quite sure your problem does not require mmap. – MKesper Jul 06 '16 at 12:46
  • @kit, have a look here: http://stackoverflow.com/questions/17246260/python-readlines-usage-and-efficient-practice-for-reading – MKesper Jul 06 '16 at 12:52
  • Is your operation row-independent? Or does it need the entire dataframe in memory, or at least multiple pieces? – Jeff Jul 06 '16 at 13:04
  • refer to my edit. I've provided code for that task – ketan Jul 06 '16 at 13:09
  • @Jeff L.- my operation is column based. Read above edited section. You will get clear idea about my task.! – ketan Jul 06 '16 at 13:14
  • No need for temp file. Build a dictionary. For every line, check if line_id is in dict and if the value for that id is bigger. If not, store the new line_id and the corresponding value. At the end iterate over the dict and write all the keys and values. In your code, you're iterating twice over your values, afaics. – MKesper Jul 06 '16 at 13:19
  • @MKesper- ok I will try with that... but explain with the help of some code so I can get better idea about what you want to say exactly! – ketan Jul 06 '16 at 13:23
  • @MKesper- I reduced code too much and removed temporary file also. But still It takes time. I'm not satisfied with that. I will edit again and put new code. Give any valuable suggestion for reducing further. Think about that two for loops that takes much more time.! – ketan Jul 06 '16 at 14:04

0 Answers