More efficient way to go through .csv file?

Question

I'm trying to parse through a dictionary in a .CSV file, using two lists in separate .txt files so that the script knows what it is looking for. The idea is to find a line in the .CSV file that matches both a Word and an IDNumber, and then pull out a third variable if there is a match. However, the code runs really slowly. Any ideas how I could make it more efficient?

import csv

IDNumberList_filename = 'IDs.txt'
WordsOfInterest_filename = 'dictionary_WordsOfInterest.txt'
Dictionary_filename = 'dictionary_individualwords.csv'

WordsOfInterest_ReadIn = open(WordsOfInterest_filename).read().split('\n')
#IDNumberListtoRead = open(IDNumberList_filename).read().split('\n')

for CurrentIDNumber in open(IDNumberList_filename).readlines():
    for CurrentWord in open(WordsOfInterest_filename).readlines():
        FoundCurrent = 0

        with open(Dictionary_filename, newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                if ((row['IDNumber'] == CurrentIDNumber) and (row['Word'] == CurrentWord)):
                    FoundCurrent = 1
                    CurrentProportion= row['CurrentProportion']

            if FoundCurrent == 0:
                CurrentProportion=0
            else:
                CurrentProportion=1
                print('found')

Can you provide an example of how you want the output to be displayed? — serk, Aug 14 '15 at 13:06
This code has O(mn) complexity, where `m` and `n` are the numbers of words and IDs in the respective files. No wonder it is really slow. Does it really need to check every possible combination of ID and word? — J0HN, Aug 14 '15 at 13:08
What is the point of setting `CurrentProportion= row['CurrentProportion']` if you are just going to set it to 0 or 1 before it is used? — Sam Cohen-Devries, Aug 14 '15 at 13:15
How big are `dictionary_WordsOfInterest.txt` and `IDs.txt`? Can you read them all at once? If so, I'd suggest storing them in a `set()` and using the `in` operator (e.g. `a = set([1,2,3]); 1 in a`). The average lookup time on a set is O(1). — tmrlvi, Aug 14 '15 at 13:18
Thanks... The CurrentProportion = 1 is just a placeholder at the moment. The reason I'm setting CurrentProportion to zero is how I want the output: if there is no Proportion in the file (because there is no match for PID and CurrentWord), then I want to set it to 0. — SimonSchus, Aug 14 '15 at 13:56

score 2 · Answer 1 · answered Aug 14 '15 at 13:19

First of all, consider loading dictionary_individualwords.csv into memory. A Python dictionary is probably the right data structure for this case.
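A minimal sketch of that idea, assuming the CSV fits in memory and has the `IDNumber`, `Word` and `CurrentProportion` columns used in the question. The helper names `load_proportions` and `lookup` are hypothetical, and the sample data is made up so the snippet runs on its own:

```python
import csv
import io

def load_proportions(csvfile):
    """Index the CSV rows by (IDNumber, Word) so each lookup is one dict access."""
    reader = csv.DictReader(csvfile)
    return {(row['IDNumber'], row['Word']): row['CurrentProportion']
            for row in reader}

def lookup(table, id_number, word):
    """Return the stored proportion, or 0 when the pair is not in the CSV."""
    return table.get((id_number, word), 0)

# Made-up sample standing in for dictionary_individualwords.csv.
sample = io.StringIO(
    "IDNumber,Word,CurrentProportion\n"
    "101,apple,0.25\n"
    "102,pear,0.50\n"
)
table = load_proportions(sample)
print(lookup(table, '101', 'apple'))  # pair present in the sample
print(lookup(table, '101', 'pear'))   # pair absent, falls back to 0
```

With a real file, the `io.StringIO` object would be replaced by the handle from `open(Dictionary_filename, newline='', encoding='utf-8')`. If the same (IDNumber, Word) pair occurs on several rows, this sketch keeps only the last one.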

Dmitriy Sorochenkov

score 1 · Answer 2 · answered Aug 14 '15 at 13:27

Since you use readlines for the .txt files, you already build an in-memory list from each of them. You should build those lists first and then parse the csv file only once. Something like:

import csv

IDNumberList_filename = 'IDs.txt'
WordsOfInterest_filename = 'dictionary_WordsOfInterest.txt'
Dictionary_filename = 'dictionary_individualwords.csv'

# Read both lists once. splitlines() drops the trailing '\n' that
# readlines() keeps, which would make every comparison below fail.
with open(IDNumberList_filename) as f:
    numberlist = f.read().splitlines()
with open(WordsOfInterest_filename) as f:
    wordlist = f.read().splitlines()

# Every (ID, word) pair starts at 0, meaning "not found".
proportions = {(n, w): 0 for n in numberlist for w in wordlist}

# Parse the csv file a single time.
with open(Dictionary_filename, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        for CurrentIDNumber in numberlist:
            for CurrentWord in wordlist:
                if row['IDNumber'] == CurrentIDNumber and row['Word'] == CurrentWord:
                    proportions[(CurrentIDNumber, CurrentWord)] = row['CurrentProportion']
                    print('found')

Beware: untested

Thanks! I shall try to have a go with this, and let you know. The .csv has >100,000 rows and 100 columns. — SimonSchus, Aug 14 '15 at 13:53

score 1 · Accepted Answer · edited May 23 '17 at 10:28

You are opening the CSV file N times, where N = (# lines in IDs.txt) * (# lines in dictionary_WordsOfInterest.txt). If the file is not too large, you can avoid that by saving its content to a dictionary or a list of lists.

In the same way, you reopen dictionary_WordsOfInterest.txt every time you read a new line from IDs.txt.

Also, it seems that you are looking for every possible (CurrentIDNumber, CurrentWord) pair from the txt files. So, for example, you can store the IDs in one set and the words in another, and for each row in the csv file check whether both the ID and the word are in their respective sets.
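The set-based approach could look roughly like the sketch below. It assumes the column names from the question; the function name `find_matches` is hypothetical, and the IDs, words and CSV rows are made-up stand-ins for the three files:

```python
import csv
import io

def find_matches(csvfile, ids, words):
    """Read the CSV once and keep rows whose ID and word are both wanted."""
    found = {}
    for row in csv.DictReader(csvfile):
        # Set membership tests take constant time on average.
        if row['IDNumber'] in ids and row['Word'] in words:
            found[(row['IDNumber'], row['Word'])] = row['CurrentProportion']
    return found

# Made-up contents standing in for IDs.txt and dictionary_WordsOfInterest.txt.
# splitlines() removes the newline characters before the values go into the sets.
ids = set("101\n103\n".splitlines())
words = set("apple\npear\n".splitlines())

# Made-up sample standing in for dictionary_individualwords.csv.
sample = io.StringIO(
    "IDNumber,Word,CurrentProportion\n"
    "101,apple,0.25\n"
    "102,pear,0.50\n"
    "103,pear,0.75\n"
)
matches = find_matches(sample, ids, words)

# Pairs that never showed up in the CSV get 0, as the asker wanted.
for id_number in sorted(ids):
    for word in sorted(words):
        print(id_number, word, matches.get((id_number, word), 0))
```

This way the CSV is read once and each row costs two set lookups, instead of the whole file being re-read for every (ID, word) combination.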

edited May 23 '17 at 10:28

Community

answered Aug 14 '15 at 13:32

Bernard

Hi there, thank you for your excellent suggestions. The ID and the Word are definitely all in the set at least for this file; it is just a case of finding them. However, I can probably sort them. You have definitely pointed me in the right direction of where the code is slowing down so I'm going to work on those aspects. – SimonSchus Aug 14 '15 at 13:52
