1

I have 2 files. One (let's say file F1) is a cvs file with around 200K rows and 3 columns. Each row represents an interval and rows are arranged in increasing order. For example:

1000,1340,yes
1400,1800,no
1810,2000,maybe
...

Another text file (F2) has around 10K rows and they are not ordered. I need to iterate through F2 and find row it belongs to, take the 3rd element from that row (yes, no or maybe) and append it to F2. For example, if F2=[1402,1100,1900], the updated F2 would be:

1402,no
1100,yes
1900,maybe

Is there any more elegant way of approaching this other than brute force approach? I was thinking to find the first element in F1 that is greater or equal than element in F2 we are using and then run a search (or binary search) of the remainder. Any hints would be appreciated.

user201411
  • 254
  • 2
  • 11

1 Answers1

0

If the total range in F1 is small enough, e.g. if the max(2nd column in F1) - min(1st column in F1) is only a few million items, then you can simply store everything in a list:

class StringCache(dict):
    def __missing__(self, key):
        return key

f1_list = [None] * NUMBER_OF_ITEMS
string_cache = StringCache()

with open('F1.csv') as f1:
    for col1, col2, col3 in csv.reader(f1):
         for index in xrange(int(col1), int(col2) + 1):
              f1_list[index] = string_cache[col3]

with open('F2.csv') as f2:
    for (col,) in csv.reader(f2):
        print(f1_list[int(col)])

Also see https://stackoverflow.com/a/4544699/416224 for an automatically growing list.

Community
  • 1
  • 1
Kijewski
  • 25,517
  • 12
  • 101
  • 143
  • OK, this was my mistake. I just illustrated the example with yes, no maybe to show what I want. In reality, yes, no, maybe are around 1000 or possibly more different words. – user201411 Nov 04 '14 at 05:29
  • I changed my answer not to use `{'no':1…}`. It is only needed so that the multiple instances of the same string won't fill up the memory. A `dict` with [`__missing__`](http://stackoverflow.com/a/6229253/416224) will work just as well. If there are no repeated strings, then you don't need `string_cache` at all. – Kijewski Nov 04 '14 at 05:41
  • I am a bit confused. What is NUMBER_OF_ITEMS? – user201411 Nov 04 '14 at 05:53
  • The value of `max(2nd column in F1) - min(1st column in F1)` if you know it in beforehand, otherwise the automatically growing list. – Kijewski Nov 04 '14 at 05:57
  • it doesn't work with int[col]. it should be int[col[0]]. Otherwise, it shows TypeError: int() argument must be a string or a number, not 'list'. Correct me if I am wrong. – user201411 Nov 04 '14 at 06:07
  • Yes, correct. `for col in csv.reader(f2):` should have been `for (col,) in csv.reader(f2):`. – Kijewski Nov 04 '14 at 06:11