9

I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example that shows the problem.

First, make some fake data:

paste <(seq 20000000) <(seq 2 20000001)  > largefile.txt

Now here is a minimal piece of Python code to read it in and make a dictionary.

import sys
from collections import defaultdict
fin = open(sys.argv[1])

dict = defaultdict(list)

for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])

Timings:

time ./read.py largefile.txt
real    0m55.746s

However it is not I/O bound as:

time cut -f1 largefile.txt > /dev/null    
real    0m1.702s

If I comment out the dict line it takes 9 seconds. It seems that almost all the time is spent by dict[parts[0]].append(parts[1]).

Is there any way to speed this up? I don't mind using cython or even C if that is going to make a big difference. Or can pandas help here?

Here is the profile output on a file with 10000000 lines.

python -m cProfile read.py test.data

         20000009 function calls in 42.494 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 bisect.py:1(<module>)
        1    0.000    0.000    0.001    0.001 collections.py:1(<module>)
        1    0.000    0.000    0.000    0.000 collections.py:25(OrderedDict)
        1    0.000    0.000    0.000    0.000 collections.py:386(Counter)
        1    0.000    0.000    0.000    0.000 heapq.py:31(<module>)
        1    0.000    0.000    0.000    0.000 keyword.py:11(<module>)
        1   30.727   30.727   42.494   42.494 read.py:2(<module>)
 10000000    4.855    0.000    4.855    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 10000000    6.912    0.000    6.912    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}

Update. We can assume that parts[1] is an integer and that parts[0] is a short fixed length string.

My fake data isn't very good as you only get one value per key. Here is a better version.

perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt

The only operation I will do is to query a key to return the list of values associated with it.

Simd
  • You should try profiling your program – jh314 Aug 06 '13 at 17:22
  • I don't see how you conclude its not IO bound. Either way, you should get a decent improvement by reading in chunks (http://docs.python.org/2/library/mmap.html). How much speedup are you looking at? – user1827356 Aug 06 '13 at 17:24
  • Are you on a 32-bit or 64-bit system? How much RAM do you have? What platform? It's quite possible that most of the time is used up by allocating the gigabytes of memory it will take to store that dict. – abarnert Aug 06 '13 at 17:24
  • @user1827356: But `for line in fin:` is already reading in chunks, because it's buffered. Using larger chunks _might_ help, but it might not—the cost of concatenating/splitting/looping in Python instead of C (which also means more memory copying) is likely to outweigh the benefits. – abarnert Aug 06 '13 at 17:26
  • @jh314 If I comment out the `dict` line it takes 9 seconds. It seems that almost all the time is spent by `dict[parts[0]].append(parts[1])`. – Simd Aug 06 '13 at 17:26
  • As a side note, naming a dictionary `dict` is a very bad idea. Beyond the usual reasons, it happens to break the recursive sizeof recipe and a quick profile injector, both of which need to be able to see the `dict` type. – abarnert Aug 06 '13 at 17:27
  • If you really have to use Python dictionaries there can't be done very much. The creation of the dictionaries (which is probably the time-consuming part) already happens in Python's C-code. Maybe alternatives like a Sqlite-database are better. – Michael Butscher Aug 06 '13 at 17:35
  • @MichaelButscher I don't have to use python dictionaries. I just want something I can use to look up a key and give me a list (or array) or elements. – Simd Aug 06 '13 at 18:02
  • Are the keys truly random? Maybe you can get away by reading them into lists and having a more complicated look up function – user1827356 Aug 06 '13 at 18:08
  • Anyway, it's still worth profiling how long each of the parts of that single line take. Is it the dict getitem? The `__missing__` calls? Appending in-place to the lists? The dict setitem? Who knows without profiling? – abarnert Aug 06 '13 at 18:09
  • @abarnert (Added to question.) How do you get the level of detail you were talking about? – Simd Aug 06 '13 at 18:15
  • What do you want to do with the data afterwards? Maybe you can use the super optimised pandas `read_table` method and then do a `groupby`: `grouped = pd.read_table('largefile.txt', header=None).groupby(0)` took 4.97 seconds. But as I said it depends on what you intend to do with the data afterwards. – Viktor Kerkez Aug 08 '13 at 21:24
  • @ViktorKerkez I will perform lots of lookups into the dictionary to return the lists associated with them. I don't know pandas but it looks promising. – Simd Aug 09 '13 at 07:16
  • If your intent is to do more lookups (time wise) then building the dictionary then I would recommend using a database (e.g. SQLite) as this will be both portable and far better speed wise then most custom implementations in any other language. I can post you a short example if you wish. – Sergey L. Aug 09 '13 at 13:25
  • For this sort of key value type dictionary a NoSQL implementation such as Couchbase would be more suited, assuming the data has to persist for any length of time. For repeatedly loading data sets then something written in C is probably more suitable (see my answer). – jmc Aug 09 '13 at 13:52
  • @SergeyL. I will actually read from the dictionary the exact same number of times as the number of elements in it so I fear a DB solution may be too expensive. – Simd Aug 09 '13 at 17:13

4 Answers

8

If you want what you described in the comments, you can do it easily in pandas. Let's say you have a file with the same layout but with duplicated keys, since in your example you append all the duplicates to a list:

1 1
2 2
1 3
3 4
1 5
5 6

Then you can read and manipulate the data:

In [1]: import pandas as pd

In [2]: df = pd.read_table('largefile.txt', header=None, index_col=0)

In [3]: df.loc[2]
Out[3]:
1    2
Name: 2, dtype: int64

In [4]: df.loc[1]
Out[4]:
   1
0
1  1
1  3
1  5

Pandas stores everything in indexed DataFrame and Series objects, so don't worry too much about the shape of the output: the first column is the index, the second column is the one that matters, and it gives you the numbers you need.

You can do a lot more with pandas though... For example, you can group by the first column in your file and perform aggregations:

In [64]: df = pd.read_table('largefile.txt', header=None).groupby(0)
In [65]: df.sum()
Out[65]:
   1
0
1  9
2  2
3  4
5  6
In [66]: df.mean()
Out[66]:
   1
0
1  3
2  2
3  4
5  6    
In [67]: df[0].count()
Out[67]:
0
1    3
2    1
3    1
5    1
dtype: int64

I know this is not an answer to how to speed up the dictionary itself, but given what you mentioned in the comments, it could be an alternative solution.

Edit - Added timing

Compared to the fastest dictionary solution and loading data into pandas DataFrame:

test_dict.py

import sys
d = {}
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split(None, 1)
        d[parts[0]] = d.get(parts[0], []) + [parts[1]]

test_pandas.py

import sys
import pandas as pd
df = pd.read_table(sys.argv[1], header=None, index_col=0)

Timed on a Linux machine:

$ time python test_dict.py largefile.txt
real    1m13.794s
user    1m10.148s
sys     0m3.075s

$ time python test_pandas.py largefile.txt
real    0m10.937s
user    0m9.819s
sys     0m0.504s

Edit: for the new example file

In [1]: import pandas as pd
In [2]: df = pd.read_table('largefile.txt', header=None,
                           sep=' ', index_col=0).sort_index()
In [3]: df.index
Out[3]: Int64Index([0, 1, 1, ..., 9999998, 9999999, 9999999], dtype=int64)
In [4]: df[1][0]
Out[4]: 6301
In [5]: df[1][1].values
Out[5]: array([8936, 5983])
Viktor Kerkez
  • Could you give some profiling info on how this will compare to the OP's code and the other answer? – Andy Hayden Aug 09 '13 at 13:44
  • The fastest version from @abarnert post took on my machine 33.311 seconds. Loading into pandas DataFrame as shown in my example took 11.800 seconds. – Viktor Kerkez Aug 09 '13 at 14:03
  • Awesome! Could you put that in your answer? – Andy Hayden Aug 09 '13 at 14:06
  • The profiling stack for the pandas version is quite long (lots of C functions; it's highly optimized and written in Cython) and I think it would be useless to post it. I posted just the timing information. You can check for yourself, just do: `import pandas as pd; df = pd.read_table('largefile.txt', header=None, index_col=0)` And that is all you need to load the data into a DataFrame object with nonunique indexes (which is the case here). – Viktor Kerkez Aug 09 '13 at 14:41
  • Added (the first measurement was on a windows machine, it appears that on a linux box the difference is even bigger). – Viktor Kerkez Aug 09 '13 at 14:59
  • real 0m10.737s to do df = pd.read_table(sys.argv[1], header=None, index_col=0) on my system which is great. In your example I would like the answer to a query "1" to be [1,3, 5]. Is there a query that will give me that in time proportional to the length of [1,3,5]? – Simd Aug 09 '13 at 17:17
  • There are multiple ways of doing this. One of the ways would be to select the first column of the DataFrame (in this case the only column) and then get the value by its key: `df[1].loc[key]`. Which will return either a Series object if there are multiple values or a number if only one value is present. For the case in my post: `df[1].loc[1].values` will be `[1, 3, 5]` since it's a Series object and you want only the values, and `df[1].loc[2]` would be just `2` since only one value is present. But this doesn't mean it is the optimal solution; it all depends on the operations you want to perform. – Viktor Kerkez Aug 09 '13 at 18:21
  • Pandas has a really optimised fancy indexing, so you can do a lot of things, but which way is the best depends on what you try to do... – Viktor Kerkez Aug 09 '13 at 18:23
  • Great. In my case the only query operation I ever want to do is df[1].loc[key] it seems. Will the list always be returned in the order it was in the original input file? – Simd Aug 09 '13 at 18:29
  • Yes it will be in the order it's in the file. – Viktor Kerkez Aug 09 '13 at 18:42
  • print df[1].loc[1].values using your input file gives me `KeyError: u'no item named 1'` – Simd Aug 09 '13 at 21:05
  • On what data file? `largefile.txt` created in your original post? If you are testing it with the `1 2 3 4 5` example from my post you have to add a `sep=' '` parameter to the `read_table` function because the separator is not `\t` as in `largefile.txt`. – Viktor Kerkez Aug 09 '13 at 21:35
  • Oh I see that you changed the example file. I added an example for the new file. – Viktor Kerkez Aug 10 '13 at 01:05
4

Here are a few quick performance improvements I managed to get:

Using a plain dict instead of defaultdict, and changing d[parts[0]].append(parts[1]) to d[parts[0]] = d.get(parts[0], []) + [parts[1]], cut the time by 10%. I don't know whether it's eliminating all those calls to a Python __missing__ function, not mutating the lists in-place, or something else that deserves the credit.

Just using setdefault on a plain dict instead of defaultdict also cuts the time by 8%, which implies that it's the extra dict work rather than the in-place appends.

Meanwhile, replacing the split() with split(None, 1) helps by 9%.

Running in PyPy 1.9.0 instead of CPython 2.7.2 cut the time by 52%; PyPy 2.0b by 55%.

If you can't use PyPy, CPython 3.3.0 cut the time by 9%.

Running in 32-bit mode instead of 64-bit increased the time by 170%, which implies that if you're using 32-bit you may want to switch.


The fact that the dict takes over 2GB to store (slightly less in 32-bit) is probably a big part of the problem. The only real alternative is to store everything on disk. (In a real-life app you'd probably want to manage an in-memory cache, but here, you're just generating the data and quitting, which makes things simpler.) Whether this helps depends on a number of factors. I suspect that on a system with an SSD and not much RAM it'll speed things up, while on a system with a 5400rpm hard drive and 16GB of RAM (like the laptop I'm using at the moment) it won't… But depending on your system's disk cache, etc., who knows, without testing.

There's no quick&dirty way to store lists of strings in disk-based storage (shelve will presumably waste more time with the pickling and unpickling than it saves), but changing it to just concatenate strings instead and using gdbm kept the memory usage down below 200MB and finished in about the same time, and has the nice side effect that if you want to use the data more than once, you've got them stored persistently. Unfortunately, plain old dbm wouldn't work because the default page size is too small for this many entries, and the Python interface doesn't provide any way to override the default.

Switching to a simple sqlite3 database that just has non-unique Key and Value columns and doing it in :memory: took about 80% longer, while on disk it took 85% longer. I suspect that denormalizing things to store multiple values with each key wouldn't help, and would in fact make things worse. (Still, for many real life uses, this may be a better solution.)
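For reference, a minimal sketch of what that sqlite3 variant might look like (the table layout, column names, and index are my own guesses; the answer doesn't show its schema):

import sqlite3
import sys

# Non-unique Key/Value table, either in :memory: or on disk.
conn = sqlite3.connect(':memory:')          # use a file path for the on-disk test
conn.execute('CREATE TABLE kv (key TEXT, value TEXT)')
conn.execute('CREATE INDEX kv_key ON kv (key)')   # up front for brevity; building it after the inserts would be faster

with open(sys.argv[1]) as fin:
    conn.executemany('INSERT INTO kv VALUES (?, ?)',
                     (line.split() for line in fin))
conn.commit()

# Querying one key returns its list of values:
values = [v for (v,) in conn.execute('SELECT value FROM kv WHERE key = ?', ('1',))]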


Meanwhile, wrapping cProfile around your main loop:
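(A sketch of one way to do that wrapping, since the harness itself isn't shown here: a cProfile.Profile object can be enabled and disabled around just the loop, so import and setup time stay out of the numbers.)

import cProfile
import pstats
import sys
from collections import defaultdict

d = defaultdict(list)
prof = cProfile.Profile()

with open(sys.argv[1]) as fin:
    prof.enable()                     # profile only the parsing loop
    for line in fin:
        parts = line.split()
        d[parts[0]].append(parts[1])
    prof.disable()

pstats.Stats(prof).sort_stats('name').print_stats()

The resulting profile of the loop: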

         40000002 function calls in 31.222 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   21.770   21.770   31.222   31.222 <string>:2(<module>)
 20000000    2.373    0.000    2.373    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 20000000    7.079    0.000    7.079    0.000 {method 'split' of 'str' objects}

So, that's one third of your time spent in string.split, 10% spent in append, and the rest spent in code that cProfile couldn't see, which here includes both iterating the file and the defaultdict method calls.

Switching to a regular dict with setdefault (which, remember, was a little faster) shows 3.774 seconds spent in setdefault, so that's about 15% of the time, or presumably around 20% for the defaultdict version. Presumably the __setitem__ method isn't going to be any worse than setdefault or defaultdict.__getitem__ were.

However, we may not be seeing the time charged by malloc calls here, and they may be a huge chunk of the performance. To test that, you'll need a C-level profiler. So let's come back to that.

Meanwhile, at least some of the leftover time is probably taken up by the line-splitting as well, since that must cost on the same order as space-splitting, right? But I don't know of any way to improve that significantly.


Finally, a C-level profiler is going to help here, but one run on my system may not help much for your system, so I'll leave that to you.


The fastest version on my system depends on which Python I run, but it's either this:

d = {}    
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], []) + [parts[1]]

Or this:

d = {}    
for line in fin:
    parts = line.split(None, 1)
    d.setdefault(parts[0], []).append(parts[1])

… And they're both pretty close to each other.

The gdbm solution, which was about the same speed, and has obvious advantages and disadvantages, looks like this:

import gdbm

d = gdbm.open(sys.argv[1] + '.db', 'c')
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], '') + ',' + parts[1]

(Obviously if you want to be able to run this repeatedly, you will need to add a line to delete any pre-existing database—or, better, if it fits your use case, to check its timestamp against the input file's and skip the whole loop if it's already up-to-date.)
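A sketch of that guard, for what it's worth (the `.db` path convention just mirrors the snippet above):

import os
import sys

db_path = sys.argv[1] + '.db'
if os.path.exists(db_path):
    if os.path.getmtime(db_path) >= os.path.getmtime(sys.argv[1]):
        sys.exit(0)        # database already up to date; skip the whole rebuild
    os.remove(db_path)     # stale database: delete it and rebuild as above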

abarnert
  • Thank you but those don't work well for me. (I am on 32 bit linux which may explain it.) Using d[parts[0]] = d.get(parts[0], []) + [parts[1]] I get time ./read.py test2.data real 0m55.475s – Simd Aug 06 '13 at 18:01
  • time pypy ./read.py test2.data real 0m39.993s (using 2.0.2). Using python 3.2 I get time ./read.py test2.data MemoryError. It must use more memory for some reason. – Simd Aug 06 '13 at 18:03
  • Wow, where'd you even find a 32-bit system to run in 2013? :) Anyway, you're pushing the limits of what you can store in a 32-bit VM space, so even if you _don't_ get a performance boost, you probably want to consider some form of on-disk storage just so you can handle slightly larger files without crashing. – abarnert Aug 06 '13 at 18:08
  • Thanks. Would you mind pasting the full code for your fastest version please? I am not sure how you are using setdefault exactly for example. – Simd Aug 06 '13 at 18:45
  • Unfortunately both your fastest solutions take essentially exactly the same amount of time on my system as the version I gave initially using python 2.7.3. The gdbm solution gives me the error d[parts[0]] = d.get(parts[0], '') + ',' + parts[1] AttributeError: get – Simd Aug 06 '13 at 19:03
3

Here's a quick C version for anyone who is interested. Headline times on my machine:

Python (>5Gb memory)

time ./read.py  largefile.txt

real    0m48.711s
user    0m46.911s
sys     0m1.783s

C (~1.9Gb memory)

gcc -O3 read.c -o read
time ./read largefile.txt

real    0m6.250s
user    0m5.676s
sys     0m0.573s

So about 7.8x quicker in C. :)

And I should add that my version of seq would not create a usable list without changing the command to:

paste <(seq -f "%.0f" 20000000) <(seq -f "%.0f" 2 20000001)  > largefile.txt

Code below; credit must go to Vijay Mathew, who copied the dict example from Section 6.6 of The C Programming Language into his example (and which I copied into my answer below): Quick Way to Implement Dictionary in C

====== Edit ====== (13/08/2013)

Following on from comment #2 on my answer, I've updated the code to that in code listing 2 to allow multiple values for a single key, and also started using the updated perl code to generate the test file (which is half the size, hence the approximately halved execution times).

Headline times are:

Python (>5Gb memory)

time ./read.py  largefile.txt

real    0m25.925s
user    0m25.228s
sys     0m0.688s

C (~1.9Gb memory)

gcc -O3 read.c -o read
time ./read largefile.txt

real    0m3.497s (although sub 3 seconds is possible by reducing the hash size?!?!?)
user    0m3.183s
sys     0m0.315s

So about 7.4x quicker in C, although pandas is probably close.

However, the point about the hash size is important. I could 'cheat' by reducing the hash size down to a very small number, which for a multivalued dictionary would increase insert speed at the expense of lookup. Therefore, to really test either of these implementations you also need to test the lookup speed.
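For the Python side, a rough sketch of how that lookup speed could be measured once the dict is built (the one-million probe count is an arbitrary choice of mine, not something from the answer):

import random
import sys
import time
from collections import defaultdict

d = defaultdict(list)
with open(sys.argv[1]) as fin:
    for line in fin:
        key, value = line.split(None, 1)
        d[key].append(value)

keys = list(d)
probes = [random.choice(keys) for _ in range(1000000)]   # 1M random lookups

start = time.time()
total = sum(len(d[k]) for k in probes)
print('1M lookups: %.2f seconds, %d values returned' % (time.time() - start, total))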

Code 2 (multivalue dictionary)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct nlist { /* table entry: */
    struct nlist *next; /* next entry in chain */
    char *name; /* defined name */
    char *defn; /* replacement text */
};

#define HASHSIZE 10000001
static struct nlist *hashtab[HASHSIZE]; /* pointer table */

/* hash: form hash value for string s */
unsigned hash(char *s)
{
    unsigned hashval;
    for (hashval = 0; *s != '\0'; s++)
      hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

/* lookup: look for s in hashtab */
struct nlist *lookup(char *s)
{
    struct nlist *np;
    for (np = hashtab[hash(s)]; np != NULL; np = np->next)
        if (strcmp(s, np->name) == 0)
          return np; /* found */
    return NULL; /* not found */
}

struct nlist * lookup_all(char *key)
{
    struct nlist *np, *np2, *ret;
    unsigned hashval = hash(key);

    ret = NULL;

    for (np = hashtab[hashval]; np != NULL; np = np->next) {
      if (strcmp(key, np->name) == 0) {
        np2 = malloc(sizeof(*np2));
        np2->name = np->name;
        np2->defn = np->defn;
        np2->next = ret;
        ret = np2;
      }
    }
    return ret; /* list of matching entries, NULL if none found */
}

/* install: put (name, defn) in hashtab */
struct nlist *install(char *name, char *defn)
{
    struct nlist *np, *np2;
    unsigned hashval = hash(name);
    //if ((np = lookup(name, hashval)) == NULL) { /* not found */

    np = (struct nlist *) malloc(sizeof(*np));
    if (np == NULL || (np->name = strdup(name)) == NULL)
      return NULL;
    np->next = hashtab[hashval];
    hashtab[hashval] = np;

    if ((np->defn = strdup(defn)) == NULL)
       return NULL;
    return np;
}
#ifdef STRDUP
char *strdup(char *s) /* make a duplicate of s */
{
    char *p;
    p = (char *) malloc(strlen(s)+1); /* +1 for '\0' */
    if (p != NULL)
       strcpy(p, s);
    return p;
}
#endif /* STRDUP */

int main(int argc, char *argv[]) {

    FILE *fp;
    char str1[20];
    char str2[20];
    int size = 0;
    struct nlist *result;

    fp = fopen(argv[1],"r");
    if(fp==NULL) {return 1;}

    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    rewind(fp);

    while(size != ftell(fp)) {
        if(0==fscanf(fp, "%s %s",str1,str2))
            break;
        (void)install(str1,str2);
    }
    printf("Done\n");
    fclose(fp);

    // Test a lookup to see if we get multiple items back.    
    result = lookup_all("1");
    while (result) {
        printf("Key = %s Value = %s\n",result->name,result->defn);
        result = result->next;
    }

    return 0;
}

Code 1 (single value dictionary)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct nlist { /* table entry: */
    struct nlist *next; /* next entry in chain */
    char *name; /* defined name */
    char *defn; /* replacement text */
};

#define HASHSIZE 10000001
static struct nlist *hashtab[HASHSIZE]; /* pointer table */

/* hash: form hash value for string s */
unsigned hash(char *s)
{
    unsigned hashval;
    for (hashval = 0; *s != '\0'; s++)
      hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

/* lookup: look for s in hashtab */
struct nlist *lookup(char *s)
{
    struct nlist *np;
    for (np = hashtab[hash(s)]; np != NULL; np = np->next)
        if (strcmp(s, np->name) == 0)
          return np; /* found */
    return NULL; /* not found */
}

/* install: put (name, defn) in hashtab */
struct nlist *install(char *name, char *defn)
{
    struct nlist *np;
    unsigned hashval;
    if ((np = lookup(name)) == NULL) { /* not found */
        np = (struct nlist *) malloc(sizeof(*np));
        if (np == NULL || (np->name = strdup(name)) == NULL)
          return NULL;
        hashval = hash(name);
        np->next = hashtab[hashval];
        hashtab[hashval] = np;
    } else /* already there */
        free((void *) np->defn); /*free previous defn */
    if ((np->defn = strdup(defn)) == NULL)
       return NULL;
    return np;
}
#ifdef STRDUP
char *strdup(char *s) /* make a duplicate of s */
{
    char *p;
    p = (char *) malloc(strlen(s)+1); /* +1 for '\0' */
    if (p != NULL)
       strcpy(p, s);
    return p;
}
#endif /* STRDUP */

int main(int argc, char *argv[]) {

    FILE *fp;
    char str1[20];
    char str2[20];
    int size = 0;
    int progress = 0;

    fp = fopen(argv[1],"r");
    if(fp==NULL) {return 1;}

    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    rewind(fp);

    while(size != ftell(fp)) {
        if(0==fscanf(fp, "%s %s",str1,str2))
            break;
        //printf(">%s<>%s<\n",str1,str2);
        (void)install(str1,str2);
        ++progress;
        if((progress % 100000)==0)
            printf("Progress = %d\n",progress);
    }
    printf("Done\n");
    fclose(fp);

    return 0;
}
jmc
  • Amazing! real 0m9.365s on my system so about 7 times faster according to the test I just did (both C and python are a little faster the second time you run them. This was the first.) Now I am wondering if this can be made a C extension so I can use it in python. – Simd Aug 09 '13 at 17:08
  • One thing is that this code doesn't seem to give you a hash of arrays of strings. My fake data wasn't very good. I have added a better example. I would like to be able to query "1" and find all the values associated with it as I commented in Viktor Kerkez's answer. – Simd Aug 09 '13 at 18:41
  • Ok, see edits above. I do really think that the lookup speed is probably rather important too... you can gear the insert speed at the expense of the lookup (although I've intentionally not done this in my timings). – jmc Aug 13 '13 at 14:17
0

You can still add an extra optimization on top of the others:

Since your keys are strings of "nearly" consecutive integers, you can speed up the creation of the dict by inserting the elements into the dict in order. This reduces dict collisions. See the comments on the Python dict implementation:

Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.

So if you can pre-process the file to sort it, the Python execution can be much faster.
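A sketch of what that pre-processing could look like, assuming an external sort is acceptable (the sort command and the loading loop are my own illustration, not part of the original suggestion):

# Sort the file numerically by key first, e.g.:
#   sort -n -k1,1 largefile.txt > largefile.sorted.txt
# then build the dict from the already-ordered keys.
import sys
from collections import defaultdict

d = defaultdict(list)
with open(sys.argv[1]) as fin:          # pass the pre-sorted file here
    for line in fin:
        key, value = line.split(None, 1)
        d[key].append(value)            # consecutive keys hash to nearby slots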

barracel