
I would like to filter records from a big file (10M+ lines, each line a whitespace-separated list of fields) based on given ids.

selected_id = list()            # 70k+ elements

for line in in_fp:              # input file: 10M+ lines
    id = line.split()[0]        # id (str type), such as '10000872820081804' 

    if id in selected_id:
        out_fp.write(line)

The above code is time-consuming. One idea comes to mind: store `selected_id` as a dict instead of a list.
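Something like this, perhaps (a rough sketch; `id_list` stands for my original 70k+ ids):

    selected_id = dict.fromkeys(id_list, True)   # map every id to a dummy value
    # `id in selected_id` is now an O(1) hash lookup instead of an O(n) scan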

Any better solutions?

SparkAndShine
  • You aren't storing a separate value for each id, so a `dict` seems like overkill. Maybe a `set`? – John Gordon Mar 01 '16 at 19:37
  • @JohnGordon because `set` is unhashable. Map `selected_id` into arbitrary values, e.g., `True`. – SparkAndShine Mar 01 '16 at 19:39
  • Are you sure it's unhashable? http://stackoverflow.com/questions/3949310/how-is-set-implemented – John Gordon Mar 01 '16 at 19:42
  • @JohnGordon thx for this update. `CPython's sets are implemented as something like dictionaries with dummy values` – SparkAndShine Mar 01 '16 at 19:46
  • @SparkandShine: That hasn't been the case for a while; [they moved to a `set` implementation with no dummies for memory savings in CPython 2.5](https://docs.python.org/2/whatsnew/2.5.html#optimizations) (and the builtin `set` was only introduced in 2.4, so we're not talking a long time with memory-wasting `set`s). – ShadowRanger Mar 01 '16 at 20:01

2 Answers


First off, in order to get the first column from your lines, you can read your file using the csv module with the proper delimiter, then use the zip() function (in Python 2, itertools.izip()) together with next() to extract the first column, and pass the result to set() to keep only the unique values.

import csv

with open('file_name') as f:
    spam_reader = csv.reader(f, delimiter=' ')
    unique_ids = set(next(zip(*spam_reader)))

If you want to preserve the order you can use collections.OrderedDict():

import csv
from collections import OrderedDict
with open('file_name') as f:
    spam_reader = csv.reader(f, delimiter=' ')
    unique_ids = OrderedDict.fromkeys(next(zip(*spam_reader)))
Mazdak
  • Using `zip` to accomplish this means you must hold the whole parsed input file in memory, and you must parse the whole thing before you produce a single output. Using `zip` with `*` unpacking to do a transpose operation is clever, but it's a really bad idea when the input is huge because it forces a complete slurp. – ShadowRanger Mar 01 '16 at 19:59
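In the same spirit as that comment, a more memory-friendly variant (a sketch, not from the original answer) would build the set one row at a time with a set comprehension instead of transposing:

    import csv

    with open('file_name') as f:
        spam_reader = csv.reader(f, delimiter=' ')
        # Rows are consumed lazily; only the first-column values are kept.
        unique_ids = {row[0] for row in spam_reader}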

You've got a few issues, though only the first is really nasty:

  1. (By far the biggest cost in all likelihood) Checking for membership in a `list` is O(n); for a 70K element list, that's a lot of work. Make it a `set`/`frozenset` and lookups are typically O(1), saving thousands of comparisons. If the elements are unhashable, you can pre-sort `selected_id` and use the `bisect` module to do lookups in O(log n) time (see the sketch after this list), which would still be a multiple-orders-of-magnitude speedup for such a large list.
  2. If your lines are large, with several runs of whitespace, splitting at every whitespace run wastes time; you can pass `maxsplit` so the line is only split far enough to get the ID.
  3. If the IDs are always integer values, it may be worth making `selected_id` store `int` instead of `str` and converting on read, so the lookup comparisons run a little faster (this would require testing). This probably won't make a major difference, so I'll omit it from the main example.
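For the unhashable fallback in point 1, a membership test built on `bisect` might look like this (a minimal sketch; `selected_sorted` is a hypothetical pre-sorted copy of the 70k+ element list):

    import bisect

    selected_sorted = sorted(selected_id)   # one-time O(n log n) sort

    def contains(sorted_seq, value):
        """O(log n) membership test via binary search."""
        i = bisect.bisect_left(sorted_seq, value)
        return i < len(sorted_seq) and sorted_seq[i] == value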

Combining all suggestions:

selected_id = frozenset(... Your original list of 70k+ str elements ...)

for line in in_fp:               # input file: 10M+ lines
    id, _ = line.split(None, 1)  # id (str type), such as '10000872820081804' 

    if id in selected_id:
        out_fp.write(line)

You could even convert the for loop to a single call with a generator expression (though it gets a little overly compact), which pushes more work to the C layer in CPython, reducing Python bytecode execution overhead:

out_fp.writelines(x for x in in_fp if x.split(None, 1)[0] in selected_id)
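For completeness, a sketch of what point 3 might look like (assuming every id parses cleanly as an `int`; `str_ids` is a hypothetical name for the original list of id strings, and whether this actually wins would require the testing mentioned above):

    selected_id = frozenset(map(int, str_ids))   # store ints, not strs

    for line in in_fp:
        id_field, _ = line.split(None, 1)
        if int(id_field) in selected_id:         # convert once per line, compare as ints
            out_fp.write(line)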
ShadowRanger
  • thx. is the performance the same between `for ... in ...` and `for ... not in ...`? – SparkAndShine Mar 01 '16 at 20:07
  • @SparkandShine: You don't use `not in` with `for`, unless you refer to the `if` test following a `for`. If you meant is `if x in y` as fast as `if x not in y`, it's effectively the same (both check `if x in y`, Python just flips the result for you). Technically, a successful lookup usually involves fewer checks than a failed lookup (on average, a success is found w/half as many rounds of collision resolution), but we're talking about a tiny number of chaining operations in C. Virtually all name resolution in Python involves `dict` lookups of similar cost, so the overhead is lost in the noise. – ShadowRanger Mar 01 '16 at 20:17
  • sorry for my clerical error. it is `if ... (not) in ...`. – SparkAndShine Mar 01 '16 at 20:24
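If you want to check the "effectively the same" claim from the comment above empirically, a quick `timeit` sketch (hypothetical setup; absolute numbers will vary by machine) is:

    import timeit

    s = frozenset(map(str, range(70000)))   # stand-in for the 70k+ id set
    # Both forms perform the same hash lookup; `not in` merely negates the result.
    print(timeit.timeit("'12345' in s", globals=globals(), number=10**6))
    print(timeit.timeit("'12345' not in s", globals=globals(), number=10**6))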