0

The large file is 12 million lines of text such as this:

81.70,  89.86,  717.985
81.74,  89.86,  717.995
81.78,  89.86,  718.004
81.82,  89.86,  718.014
81.86,  89.86,  718.024
81.90,  89.86,  718.034

This is latitude, longitude, and distance from the nearest coastline (respectively).

My code takes the coordinates of a known place (for example, Mexico City: -99.1, 19.4) and searches the large file, line by line, to output that coordinate's distance from the nearest coastline.

I put each line into a list because many lines meet the long/lat criteria. I later average the distances from the coastline.

Each coordinate takes about 12 seconds to retrieve. My entire script takes 14 minutes to complete.

Here's what I have been using:

long = -99.1
lat = 19.4
country_d2s = []

# outputs all list items with specified long and lat values
with open(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O", 'r') as dist2sea:
    for line in dist2sea:
        if long in line and lat in line and line.startswith(long):
            country_d2s.append(line)

I am looking for a way to search through the file much quicker and/or rewrite the file to make it easier to work with.

  • Your script just _cannot_ work, because `line` is a string, and lat & long are floats. Convert to a list of floats first, then test. – Jean-François Fabre Jul 19 '19 at 20:30
  • Do you have the option of splitting the file into smaller files with meaningful names? i.e. you could have a file named `81.70` that contains all coordinates with that latitude, or perhaps a file named `81` that contains all `81.*` latitudes. – John Gordon Jul 19 '19 at 20:36
  • @Jean-FrançoisFabre I imagine this is a simplified version of his real code. – John Gordon Jul 19 '19 at 20:38
  • To get an optimized solution, some additional questions should be answered: The coordinates look ordered, are they? The coordinates look sampled at a constant step (0.04), are they? Are all the look-up long and lat values exactly contained in the file (or is some kind of interpolation needed)? By the way, your matching criterion is not very precise; for example it would match `-99.1, 89.86, 719.424` because `19.4` is a substring of the distance. – a_guest Jul 19 '19 at 21:47
  • Does the order of the data in the file matter? Do you use it for anything else? – martineau Jul 19 '19 at 22:52

5 Answers

3

Use a database keyed by latitude and longitude. If you're looking for a lightweight DB that can be shared as a file, there's SqliteDict or bsddb3. That would be much faster than reading a text file each time the program is run.
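
For example, a minimal sketch of that idea with SqliteDict (this assumes the first two comma-separated columns are the coordinates you query by; the key format, rounding, and output filename are placeholders):

# One-time build: collapse the 12M-line text file into an on-disk key/value store
# whose keys are coordinate strings rounded to the precision used when searching.
from collections import defaultdict
from sqlitedict import SqliteDict

totals = defaultdict(lambda: [0.0, 0])        # key -> [sum of distances, count]
with open(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O") as dist2sea:
    for line in dist2sea:
        a, b, dist = (field.strip() for field in line.split(","))
        key = f"{float(a):.1f},{float(b):.1f}"
        totals[key][0] += float(dist)
        totals[key][1] += 1

with SqliteDict("dist2sea.sqlite") as db:
    for key, (total, count) in totals.items():
        db[key] = total / count              # store the pre-averaged distance
    db.commit()

# Every later lookup is a single key access instead of a 12-million-line scan:
with SqliteDict("dist2sea.sqlite") as db:
    print(db["-99.1,19.4"])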

IronMan
2

Import your data into an SQLite database, then create an index on (latitude, longitude). An index lookup should take milliseconds. To read the data, use Python's sqlite3 module.
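
Roughly, with the built-in sqlite3 module (a sketch that assumes the "latitude, longitude, distance" column order from the question; the table and file names are invented):

import sqlite3

conn = sqlite3.connect("dist2sea.db")
conn.execute("CREATE TABLE IF NOT EXISTS coast (lat REAL, lon REAL, dist REAL)")

def rows(path):
    # Parse the big text file once, yielding (lat, lon, dist) tuples.
    with open(path) as f:
        for line in f:
            yield tuple(float(x) for x in line.split(","))

conn.executemany("INSERT INTO coast VALUES (?, ?, ?)",
                 rows(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O"))
conn.execute("CREATE INDEX IF NOT EXISTS ix_coast ON coast (lat, lon)")
conn.commit()

# Indexed range query replacing the 12-second scan, e.g. everything the
# question's prefix match for lat 19.4 / lon -99.1 would have picked up:
cur = conn.execute(
    "SELECT AVG(dist) FROM coast WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
    (19.40, 19.49, -99.19, -99.10),
)
print(cur.fetchone()[0])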

mvp
1

Comments:

  • It's unclear whether you are relying on the fact that your long/lat values are given as XX.Y while the file stores XX.YY as some kind of fuzzy matching technique.
  • I also cannot tell how you plan to execute this: load + [run] x 1000 vs. [load + run] x 1000, which would inform which solution you want to use.

That being said, if you want to do very fast exact lookups, one option is to load the entire thing into memory as a mapping, e.g. {(long, lat): coast_distance, ...}. Since floats are not good dictionary keys, it would be better to use strings, integers, or fractions for this.
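
As a rough sketch of that mapping approach (assuming you look coordinates up at the same precision they appear in the file; the path is taken from the question):

from collections import defaultdict

# Load once (slow), after which every lookup is a dictionary access.
distances = defaultdict(list)   # (coord1, coord2) as strings -> list of distances
with open(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O") as f:
    for line in f:
        a, b, d = (x.strip() for x in line.split(","))
        distances[(a, b)].append(float(d))   # string keys avoid float-equality issues

vals = distances[("81.70", "89.86")]
print(sum(vals) / len(vals))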

If you want to do fuzzy (nearest-neighbor) matching, there are data structures (such as k-d trees) and a number of packages that would solve that issue.

If you want the initial load time to be faster, you can do things like writing a binary pickle and loading that directly instead of parsing the file each time. A database is also a simple solution to this.
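
For instance, a hypothetical caching step on top of the mapping above (build_mapping() stands in for whatever parsing loop you use, and the cache filename is made up):

import os
import pickle

CACHE = "dist2sea.pickle"

def load_mapping():
    # Reuse the already-parsed mapping if it exists; otherwise parse once and cache it.
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    mapping = build_mapping()   # hypothetical: the text-parsing loop shown above
    with open(CACHE, "wb") as f:
        pickle.dump(mapping, f, protocol=pickle.HIGHEST_PROTOCOL)
    return mapping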

Cireo
0

You could partition the file into 10-by-10-degree patches. This would reduce the search space by a factor of 648, yielding 648 files of roughly 18,500 lines each and bringing the search time down to about 0.02 seconds.
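
A one-time split along those lines might look like this (a sketch assuming the "latitude, longitude, distance" order from the question; the file naming is arbitrary, and all lines are held in memory during the split):

import math
from collections import defaultdict

# Bucket every line by the lower-left corner of its 10x10 degree patch,
# then write one small file per patch.
buckets = defaultdict(list)
with open(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O") as big:
    for line in big:
        lat, lon, _ = (float(x) for x in line.split(","))
        buckets[(math.floor(lat / 10) * 10, math.floor(lon / 10) * 10)].append(line)

for (lat0, lon0), lines in buckets.items():
    with open(f"patch_{lat0}_{lon0}.txt", "w") as out:
        out.writelines(lines)

# A query for lat 19.4, lon -99.1 then only scans patch_10_-100.txt (~18,500 lines).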

As you are doing exact matches on lat/long, you could instead use any on-disk key-value store; Python has at least one built in. If you were doing nearest-neighbor or metric-space searches, there are spatial databases that support those.
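
With the built-in shelve module, for example, that could look like this (a sketch; the keys here are the exact "lat,lon" text from the file and the store name is invented):

import shelve
from collections import defaultdict

# One-time build of an on-disk key/value store keyed by the exact coordinate text.
grouped = defaultdict(list)
with open(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O") as f:
    for line in f:
        lat, lon, dist = (x.strip() for x in line.split(","))
        grouped[f"{lat},{lon}"].append(float(dist))

with shelve.open("dist2sea_shelf") as db:
    db.update(grouped)

# Exact-match lookup later:
with shelve.open("dist2sea_shelf", flag="r") as db:
    print(db["81.70,89.86"])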

Dan D.
0

If you are using Python, I recommend PySpark. In this particular case you can use the mapPartitions function and join the results. This may help: How does the pyspark mapPartitions function work?

PySpark is useful when working with giant amounts of data because it splits the work into N partitions and uses your processor's full power.
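
A speculative sketch of the mapPartitions idea (it mirrors the question's prefix-matching criterion against the first two columns; the path and the column order are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dist2sea").getOrCreate()
lines = spark.sparkContext.textFile(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O")

def matching_distances(partition, lon="-99.1", lat="19.4"):
    # Runs once per partition, in parallel; yields only the distances that match.
    for line in partition:
        a, b, d = (x.strip() for x in line.split(","))
        if a.startswith(lon) and b.startswith(lat):
            yield float(d)

dists = lines.mapPartitions(matching_distances).collect()
print(sum(dists) / len(dists))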

Hope it helps you.