I have literally millions (no exaggeration) of rows of data like the following:
sample = [
[16777216, 33554431, None, 'AU', False, False, False, False],
[16777216, 16777471, 13335, 'AU', False, True, False, False],
[16777472, 16777727, None, 'CN', False, False, False, False],
[16777728, 16778239, None, 'CN', False, False, False, False],
[16778240, 16779263, 38803, 'AU', False, False, False, False],
[16778496, 16778751, 38803, 'AU', False, False, False, False],
[16779264, 16781311, None, 'CN', False, False, False, False],
[16781312, 16785407, None, 'JP', False, False, False, False],
[16781312, 16781567, 2519, 'JP', False, False, False, False],
[16785408, 16793599, None, 'CN', False, False, False, False],
[16785408, 16785663, 141748, 'CN', False, False, False, False],
[16793600, 16809983, 18144, 'JP', False, False, False, False],
[16809984, 16842751, 23969, 'TH', False, False, False, False],
[16809984, 16826367, 23969, 'TH', False, False, False, False],
[16809984, 16818175, 23969, 'TH', False, False, False, False],
[16809984, 16810239, 23969, 'TH', False, False, False, False],
[16810240, 16810495, 23969, 'TH', False, False, False, False],
[16810496, 16811007, 23969, 'TH', False, False, False, False],
[16811008, 16811263, 23969, 'TH', False, False, False, False],
[16811264, 16811519, 23969, 'TH', False, False, False, False],
[16812032, 16812287, 23969, 'TH', False, False, False, False],
[16812288, 16812543, 23969, 'TH', False, False, False, False],
[16812544, 16812799, 23969, 'TH', False, False, False, False],
[16812800, 16813055, 23969, 'TH', False, False, False, False],
[16813312, 16813567, 23969, 'TH', False, False, False, False],
[16814080, 16818175, 23969, 'TH', False, False, False, False],
[16818176, 16826367, 23969, 'TH', False, False, False, False],
[16818176, 16819199, 23969, 'TH', False, False, False, False],
[16819200, 16819455, 23969, 'TH', False, False, False, False],
[16819456, 16819711, 23969, 'TH', False, False, False, False],
[16819712, 16819967, 23969, 'TH', False, False, False, False],
[16819968, 16820223, 23969, 'TH', False, False, False, False]
]
They represent IPv4 networks, and I have data about IPv6 networks, too.
I got the data from a text file, and I wish to convert the data to an sqlite3 database. In fact I have already done so, but the data updates daily and for reasons I won't go into here I need to frequently update my database too.
As you can see from the sample, only the start and end are unique; there is a large amount of duplication among the groups formed by the other six elements. Storing the rows like this is extremely inefficient. If we store only the unique groups of those elements and replace them in the rows with a reference to the stored group, compressing the data in this way saves a huge amount of storage space (this is true for my data).
One easy way to compress the data is to store the repeated groups in a list in their order of occurrence and replace them in the rows with their index in that list.
The basic idea is simple: start with an empty list; for each row, check whether its data is already in the list; if not, append the data to the list and replace the data in the row with the length of the list prior to appending; otherwise, replace the data with its index in the list.
The problem is that list membership checking uses linear search, which is O(n) and very inefficient for large data, and my data is really huge, so it would take a lot of time. Even binary search would be slow in my case, and as you can see, I can't do binary search here anyway.
But set membership checking is amortized O(1): it is effectively constant time, extremely fast, and it stays fast no matter how many elements the set contains, so that is exactly what I need.
The process would be like this:
groups = []            # unique groups, in order of first appearance
unique_groups = set()  # the same groups again, for O(1) membership tests
compressed = []
for row in sample:
    data = tuple(row[2:])
    if data in unique_groups:
        compressed.append([*row[:2], groups.index(data)])
    else:
        unique_groups.add(data)
        compressed.append([*row[:2], len(groups)])
        groups.append(data)
The problem with this approach is that list.index uses linear search, which is inefficient for my purposes as explained above, and for each unique group I need to store two copies of it (once in the list and once in the set), so it uses twice as much space as needed.
A different approach would be using a dictionary, like this:
groups = {}  # maps each unique group to the index of its first occurrence
compressed = []
for row in sample:
    data = tuple(row[2:])
    compressed.append([*row[:2], groups.setdefault(data, len(groups))])
This would be much faster, but I don't know about its space complexity. I have heard that a Python dict is memory expensive, but with the list + set approach each item is stored twice...
It may be that for a small amount of data the first approach uses less space while for a large amount the second does, and my data is extremely large. I don't know how they will scale, and I haven't tested yet: just loading my data takes about one gibibyte of RAM, and I can't reliably generate test cases with the same duplication level as my data because I never saved the intermediate stage.
So which approach has lower space complexity? Keep in mind that I also want performance, so there is a tradeoff: if the faster method takes about twice as much space but one tenth the execution time, that is acceptable, but if it takes ten times as much space, then even if it is instant it isn't acceptable (I only have 16 GiB of RAM, which is finite).
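For what it's worth, this is roughly how I could compare the actual peak RAM of the two approaches on synthetic rows with tracemalloc (just a sketch; make_rows, build_list_set, build_dict and peak_bytes are stand-ins I made up, and the synthetic duplication level won't match my real data):
import tracemalloc

def make_rows(n, dup=100):
    # stand-in data: every `dup` consecutive rows share one group
    return [[i, i + 255, i // dup, 'AU', False, False, False, False] for i in range(n)]

def build_list_set(rows):
    # the list + set approach from above
    groups, unique_groups, compressed = [], set(), []
    for row in rows:
        data = tuple(row[2:])
        if data in unique_groups:
            compressed.append([*row[:2], groups.index(data)])
        else:
            unique_groups.add(data)
            compressed.append([*row[:2], len(groups)])
            groups.append(data)
    return groups, unique_groups, compressed

def build_dict(rows):
    # the dict approach from above
    groups, compressed = {}, []
    for row in rows:
        data = tuple(row[2:])
        compressed.append([*row[:2], groups.setdefault(data, len(groups))])
    return groups, compressed

def peak_bytes(builder, rows):
    tracemalloc.start()
    result = builder(rows)  # keep the result alive while reading the peak
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    return peak

rows = make_rows(100_000)
print(peak_bytes(build_list_set, rows))
print(peak_bytes(build_dict, rows))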
Edit
All existing answers fail to solve my problem.
Because I need to process the data in Python first. There are a lot of overlapping ranges with different data, overlapping ranges with the same data, and adjacent ranges with the same data. I need to process the ranges so that there are no overlaps and the data always comes from the smaller range (I subtract the smaller range from the larger one), and I merge adjacent ranges with the same data, so that there is no ambiguity and the data is as small as possible.
See this question for more details and code.
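To make the intent concrete, here is a simplified (and quadratic, so not what I would actually run on millions of rows) sketch of that processing: for every elementary interval it picks the data of the smallest covering range, then merges adjacent intervals that ended up with the same data. Running flatten(sample) should reproduce the two tables shown in the rest of this question.
def flatten(rows):
    # deduplicate the groups in order of first appearance
    groups = {}
    ranges = []
    for start, end, *rest in rows:
        ranges.append((start, end, groups.setdefault(tuple(rest), len(groups))))

    # boundaries of the elementary intervals: every start and every end + 1
    points = sorted({s for s, _, _ in ranges} | {e + 1 for _, e, _ in ranges})

    # for each elementary interval, take the group of the smallest covering range
    pieces = []
    for lo, hi in zip(points, points[1:]):
        covering = [(e - s, g) for s, e, g in ranges if s <= lo and hi - 1 <= e]
        if covering:
            pieces.append((lo, hi - 1, min(covering)[1]))

    # merge adjacent pieces that carry the same group
    merged = []
    for start, end, g in pieces:
        if merged and merged[-1][2] == g and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, g)
        else:
            merged.append((start, end, g))
    return merged, list(groups)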
Given the above sample, the result would be:
[(16777216, 16777471, 1),
(16777472, 16778239, 2),
(16778240, 16779263, 3),
(16779264, 16781311, 2),
(16781312, 16781567, 5),
(16781568, 16785407, 4),
(16785408, 16785663, 6),
(16785664, 16793599, 2),
(16793600, 16809983, 7),
(16809984, 16842751, 8),
(16842752, 33554431, 0)]
And no, I am sure as hell I can't use sqlite3 to do that.
I would then store two tables in my sqlite3 database.
One with the above data, but with each row's third column incremented by one (sqlite3 ids are one-based). The other with the data below:
[(None, 'AU', False, False, False, False),
(13335, 'AU', False, True, False, False),
(None, 'CN', False, False, False, False),
(38803, 'AU', False, False, False, False),
(None, 'JP', False, False, False, False),
(2519, 'JP', False, False, False, False),
(141748, 'CN', False, False, False, False),
(18144, 'JP', False, False, False, False),
(23969, 'TH', False, False, False, False)]
In the first table, the third column of each row is the id of the actual data in the second table.
This is to achieve maximum storage efficiency; it can then be complemented further by sqlite3 compression.
But I need the data as such. And sqlite3 can't do that alone.
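For reference, the schema I have in mind looks roughly like this (just a sketch; the table and column names are placeholders I made up, and the four flag columns stand for the four booleans):
import sqlite3

def save(networks, groups, path="networks.db"):
    # `networks`: the (start, end, group_index) tuples shown above
    # `groups`: the unique (asn, country, flag, flag, flag, flag) tuples shown above
    con = sqlite3.connect(path)
    con.executescript("""
        DROP TABLE IF EXISTS networks;
        DROP TABLE IF EXISTS asn;
        CREATE TABLE asn (
            id INTEGER PRIMARY KEY,      -- one-based, referenced by networks.asn_id
            number INTEGER,              -- NULL where the group has no ASN
            country TEXT,
            flag1 INTEGER, flag2 INTEGER, flag3 INTEGER, flag4 INTEGER
        );
        CREATE TABLE networks (
            start_ip INTEGER,
            end_ip INTEGER,
            asn_id INTEGER REFERENCES asn(id)
        );
    """)
    con.executemany(
        "INSERT INTO asn (number, country, flag1, flag2, flag3, flag4) VALUES (?, ?, ?, ?, ?, ?)",
        groups)
    con.executemany(
        "INSERT INTO networks (start_ip, end_ip, asn_id) VALUES (?, ?, ?)",
        [(start, end, index + 1) for start, end, index in networks])
    con.commit()
    con.close()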
And I don't actually want to use sqlite3 or a Pandas DataFrame and such; I only store the data in an sqlite3 database because I need to persist it between sessions, and any other serialization format such as JSON or CSV would be inefficient for this. Pickle can be very efficient, but it isn't secure and isn't compatible across versions.
And no, I have benchmarked the performance of sqlite3 queries on a disk-based database: a single query to retrieve one item by id takes milliseconds to complete because of I/O delay, which is unacceptable. I will load all data from the database at the start of the program, into Python lists, because a DataFrame is also slow on my machine according to my benchmarks.
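Loading everything back at startup would then look roughly like this (again a sketch with the placeholder names from the schema above; the stored booleans come back as 0/1 integers):
import sqlite3

def load(path="networks.db"):
    con = sqlite3.connect(path)
    # read the groups ordered by id so the one-based ids map to zero-based
    # list positions (id 1 -> index 0, and so on)
    asn = [tuple(row) for row in con.execute(
        "SELECT number, country, flag1, flag2, flag3, flag4 FROM asn ORDER BY id")]
    networks = [(start, end, asn_id - 1) for start, end, asn_id in con.execute(
        "SELECT start_ip, end_ip, asn_id FROM networks ORDER BY start_ip")]
    con.close()
    return networks, asn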
My goal is to find the network of any given IP address. To do this, I will do a binary search on the starts and then check that the IP is less than or equal to the corresponding end, like the following:
from bisect import bisect
ASN = [
(None, 'AU', False, False, False, False),
(13335, 'AU', False, True, False, False),
(None, 'CN', False, False, False, False),
(38803, 'AU', False, False, False, False),
(None, 'JP', False, False, False, False),
(2519, 'JP', False, False, False, False),
(141748, 'CN', False, False, False, False),
(18144, 'JP', False, False, False, False),
(23969, 'TH', False, False, False, False)
]
NETWORKS = [
(16777216, 16777471, 1),
(16777472, 16778239, 2),
(16778240, 16779263, 3),
(16779264, 16781311, 2),
(16781312, 16781567, 5),
(16781568, 16785407, 4),
(16785408, 16785663, 6),
(16785664, 16793599, 2),
(16793600, 16809983, 7),
(16809984, 16842751, 8),
(16842752, 33554431, 0)
]
STARTS, ENDS, ASNS = zip(*NETWORKS)
def get_network(ip):
    index = bisect(STARTS, ip) - 1
    # the IP belongs to a network only if it is not below the first start
    # and does not exceed the end of the candidate range found by bisect
    if index >= 0 and ip <= (end := ENDS[index]):
        # ASNS[index] is the id of the group in the ASN table
        return [STARTS[index], end, *ASN[ASNS[index]]]
    return None
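For example, with the data above:
print(get_network(16777300))  # [16777216, 16777471, 13335, 'AU', False, True, False, False]
print(get_network(1))         # None, the address is below the first start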
And if you want more details see this.
Update
I have benchmarked the size of the objects, and it seems that for arbitrarily generated data the dict has smaller space complexity, and the dict has better time complexity, too. One important observation, however, is that the benchmark reported the memory usage of the objects in the hundreds-of-mebibytes range, yet I didn't actually observe the interpreter using that much more RAM.
from itertools import product
from collections.abc import Mapping, Sequence, Set

def get_size(obj: object) -> int:
    # recursively sum the sizes of containers and their elements
    size = obj.__sizeof__()
    if isinstance(obj, Mapping):
        size += sum(get_size(k) + get_size(v) for k, v in obj.items())
    elif isinstance(obj, Sequence | Set) and not isinstance(obj, str):
        size += sum(get_size(e) for e in obj)
    return size

def generate(a, b, mode):
    # synthetic groups: (i, j, four booleans as 0/1)
    gen = ((i, j, *k) for i, j in product(range(a), range(b)) for k in product((0, 1), repeat=4))
    if mode:
        return {k: i for i, k in enumerate(gen)}
    l = list(gen)
    return l, set(l)

print(get_size(generate(16, 16, 0)))
print(get_size(generate(16, 16, 1)))
print(get_size(generate(256, 256, 0)))
print(get_size(generate(256, 256, 1)))
The printed sizes:
2060792
1210444
528477112
314540108
And the setup for the timings below:
ls = list(range(32768))
d = dict.fromkeys(range(32768))
In [348]: %timeit ls.index(128)
1.67 µs ± 20.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [349]: %timeit ls.index(4096)
52.2 µs ± 810 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [350]: %timeit d[128]
48.8 ns ± 0.758 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [351]: %timeit d[4096]
61 ns ± 0.937 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
Overall, the dict approach is clearly better than the list + set one. But can we do better?