
I have literally millions (no exaggeration) of rows of data like the following:

sample = [
    [16777216, 33554431, None, 'AU', False, False, False, False],
    [16777216, 16777471, 13335, 'AU', False, True, False, False],
    [16777472, 16777727, None, 'CN', False, False, False, False],
    [16777728, 16778239, None, 'CN', False, False, False, False],
    [16778240, 16779263, 38803, 'AU', False, False, False, False],
    [16778496, 16778751, 38803, 'AU', False, False, False, False],
    [16779264, 16781311, None, 'CN', False, False, False, False],
    [16781312, 16785407, None, 'JP', False, False, False, False],
    [16781312, 16781567, 2519, 'JP', False, False, False, False],
    [16785408, 16793599, None, 'CN', False, False, False, False],
    [16785408, 16785663, 141748, 'CN', False, False, False, False],
    [16793600, 16809983, 18144, 'JP', False, False, False, False],
    [16809984, 16842751, 23969, 'TH', False, False, False, False],
    [16809984, 16826367, 23969, 'TH', False, False, False, False],
    [16809984, 16818175, 23969, 'TH', False, False, False, False],
    [16809984, 16810239, 23969, 'TH', False, False, False, False],
    [16810240, 16810495, 23969, 'TH', False, False, False, False],
    [16810496, 16811007, 23969, 'TH', False, False, False, False],
    [16811008, 16811263, 23969, 'TH', False, False, False, False],
    [16811264, 16811519, 23969, 'TH', False, False, False, False],
    [16812032, 16812287, 23969, 'TH', False, False, False, False],
    [16812288, 16812543, 23969, 'TH', False, False, False, False],
    [16812544, 16812799, 23969, 'TH', False, False, False, False],
    [16812800, 16813055, 23969, 'TH', False, False, False, False],
    [16813312, 16813567, 23969, 'TH', False, False, False, False],
    [16814080, 16818175, 23969, 'TH', False, False, False, False],
    [16818176, 16826367, 23969, 'TH', False, False, False, False],
    [16818176, 16819199, 23969, 'TH', False, False, False, False],
    [16819200, 16819455, 23969, 'TH', False, False, False, False],
    [16819456, 16819711, 23969, 'TH', False, False, False, False],
    [16819712, 16819967, 23969, 'TH', False, False, False, False],
    [16819968, 16820223, 23969, 'TH', False, False, False, False]
]

They represent IPv4 networks, and I have data about IPv6 networks, too.

I got the data from a text file and I want to convert it into an sqlite3 database. In fact I have already done so, but the data updates daily, and for reasons I won't go into here I need to update my database frequently as well.

As you can see from the sample, only the start and end are unique; there is a great deal of duplication among the groups formed by the other six elements. Storing the rows like this is extremely inefficient. If we store only the unique groups and replace them in the rows with a reference to the stored group, we can save a huge amount of storage space (this is true for my data).

One easy way to compress the data is to store the repeated groups in a list in their order of first occurrence, and replace them in the rows with their index in the list.

The basic idea is simple: start with an empty list; for each row, check whether the group is already in the list. If it is not, append it and replace the group in the row with the length of the list prior to appending; otherwise, replace the group with its index in the list.

The problem is that list membership checking uses linear search, which is O(n) and very inefficient for large data, and my data is really huge, so it would take a lot of time. Even binary search would be slow in my case, and since the list is kept in order of occurrence rather than sorted order, I can't use binary search here anyway.

But set membership checking is amortized O(1): it is effectively constant time, extremely fast, and it stays fast no matter how many elements the set contains, so that is exactly what I need.

The process would be like this:

groups = []
unique_groups = set()
compressed = []
for row in sample:
    data = tuple(row[2:])
    if data in unique_groups:
        compressed.append([*row[:2], groups.index(data)])
    else:
        unique_groups.add(data)
        compressed.append([*row[:2], len(groups)])
        groups.append(data)

The problem with this approach is that list.index uses linear search, which is inefficient for my purposes as explained above, and each unique group is stored twice (once in the list and once in the set), using twice as much space as needed.

A different approach would be using a dictionary, like this:

groups = {}
compressed = []
for row in sample:
    data = tuple(row[2:])
    compressed.append([*row[:2], groups.setdefault(data, len(groups))])

This would be much faster, but I don't know about its space cost. I have heard that Python dicts are memory hungry, but with the list + set approach each item is stored twice...

It may be that for a small amount of data the first approach uses less space and for a large amount the second one does, and my data is extremely large. I don't know how they scale and I haven't tested yet: just loading my data takes about one Gibibyte of RAM, and I can't reliably generate test cases with the same duplication level as my data because I never saved the intermediate stage.

So which approach uses less space? Note that I also want performance, so there is a tradeoff: if the faster method takes about twice as much space but a tenth of the execution time, that is acceptable, but if it takes ten times as much space, then even if it were instant it would not be acceptable (I only have 16 GiB of RAM; it is finite).


Edit

All existing answers fail to solve my problem.

Because I need to process the data in Python first. There are many overlapping ranges with different data, overlapping ranges with the same data, and adjacent ranges with the same data. I need to process the ranges so that there are no overlaps, with the data always taken from the smaller range: I subtract the smaller range from the larger range, and I merge adjacent ranges with the same data, so that there is no ambiguity and the data is as small as possible.

See this question for more details and code.
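To give an idea of what that processing does (this is only a minimal, unoptimized sketch with made-up names, not my actual code): split the address space at every range boundary, give each elementary segment the data of the smallest range that covers it, and merge adjacent segments that carry the same data.

def flatten(rows):
    # rows are [start, end, *data] with inclusive ends
    points = sorted({p for start, end, *_ in rows for p in (start, end + 1)})
    by_size = sorted(rows, key=lambda r: r[1] - r[0])  # smallest range first
    result = []
    for lo, hi in zip(points, points[1:]):
        hi -= 1  # back to an inclusive end
        # data of the smallest range fully covering this elementary segment
        data = next((tuple(r[2:]) for r in by_size if r[0] <= lo and hi <= r[1]), None)
        if data is None:
            continue  # gap not covered by any range
        if result and result[-1][1] + 1 == lo and result[-1][2] == data:
            result[-1] = (result[-1][0], hi, data)  # merge adjacent segments with same data
        else:
            result.append((lo, hi, data))
    return result

The data tuples in the result are then replaced by indices using the compression discussed above; my real code has to be much faster than this quadratic sketch, of course.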

Given the above sample, the result would be:

[(16777216, 16777471, 1),
 (16777472, 16778239, 2),
 (16778240, 16779263, 3),
 (16779264, 16781311, 2),
 (16781312, 16781567, 5),
 (16781568, 16785407, 4),
 (16785408, 16785663, 6),
 (16785664, 16793599, 2),
 (16793600, 16809983, 7),
 (16809984, 16842751, 8),
 (16842752, 33554431, 0)]

And no, I am sure as hell I can't use sqlite3 to do that.

I would then store two tables in my sqlite3 database.

One with the above data, but with the third column incremented by one (sqlite3 ids are one-based). The other with the data below:

[(None, 'AU', False, False, False, False),
 (13335, 'AU', False, True, False, False),
 (None, 'CN', False, False, False, False),
 (38803, 'AU', False, False, False, False),
 (None, 'JP', False, False, False, False),
 (2519, 'JP', False, False, False, False),
 (141748, 'CN', False, False, False, False),
 (18144, 'JP', False, False, False, False),
 (23969, 'TH', False, False, False, False)]

In the first table, the third column is the id of the actual data in the second table.

This is to achieve maximum storage efficiency; it can then be further complemented by sqlite3 compression.

But I need the data as such. And sqlite3 can't do that alone.
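For concreteness, a minimal sketch of how those two tables could be written (the table and column names here are just placeholders, not my actual schema):

import sqlite3

def save(path, networks, groups):
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS groups ("
                "id INTEGER PRIMARY KEY, asn INTEGER, country TEXT, "
                "f1 BOOLEAN, f2 BOOLEAN, f3 BOOLEAN, f4 BOOLEAN)")
    con.execute("CREATE TABLE IF NOT EXISTS networks ("
                "start_ip INTEGER, end_ip INTEGER, "
                "group_id INTEGER REFERENCES groups(id))")
    # sqlite3 ids are one-based, so shift the zero-based list indices by one
    con.executemany("INSERT INTO groups VALUES (?, ?, ?, ?, ?, ?, ?)",
                    ((i + 1, *g) for i, g in enumerate(groups)))
    con.executemany("INSERT INTO networks VALUES (?, ?, ?)",
                    ((s, e, g + 1) for s, e, g in networks))
    con.commit()
    con.close()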

And I don't actually want to use sqlite3 or a Pandas DataFrame as such; I only store the data in an sqlite3 database because I need to persist it between sessions, and other serialization formats such as JSON and CSV would be inefficient for this. Pickle can be very efficient, but it isn't secure and isn't compatible across versions.

And no, I have benchmarked sqlite3 query performance on a disk-based database: a single query to retrieve one item by id takes milliseconds because of I/O delay, which is unacceptable. I will load all data from the database at the start of the program.

Into Python lists, because DataFrames are also slow on my machine according to my benchmarks.
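A matching minimal sketch of that startup load (same placeholder table and column names as above):

import sqlite3

def load(path):
    con = sqlite3.connect(path)
    # note: the booleans come back as 0/1 integers unless you register a converter
    ASN = [row[1:] for row in con.execute("SELECT * FROM groups ORDER BY id")]
    # convert the one-based group ids back to zero-based list indices
    NETWORKS = [(s, e, g - 1) for s, e, g in
                con.execute("SELECT start_ip, end_ip, group_id FROM networks ORDER BY start_ip")]
    con.close()
    return ASN, NETWORKS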

My goal is to find the network of any given IP address. To do this I binary-search the starts and then check that the IP is less than or equal to the corresponding end, like the following:

from bisect import bisect

ASN = [
    (None, 'AU', False, False, False, False),
    (13335, 'AU', False, True, False, False),
    (None, 'CN', False, False, False, False),
    (38803, 'AU', False, False, False, False),
    (None, 'JP', False, False, False, False),
    (2519, 'JP', False, False, False, False),
    (141748, 'CN', False, False, False, False),
    (18144, 'JP', False, False, False, False),
    (23969, 'TH', False, False, False, False)
]

NETWORKS = [
    (16777216, 16777471, 1),
    (16777472, 16778239, 2),
    (16778240, 16779263, 3),
    (16779264, 16781311, 2),
    (16781312, 16781567, 5),
    (16781568, 16785407, 4),
    (16785408, 16785663, 6),
    (16785664, 16793599, 2),
    (16793600, 16809983, 7),
    (16809984, 16842751, 8),
    (16842752, 33554431, 0)
]

STARTS, ENDS, ASNS = zip(*NETWORKS)

def get_network(ip):
    # rightmost network whose start is <= ip
    index = bisect(STARTS, ip) - 1
    # index < 0 means ip lies before the first network
    if index >= 0 and ip <= (end := ENDS[index]):
        return [STARTS[index], end, *ASN[ASNS[index]]]
    return None
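For example, looking up an address inside the second network above:

print(get_network(16778000))  # [16777472, 16778239, None, 'CN', False, False, False, False]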

And if you want more details see this.


Update

I have benchmarked the size of the objects. For arbitrarily generated data, the dict uses less space, and it has better time complexity too. One important observation, however, is that the benchmark reported memory usage in the hundreds-of-Mebibytes range, yet I didn't actually observe the interpreter using that much more RAM.

from itertools import product
from typing import Mapping, Sequence, Set

def get_size(obj: object) -> int:
    size = obj.__sizeof__()
    if isinstance(obj, Mapping):
        size += sum(get_size(k) + get_size(v) for k, v in obj.items())
    elif isinstance(obj, Sequence | Set) and not isinstance(obj, str):
        size += sum(get_size(e) for e in obj)
    return size

def generate(a, b, mode):
    gen = ((i, j, *k) for i, j in product(range(a), range(b)) for k in product((0, 1), repeat=4))
    if mode:
        return {k: i for i, k in enumerate(gen)}
    l = list(gen)
    return l, set(l)

print(get_size(generate(16, 16, 0)))
print(get_size(generate(16, 16, 1)))
print(get_size(generate(256, 256, 0)))
print(get_size(generate(256, 256, 1)))
2060792
1210444
528477112
314540108

ls = list(range(32768))
d = dict.fromkeys(range(32768))

In [348]: %timeit ls.index(128)
1.67 µs ± 20.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [349]: %timeit ls.index(4096)
52.2 µs ± 810 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [350]: %timeit d[128]
48.8 ns ± 0.758 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [351]: %timeit d[4096]
61 ns ± 0.937 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Overall, the dict approach is clearly better than the list + set one. But can we do better?

Ξένη Γήινος
  • I don't know, maybe you can first try some compression, e.g. a compressed folder where the sqlite db resides. If you have lots of duplicate data, maybe the compression will be enough. – Andrej Kesely Jul 26 '23 at 18:16
  • If you have a lot of data, use a database instead of keeping it in memory. – erip Jul 26 '23 at 19:04
  • If you wish to persist the data, and update it somewhat regularly, you (obviously) should put it into a database. A few million rows is not much to be concerned about. You seem to be wrapped up in normalizing the data, but the parts you are concerned about are 4 booleans and a 2-character string. It's **not worth it**. Those are trivial to store. Just put the records in as-is, with 8 fields of data in a table. – AirSquid Jul 26 '23 at 20:44
  • It seems you are doing complicated things for no apparent reason to me. Storing booleans is cheap and storing integers is relatively cheap unless you have a really huge number of integers. A million rows is not a huge amount of data. Indeed, even stored inefficiently as text, 10 million rows would use ~650 MiB. In practice SQLite appears to use a [more efficient](https://www.sqlite.org/fileformat.html) encoding. – Jérôme Richard Jul 26 '23 at 22:08
  • If SQLite is not optimal (likely), then you can pack booleans in bits and use 8 bits for the country. The number before seems to fit in 16 bits so 1 row should fit in 96 bits, that is 12 bytes (on standard machines). 10 million rows would fit in only 120 MiB. This is relatively small for a database. You can even load it entirely in RAM pretty quickly! – Jérôme Richard Jul 26 '23 at 22:13
  • Ha, and by the way, if you still want to "compress" rows, then (1) databases can do that, (2) you can do it yourself efficiently once the rows are packed. You can compute the hash of the blob and use a hash map to track unique values in `O(1)` time per row (each access should take less than 0.1 µs in practice, so at least 10 million rows per second can be compressed). Isn't this enough? – Jérôme Richard Jul 26 '23 at 22:21
  • The question is starting to get rather huge for a S.O. question (decreasing the number of readers, increasing the confusion, and so not attracting well-suited answers). That being said, here is a point regarding performance: (1) doing bisection on a list of tuples is inherently inefficient due to the many memory indirections needed (i.e. too many random accesses just to search for 1 value). Numpy can do that faster. Hash maps (`dict`) are much better for that but certainly not optimal. You can just store the values in a basic array if they tend not to be random and the upper bound is relatively small. – Jérôme Richard Jul 27 '23 at 11:12
  • Regarding the memory space, CPython caches small values. For example `None`, `False` and `True` are cached, as are integers smaller than 256 AFAIK. This means you should not include the size of the object in this case. AFAIK, in CPython, objects are basically ~32 bytes and lists/tuples store references to objects, taking 8 bytes each. Reading 1 object of a list requires to fetch the list, increase the list ref-counter, read the reference, dereference the pointer, increase the object ref-counter, decrease the list ref-counter, access/decode the object value/type, decrease the object ref-counter. – Jérôme Richard Jul 27 '23 at 11:21
  • @JérômeRichard Did you read my example code? I am not going to bisect a list of tuples, in fact as you can see from my example code I am going to split the list of three columns into three flat lists, each corresponding to one column. And I will then perform bisect on the first column, which is a flat list of integers representing the starts of IP networks. – Ξένη Γήινος Jul 27 '23 at 11:22
  • Indeed, there is no tuple during the computation. This is better, but still far from efficient I think, because of the CPython objects. The previous comment still applies. CPython is very slow for most things (the interpreter, but also data storage/access, function calls, etc.). If done in Numpy, the bisection will be faster because of fewer indirections and a more efficient data layout that better fits in the cache, not to mention it will be more compact in memory (32+8 bytes per integer is huge: 4x more than needed). This is why Pandas uses Numpy internally. – Jérôme Richard Jul 27 '23 at 11:36

3 Answers

-1

There is no need for you to be operating on such a large data set in RAM. Instead of using Python sets and lists, you can do the data compression in your SQLite database.

The database design that comes to mind is to store your "unique groups" in their own table and then reference it from another table that stores your records.

Something like this:

CREATE TABLE unique_group(
    id INTEGER PRIMARY KEY,
    first INTEGER,
    second TEXT,
    third BOOLEAN,
    fourth BOOLEAN,
    fifth BOOLEAN,
    sixth BOOLEAN,
    UNIQUE (first, second, third, fourth, fifth, sixth)
);

CREATE TABLE sample(
    first INTEGER,
    second INTEGER,
    group_id INTEGER REFERENCES unique_group(id)
);

As you insert data into your database, write to the unique_group table with an INSERT OR IGNORE INTO unique_group (...) VALUES (...) statement. Then get the id of the row in the unique group table and reference that in your sample table.
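A rough sketch of that insert flow (assuming the two tables above; note that SQLite treats NULLs as distinct for UNIQUE constraints, so rows whose first column is NULL may need extra handling):

import sqlite3

def insert_row(con: sqlite3.Connection, row):
    start, end, *group = row
    con.execute("INSERT OR IGNORE INTO unique_group "
                "(first, second, third, fourth, fifth, sixth) VALUES (?, ?, ?, ?, ?, ?)",
                group)
    # look the id up with a SELECT; lastrowid is not reliable here because the
    # INSERT may have been ignored when the group already existed
    (group_id,) = con.execute(
        "SELECT id FROM unique_group WHERE first IS ? AND second IS ? AND third IS ? "
        "AND fourth IS ? AND fifth IS ? AND sixth IS ?", group).fetchone()
    con.execute("INSERT INTO sample (first, second, group_id) VALUES (?, ?, ?)",
                (start, end, group_id))

Wrapping the whole batch in a single transaction keeps the per-row overhead down.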

Nick
  • *"Then get the id of the row in the unique group table and reference that in your sample table."* - That sounds slow, though. How do you propose to get the id, exactly? – Kelly Bundy Jul 26 '23 at 19:27
  • @KellyBundy `last_insert_rowid()` is one option. See [here](https://stackoverflow.com/questions/6242756/how-to-retrieve-inserted-id-after-inserting-row-in-sqlite-using-python) for more detail. Even a SELECT query probably wouldn't be slow enough to make much difference, but it would need to be tested for sure. – Nick Jul 26 '23 at 20:23
  • Hmm, they said "Even binary search would be slow in my case", and I suspect a database query would be even slower than that. Not sure about `last_insert_rowid`, maybe that doesn't require an extra query? But sounds like you'd need to insert those rows one by one then, and then I suspect that *that* is slow... – Kelly Bundy Jul 26 '23 at 20:43
-1

You can change the database structure in the following manner:

Metadata:

PK and (the six repeating columns)

Data:

PK, FK (PK of Metadata), and the two main data columns

And for faster execution you can use pandas.

So first I would convert this data into a DataFrame and then perform the operations below.

  1. Create two DataFrames as our new database tables, i.e. one for the metadata and one for the actual data.
  2. You can then take the unique values of the six columns using unique and store them in the metadata DataFrame.
  3. Use groupby on the six columns and save that in your data DataFrame.
  4. Then you can directly store the DataFrames in your database using to_sql (see the sketch below).
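A rough sketch of one way to realize these steps (column names are made up for illustration; drop_duplicates and groupby().ngroup() stand in for the unique/groupby steps):

import pandas as pd

cols = ["start", "end", "asn", "country", "b1", "b2", "b3", "b4"]
group_cols = cols[2:]

df = pd.DataFrame(sample, columns=cols)

# number each unique combination of the six repeating columns in order of appearance
df["group_id"] = df.groupby(group_cols, dropna=False, sort=False).ngroup()

metadata = df.drop_duplicates(group_cols)[["group_id", *group_cols]]
data = df[["start", "end", "group_id"]]

# then persist both tables, e.g.
# metadata.to_sql("metadata", con, index=False)
# data.to_sql("data", con, index=False)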
Navkar Jain
-1

First, create a table using something like this:

CREATE TABLE sample(
    id INTEGER PRIMARY KEY,
    col1 INTEGER,
    col2 INTEGER,
    col3 INTEGER,
    col4 TEXT,
    col5 BOOLEAN,
    col6 BOOLEAN,
    col7 BOOLEAN,
    col8 BOOLEAN,
    UNIQUE (col3, col4, col5, col6, col7, col8)
);

You can convert this nested list to a pandas dataframe:

df = pd.DataFrame(sample, columns=[columns you defined in the database])

and you can use df.to_sql() to insert data in the database.

If you don't want to create a unique constraint in the database but want to remove the duplicate data beforehand, you can do so in the dataframe as well.

df.drop_duplicates(
  subset = [columns you want to be unique together],
  keep = 'last').reset_index(drop = True)

Anyway, I suggest you convert the dataframe to a parquet file (which is very light) rather than dumping it to a file-based database. You can simply call df.to_parquet() to convert it to a parquet file.