I have been trying to find a solution to this problem for some time, but I haven't come up with any good idea yet.

I have a list of large strings in which some character patterns (there are no spaces) repeat. To reduce the memory these strings take, I want to find those patterns and convert the list of strings into two lists: one that contains each pattern exactly once, and one that holds tuples/lists of pointers into the pattern list, so that each original string can be recreated from its tuple.

For example, for the list of strings ['FOOBAR', 'BARFOO'] I would like to get ['FOO', 'BAR'], [(0, 1), (1, 0)]

The words in the pattern list should have a length of at least 2 (unless there is no other choice, for example when there is only a single character between two repeating patterns, or the entire input string has length 1) - or, better, be as long as possible (addressing takes memory too, so if some word occurs only once, it should be referenced by a single pointer rather than several).

The algorithm also needs to be fast (ideally linear complexity), as my script performs this operation on the user's input - and I don't want my users to wait too long.

Below is an example script of how it should work:

def getLists(str_list):
    # code here
    return pointers, out_strings


strings = ["FGJohnyRFGDERT", "VBSJohnR", "AAERFGR"]
pointers, out_strings = getLists(strings)
print(pointers, out_strings)
# [(0, 1, 2, 0, 3, 4, 5), (6, 1, 7), (8, 4, 0, 7)] ['FG', 'John', 'yR', 'D', 'ER', 'T', 'VBS', 'R', 'AA']
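
Just to be explicit, the decompression side is meant to be trivial: each original string is the concatenation of the patterns its tuple points to. A minimal sketch (rebuild is only an illustrative name):

def rebuild(pointers, out_strings):
    # Each original string is the concatenation of the patterns
    # referenced, in order, by its pointer tuple.
    return ["".join(out_strings[i] for i in ptrs) for ptrs in pointers]

assert rebuild([(0, 1), (1, 0)], ["FOO", "BAR"]) == ["FOOBAR", "BARFOO"]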

Thank you in advance for your help! <3

EDIT: Someone proposed compression libraries like zlib. Sadly, I need to unpack the original strings in a low-memory and very simple language that has no zlib support. So while the compression algorithm can be very complex, the decompression has to be as simple as possible.

Amae Saeki
  • Hi! Interesting problem. For things like this, one of my reflexes is to take a look at DNA sequencing libraries and see whether they have tools for it. For instance [Biopython](https://biopython.org/docs/1.75/api/index.html) – Stef Feb 08 '23 at 09:28
  • Will take a look :) Although, even if just for learning, it would be nice to find out the algorithm rather than use a library without knowing what is inside :D – Amae Saeki Feb 08 '23 at 09:31
  • Related: [Algorithm to find common substring across n strings?](https://stackoverflow.com/questions/2418504/algorithm-to-find-common-substring-across-n-strings); [Common substring in list of strings?](https://stackoverflow.com/questions/66231459/common-substring-in-list-of-strings) (although this last link only presents brute-force algorithms) – Stef Feb 08 '23 at 09:33
  • Yes, but doing it string by string might give me O(n²) complexity, and that is something I would like to avoid... I am sure some smart usage of dicts here could give me O(n) – Amae Saeki Feb 08 '23 at 09:35
  • Here it is a bit easier than in DNA, as the words always occur intact (there are no stray chars inside these words to make the process harder) – Amae Saeki Feb 08 '23 at 09:45
  • Perhaps you could try with suffix trees. It should be particularly efficient if the user is typing one string at a time. When the user adds a string, the suffix tree from all previous strings is already built, so looking for substrings in the new string is fast. See [wikipedia](https://en.wikipedia.org/wiki/Suffix_tree) and [this python package](https://github.com/kasraavand/SuffixTree) – Stef Feb 08 '23 at 10:11
  • Suffix tree... interesting, I will take a look at it. And no, the user adds a list of strings all at once, not one string at a time. But still, if the tree could reduce complexity, it is worth checking – Amae Saeki Feb 08 '23 at 10:28
  • What is the expected output for `["ABCD", "AB", "BC", "CD"]`? – trincot Feb 08 '23 at 11:21
  • Interesting question. I guess, to keep it simple, the algorithm would just go string by string: first it would see that "AB" is already in the first string; then "BC" would no longer work, because the first string is split into "AB" + "CD", so "BC" would be treated as a separate word; and "CD" already has a match in the first string, so it would be accepted. So in my opinion it would give ```["AB", "CD", "BC"], [(0, 1), (0, ), (2, ), (1,)]``` (a rough sketch of this greedy idea appears after these comments). It doesn't have to be exactly as I write - the goal is just to reduce the memory used by the strings as much as possible by finding patterns. – Amae Saeki Feb 08 '23 at 11:34
  • So if you typed them in the opposite order, like ```["BC", "ABCD", "AB", "CD"]```, the result would be different - unless there is some amazing way to find the best possible combination, but to be honest even this simple compression would help me – Amae Saeki Feb 08 '23 at 11:36
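
A minimal sketch of the greedy scheme from the two comments above, under stated assumptions: getLists mirrors the signature from the question, the helpers encode and split_on are made-up names, and a known pattern is split only when a whole later input string occurs inside it - so it reproduces the ["ABCD", "AB", "BC", "CD"] walk-through, but it will not discover shared substrings such as "John" on its own.

def getLists(str_list):
    out_strings = []
    pointers = []

    def encode(s):
        # Try to express s purely as a concatenation of known patterns,
        # always taking the longest pattern that matches at position i.
        parts, i = [], 0
        while i < len(s):
            best = max((p for p in out_strings if s.startswith(p, i)),
                       key=len, default=None)
            if best is None:
                return None
            parts.append(out_strings.index(best))
            i += len(best)
        return tuple(parts)

    def split_on(word):
        # If word occurs strictly inside a known pattern, split that
        # pattern into pieces and renumber existing pointers to match.
        for k, p in enumerate(out_strings):
            pos = p.find(word)
            if pos == -1 or p == word:
                continue
            pieces = [x for x in (p[:pos], word, p[pos + len(word):]) if x]
            out_strings[k] = pieces[0]
            new = [k] + list(range(len(out_strings),
                                   len(out_strings) + len(pieces) - 1))
            out_strings.extend(pieces[1:])
            for t, ptrs in enumerate(pointers):
                pointers[t] = tuple(j2 for j in ptrs
                                    for j2 in (new if j == k else [j]))
            return True
        return False

    for s in str_list:
        enc = encode(s)
        if enc is None and split_on(s):
            enc = encode(s)
        if enc is None:
            # Nothing matched: register the whole string as a new pattern.
            out_strings.append(s)
            enc = (len(out_strings) - 1,)
        pointers.append(enc)
    return pointers, out_strings


strings = ["ABCD", "AB", "BC", "CD"]
pointers, out_strings = getLists(strings)
print(pointers, out_strings)
# [(0, 1), (0,), (2,), (1,)] ['AB', 'CD', 'BC']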

1 Answer

This sounds like a compression problem. There are plenty of compression libraries out there; zlib is one that comes with the standard library & does a decent job of shrinking things.

Another thought is that you could change data structures. For instance, instead of a list of str, try a list of bytes, or a bytearray with a custom seeking & data-segmentation protocol.
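
For instance, one hypothetical shape such a protocol could take (pack and unpack are made-up names, and UTF-8 is assumed just for the example): all strings live in a single bytes blob, and an offset table records where each one starts.

def pack(strings, encoding="utf-8"):
    # Concatenate every string's bytes into one blob and record the
    # start of each string; a final sentinel offset marks the end.
    blob, offsets = bytearray(), []
    for s in strings:
        offsets.append(len(blob))
        blob += s.encode(encoding)
    offsets.append(len(blob))
    return bytes(blob), offsets

def unpack(blob, offsets, i, encoding="utf-8"):
    # Recover string i by slicing between consecutive offsets.
    return blob[offsets[i]:offsets[i + 1]].decode(encoding)

blob, offsets = pack(["FGJohnyRFGDERT", "VBSJohnR", "AAERFGR"])
assert unpack(blob, offsets, 1) == "VBSJohnR"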

Anywho, here's an example of zlib in use:

import zlib

strings = ["FGJohnyRFGDERT", "VBSJohnR", "AAERFGR"]  # example input

compressed_strings = [
    zlib.compress(string.encode(), 9)
    for string in strings
]
# zlib.decompress returns bytes, so decode back to str before comparing
assert strings == [
    zlib.decompress(compressed_string).decode()
    for compressed_string in compressed_strings
]

Best,

aiootp
  • It won't help me - because I need to use these strings in another, very basic language with limited memory that doesn't have zlib support. I need a very simple decompression algorithm - if I had a list of lists of pointers, it would be very simple to recreate the strings there – Amae Saeki Feb 08 '23 at 10:13
  • Hm. Does the language support sets? Also, what is the goal? Full reconstruction, or is probabilistic matching ok? If lossless decompression is the goal: Re-implement a well-studied efficient lossless compression algorithm. This is probably the best option. Because compression is, well, complex & hard. Storing all unique sub-strings & pointers from each reference to all occurrences will always use more memory than just storing the original string unless the data & parameters are tuned to each other just right. You have to learn from the data itself what kinds of patterns to optimize for. – aiootp Feb 08 '23 at 10:48
  • The language supports only lists (whose elements can be of different types) and basic elements. So sets and any other hash-like structures aren't supported. – Amae Saeki Feb 08 '23 at 10:50
  • And storing pointers is very efficient in my case. I want Python to create the byte-like structure that is easiest to decompress while using as little memory as possible. The strings I use aren't always ASCII; they are UTF-16 strings. So a single one-byte pointer to a repeating six-byte word will quickly get me memory savings. – Amae Saeki Feb 08 '23 at 10:52
  • So the thing is... if I use too complex an algorithm for decompression, then even if the data doesn't take much memory, the code to actually decompress it will take lots. So I would rather leave all of the heavy computing to Python and just output data that is very easy to decompress, like the structure I proposed in this question – Amae Saeki Feb 08 '23 at 10:58
  • Why not leave all of the computing to Python then? Just use that language to handle message passing. Also not clear is where this user input is coming from: a web form, or key presses, or a log file... I'm leaning heavily towards the idea that you'll need to implement an algorithm that has already been studied thoroughly, or offload the processing to a worker that already knows how to. The space of possible sub-strings grows very quickly. Unless you are using a clever algorithm, you're going to find your program on the worst-case side of those complexities more often than is tolerable. – aiootp Feb 08 '23 at 11:39
  • Because the basic language I'm talking about is a scripting language made for a specific game. I cannot control all of the game's aspects from Python - only game scripts can do that - and those have memory limits. Python auto-generates these game scripts in the proper language. Anyway, I just need a simple decompression algorithm. And in those strings there are a lot of repetitions, so pattern recognition sounds like the best idea to me for now – Amae Saeki Feb 08 '23 at 11:41
  • Seems like there's a flaw in the goal. You'll be using Python for compression because the game scripts can't handle the memory overhead of the decompressed data, but then you'll be decompressing that same data in the same game scripting environment which cannot handle it. So, another idea is to just buffer the inputs so the game scripts only get the amount of data they can handle. – aiootp Feb 08 '23 at 11:50
  • I don't decompress all the data at once - only the currently needed part is decompressed, and the rest stays compressed. And the script cannot read data from disk - all the data has to be in the script from the very beginning – Amae Saeki Feb 08 '23 at 12:05