Efficient data store / algorithm to pattern match binary strings of 0s and 1s

Question

I have generated a .txt file containing hundreds of rows. Each of the rows looks like this:

0000000000000000000000000000000000000000100000000010000000001000000000100000000010000000000000000000

Now I have an input function where an string like this gets submitted:

0000000000000000000000000000000000000000000011000010000000001000000000100000000010000000000000000000

If the input string has a 1 where a string from the .txt has a 0 it doesn't matter. The question the algorithm should solve is following:

find the "best matching" rows in the .txt where input string has as much 1s at the same indices the row has
output at which indices theses "best matching rows" have 1s and the input hasn't

Circumstances:

There are a maximum of five 1s in each row of the .txt
There can be up to a hundred 1s in the input string.
Each row and the input are exactly 100 characters long.
Each row and the input will always only contain 0 or 1; no other characters.

Here is an example that's smaller than the real problem assuming length 6 with max 3 ones (I will count indices starting from 1 not 0 for this example):

rows.txt

input string:

Expected output of the algorithm

Closest matching rows are:

Row 1: two matching 1s; your input misses a 1 on index 6
Row 4: two matching 1s; your input misses a 1 on index 4

Question / possible solutions:

I choose 0s and 1s to represent the information I need as I planned on putting them in a binary file. I was expecting better performance and lower file size when using binary representation and unsigned integers in the code (Java/Kotlin). However to go on uInt8; I would need to add "0000" on each line to make each line 104 characters long which is dividable by 8. That's surely possible, but than I have everything saved as integers and I just can't get my head around on how to compare these on a byte-representation level to basically do the pattern matching I need in the code.

Looking for the best performing storage/pattern matching solution for the given problem.

Here is why I am doing this: I have a frontend with a clickable grid. An unckecked item is represented by a 0. Clicking on a item and thus selecting it is represented by a 1. Users can select up to 50 of the 100 available items. However, there some items depend on other items. Every item always depends on 4 other items. These are the rows of the (currently) .txt file. And that's why I need to figure out what ones are missing to recommend a configuration that has a full dependency of five 1s working while I don't need to care if there are additional selections (1s). Maybe this use-case clarifies how an abstract problem like this is a real world problem and not a school assignment. Prior solution involved representing the grid by a matrix, but having a binary string seemed to be more efficient.

Clarifying on the problem: Binary three:

00000011

Binary thirteen:

00001101

If there is a row representing binary 3 and the input is binary 13, how could I compare

val row: uint8 = 3u
val input: uint8 = 13u

to get the result

missing value 2 (-> a one on index 7

https://meta.stackoverflow.com/questions/334822/how-do-i-ask-and-answer-homework-questions — sidgate, Dec 13 '21 at 12:06
@sidgate this is NOT a homework question. It's for a personal project. Also looking less for an complete algorithm and more for hints/explanation on how to deal with integer <-> binary matching. — Cerealz, Dec 13 '21 at 12:07
You could create your algorithm first and use a function to retrieve the "bits". If that works, you can change the internal representation and store the strings as bit arrays. In the last step it could be interesting to "inline" the get-bit functions in order to gain some processing speed. And as a hint, if I'd like to know if bit 5 (zero based) of a byte `a` is 1, an expression like `(a & 0x20 == 0 ? 0 : 1)` yields 1 if the bit is set, 0 otherwise. — Ronald, Dec 13 '21 at 12:23

polo-language · Accepted Answer · 2021-12-14T13:19:32.690

Don't work with "0", "1" text, use binary representations.

Compare the 'input' with each 'row' using bitwise &, then calculate the number of ones in the result by wrangling an acceptable solution from one of the answers here: Count number of 1's in binary representation.

EDIT:

To expand this idea a bit in response to @gidds comment:

You have expressed a willingness to pad with zeros ('to go on uInt8; I would need to add "0000"'), so maybe use int (too bad each line isn't of length 96!), which trades a bit of space for the convenience of comparinging 32 columns in one primitive operation. Really it won't waste much space though because you can read each line of the file in a loop into a single re-useable ByteBuffer. ByteBuffer#allocate initializes to zeros already, so padding is taken care of. There's no reason you couldn't use a long, an int, and a byte for a total of 104 bits, but probably that's not worth the trouble - there's no nibble in Java/Kotlin so no way to get exactly 100. I've chosen 4 ints over 2 longs solely to be able to take advantage of the bitCount algorithm found at the above link.

Here is a fairly comprehensive sketch in pure Java:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class BitSimilarityChecker {
    
    public void main(String[] args) throws Exception {
        
        // int[] input = ... 
        assert input.length == 4;
        
        try (ByteFileReader reader = new ByteFileReader(args[0])) {
            while (/* TODO: When has file been fully read? */) {
                int[] row = reader.read(); 
                assert row.length == 4;
                int sum = 0;
                for (int i = 0; i < 4; ++i) {
                    sum += bitCount(and(input[i], row[i]));
                }
                // TODO: Keep track of the max sum, row pair...
            }
        }
    }
    
    private static int and(int left, int right) {
        return left & right;
    }
    
    // Taken from https://stackoverflow.com/a/8871435 - unverified.
    private static int bitCount(int u) {
         int uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
         return ((uCount + (uCount >> 3)) & 030707070707) % 63;
    }
    
    private static class ByteFileReader implements AutoCloseable {

        private static final int CAPACITY = 4 * 8 * Integer.BYTES;
        private static final int LIMIT = 100;
        
        private final ReadableByteChannel channel;
        private final ByteBuffer buffer;
        
        @SuppressWarnings("resource")
        public ByteFileReader(String file) throws FileNotFoundException {
            channel = Channels.newChannel(new FileInputStream(file));
            buffer = ByteBuffer.allocate(CAPACITY);
        }
        
        public int[] read() throws IOException {
            buffer.position(0);
            buffer.limit(LIMIT);
            int bytesRead = channel.read(buffer);
            buffer.limit(CAPACITY);
            // TODO: Check bytesRead to see the status of the read.
            // Pass this information on to the caller somehow. 
            return new int[] {
                    buffer.getInt(),
                    buffer.getInt(),
                    buffer.getInt(),
                    buffer.getInt()
            };
        }

        @Override
        public void close() throws Exception {
            channel.close();
        }
    }
}

What type would the binary representation have? According to the question, each line has 100 characters, so each number has 100 bits — too large for `Long` (64 bits) or any other primitive type. Kotlin doesn't provide a larger type which can use bitwise `&` operators. (If running on the JVM, you could use `java.util.BitSet`, though the syntax is more awkward.) — gidds, Dec 13 '21 at 17:40

Efficient data store / algorithm to pattern match binary strings of 0s and 1s

1 Answers1