8

The problem is explained in the following article.

I have a list of sentences, for example a list of 1000 sentences.

I would like to find a combination of sentences that matches, or comes closest to matching, a certain frequency table:

[a:100, b:80, c:90, d:150, e:100, f:100, g:47, h:10 ..... z:900]

I thought about finding all possible combinations from the sentence list (comb(1000, 1) up to comb(1000, 1000)) and comparing every combination with the frequency table so that the distance is minimal: sum the frequency tables of all sentences in a combination, compare this sum with the target, and record the combination with the smallest difference from the target. There could be multiple combinations that match equally closely.
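In Python the idea would look roughly like this (a sketch only, to make the approach concrete; brute_force, letter_counts and distance are made-up names, and the frequency tables are assumed to be Counter objects):

from collections import Counter
from itertools import combinations

def letter_counts(sentence):
    # lowercase and keep letters only, as described below
    return Counter(c for c in sentence.lower() if c.isalpha())

def distance(combined, target):
    # sum of the absolute per-letter differences from the target
    return sum(abs(combined[c] - target[c]) for c in set(combined) | set(target))

def brute_force(sentences, target):
    counts = [letter_counts(s) for s in sentences]
    best, best_dist = (), float('inf')
    for r in range(1, len(sentences) + 1):                      # comb(n, 1) .. comb(n, n)
        for combo in combinations(range(len(sentences)), r):
            combined = sum((counts[i] for i in combo), Counter())
            d = distance(combined, target)
            if d < best_dist:
                best, best_dist = combo, d
    return best, best_dist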

The problem is that the calculation of all combinations takes way too long to complete, apparently a couple of days. Is there a known algorithm that could solve this efficiently? Ideally within a couple of minutes?

Input sentences:

More RVs were seen in the storage lot than at the campground.

She did her best to help him. There have been days when I wished to be separated from my body, but today wasn’t one of those days.

The swirled lollipop had issues with the pop rock candy.

The two walked down the slot canyon oblivious to the sound of thunder in the distance.

Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.

He is no James Bond; his name is Roger Moore.

The tumbleweed refused to tumble but was more than willing to prance.

She was disgusted he couldn’t tell the difference between lemonade and limeade.

He didn’t want to go to the dentist, yet he went anyway.

Find the combination of sentences that matches the following frequency table closest:

[a:5, b:5, c:5, d:5, e:5, f:5, g:5, h:5 ..... z:5]

Example:

Frequency table of sixth sentence

He is no James Bond; his name is Roger Moore.

is [a:2, b:1, d:1, e:5, g:1, h:2, i:3, j:1, m:3, n:3, o:5, r:3, s:4]

The frequency table treats upper- and lowercase letters as equal and excludes special characters.
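For reference, such a frequency table can be computed with Python's collections.Counter; this sketch reproduces the example above:

from collections import Counter

sentence = 'He is no James Bond; his name is Roger Moore.'
table = Counter(c for c in sentence.lower() if c.isalpha())  # case-insensitive, letters only
print(sorted(table.items()))
# [('a', 2), ('b', 1), ('d', 1), ('e', 5), ('g', 1), ('h', 2), ('i', 3), ('j', 1), ('m', 3), ('n', 3), ('o', 5), ('r', 3), ('s', 4)]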

Michael Haephrati
BigChief
  • It's somewhat hard to follow your thoughts and try to understand what you're trying to achieve. Could you please include an actual example? With an actual list of sentences (but no more than 10 sentences), an actual frequency table, and the actual desired output? – Stef Dec 04 '21 at 11:57
  • yes let me provide an example – BigChief Dec 04 '21 at 11:58
  • Also, what I understand of your question makes me think about "balancing a chemical reaction". Instead of a list of sentences, a chemical reaction has a list of molecules; a molecule contains atoms, just like a sentence contains letters; and to balance the equation, an algorithm must determine the correct number of each molecule so that the numbers of each atom are consistent; just like you want to determine the number of each sentence so that the number of each letter is consistent. – Stef Dec 04 '21 at 11:59
  • Alternatively, your problem is maybe similar to the problem **multiset cover**, where the frequencies form a multiset, and each sentence is a sub-multiset, and you want to pick the smallest number of sentences to cover your frequency multiset. – Stef Dec 04 '21 at 12:01
  • Hi Stef, I think that would be a good comparison. So the number of selected sentences is also an important factor, you think? I thought the length of sentences didn't really matter as long as the sum matched the target best – BigChief Dec 04 '21 at 12:20
  • Not the length of each individual sentence; the number of selected sentences. In the multiset-cover problem, a valid solution is a solution in which the frequencies are *at least* the frequencies in the target; an optimal solution is one in which the frequencies are at least the frequencies in the target and the number of selected multisets is minimum. But in your case, you don't just want to have frequencies at least as high as the target: you want frequencies as close as possible to the target. So you don't need to optimise on the number of sentences. – Stef Dec 04 '21 at 12:35
  • yes that is right, I think having a result with different subsets that are equally close to the target is also sufficient, the list of combinations could perhaps be of use afterwards – BigChief Dec 04 '21 at 12:39
  • ***a combination of sentences to match/'match closest' a certain frequency table*** can you provide an example of the output i.e. combination of sentences for given input sentences? – kiner_shah Dec 07 '21 at 03:59
  • @kiner_shah please check Stef's answer and the comments below – BigChief Dec 07 '21 at 10:46
  • This is actually a Multidimensional Subset Sum (kD-SS) problem (a generalization of the [subset sum problem](https://en.wikipedia.org/wiki/Subset_sum_problem), with `k` being the number of dimensions, i.e. the number of different characters in your case). Both are NP-complete; furthermore kD-SS is not within [APX](https://en.wikipedia.org/wiki/APX), so there's no fast approximation algorithm either. [link to research paper](https://hal.inria.fr/hal-01401896/document) - so you'll have to settle for a suboptimal approach like the greedy algorithm described below. – Turtlefight Dec 10 '21 at 19:14
  • @Turtlefight thank you. So there is a brute-force method like I have coded here, which traverses all subsets (set line 109 to i<50 instead of 6 to show the slowness): https://www.onlinegdb.com/TDBEemTd5. But are you saying there is no faster way to get an exact match than the brute-force method? Only approximations that could be far away from the optimal solution? – BigChief Dec 10 '21 at 19:39
  • @BigChief Given that kD-SS is not within `APX` there can be no approximation function that runs in <= `O(nᶜ)` (with c being a constant factor). (it's closely related to the [Closest Vector Problem](https://en.wikipedia.org/wiki/Lattice_problem#Closest_vector_problem_(CVP))). If such a function would exist you would have proven that `P = NP`. So basically you're stuck with `O(scary)` complexity. – Turtlefight Dec 10 '21 at 20:31
  • And the O(less scary) for this problem is the link I sent before, https://www.onlinegdb.com/TDBEemTd5, or are there known improvements? – BigChief Dec 10 '21 at 20:53
  • How do you define "closest" solution? – zkoza Dec 12 '21 at 03:38
  • So I coded the brute-force method, which is O(scary); it will never complete for bigger n. Are there still possibilities to have the solution complete within a couple of minutes for n=50/n=1000, or is there no solution for this? – BigChief Dec 13 '21 at 14:18
  • @BigChief would you be willing to share a couple of challenges of different sizes to benchmark solutions against for your bounty? And including acceptable distance to the target freq table? A lot will depend on distribution of the input, the output and very importantly the distance that you are willing to accept. – Marc Stevens Dec 18 '21 at 11:18
  • @MarcStevens https://www.onlinegdb.com/TDBEemTd5 : if you set line 109 to i<50 or i<1000 (add lines to lines.h, from https://randomwordgenerator.com/sentence.php), it would still need to return an answer within an acceptable amount of time – BigChief Dec 18 '21 at 15:12
  • In addition, the brute-force method would return the closest combination of sentences possible – BigChief Dec 18 '21 at 17:41
  • This is a variation of the multiple-dimension subset sum problem, however, the question cannot be answered until you more clearly define what "closest" means. Specifically, you need to define a formula/function metric that returns a value that can be used to determine which combinations are actually closer. If you choose to leave this undefined or as a "black box" function, then the only solutions possible are brute-force combinatorial searches, as the mechanics of the distance function are key to algorithmic optimizations (and for some distance functions, there are no useful optimizations) – RBarryYoung Dec 21 '21 at 15:11
  • @RBarryYoung I thought it was clear from the problem statement that it would need to return one or more sentences that match the target closest. So the distance is measured by taking the sentence, or the sum of sentences, and comparing it to the target; the absolute distance from the target needs to be minimized. So the smallest distance to the target should be used as the result (there could be multiple with the same distance). Hope this is clearer now; please let me know if there are details missing. – BigChief Dec 24 '21 at 13:11
  • @BigChief There is no standard of distance for multi-dimensional values like this, and if there were it would be Pythagorean distance, but you've given no indication of that. The only hint in your post, and even in what you've said in your comment, is the word "absolute". From that, can we assume that by "distance" you mean "*the sum of the absolute differences of corresponding letter counts*"? – RBarryYoung Dec 24 '21 at 15:30
  • First take the sum of the corresponding letter counts over the selected sentences; then, per letter, take the absolute difference between that sum and the target; then sum these absolute differences. This number must be as small as possible. For example, with target [a:2, b:1] and combined counts [a:3], the distance is |3-2| + |0-1| = 2. Please let me know if this is clearer, because I'm not fully sure I understand you correctly – BigChief Dec 24 '21 at 20:02

6 Answers

5

Whenever someone finds a combination of sentences with 3 c's, 3 a's, 3 b's and 3 d's (or 30 c's, 30 a's, 30 b's and 30 d's), within 5% above or below, from the following sentences, it can be solved.

S1: aaaaaaaaaaaaaaaaaa bbbbbb c
S2: aaaaaaaa bbbbbbbb d
S3: aaaaaaaaaaa bbbbbbbbb c dd
S4: aaaaaaaaaa bbbbbbbb 

Be realistic: there is no exact solution in general, NP-hard or NP-complete or not. The number of occurrences of some letters in a sentence (for example vowels like 'i' or 'a') is not equal to that of others (like 'x' or 'w'). We can only find best matches, like the code provided here, or change the requirement. I tried to solve this with the knapsack algorithm, with Euclidean distance, and with standard deviation, but none gives such an answer, since there is no combination of sentences with exactly the required letter counts.

Majid Hajibaba
3

A greedy algorithm

Your first idea, testing all possible combinations of sentences, is too slow. If you have n sentences, then there are 2**n (2 to the power of n) possible combinations of sentences. For instance with n=1000, there are 2**1000 ≈ 10**301 possible combinations. That's a 1 followed by 301 zeroes: more than the number of particles in the universe, and more than the number of different possible games of chess!
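You can check that figure quickly in Python:

print(len(str(2**1000)))  # 302, i.e. 2**1000 has 302 digits, so it is a little over 10**301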

Here is a suggestion for a greedy algorithm. It's not particularly optimised, and its running time is O(k * n**2), where n is the number of sentences and k is the length of the longest sentence.

The idea is the following:

  • Attribute to each sentence the score number of useful characters - number of superfluous characters. For instance, if a sentence contains 20 'a' and the target requires only 15 'a', we count 15 useful 'a' and 5 superfluous 'a', so the character 'a' contributes 15 - 5 = 10 to the score of that sentence.
  • Add the sentence with the highest score to the result;
  • Update the target to remove the characters that are already in the result;
  • Update the score of every sentence to reflect the updated target.
  • Loop until no sentence has a positive score.

I was too lazy to implement it in C++, so here it is in Python, using a max-heap and a Counter. After the code I wrote a quick explanation to help you translate it into C++.

from collections import Counter
import heapq

sentences = ['More RVs were seen in the storage lot than at the campground.', 'She did her best to help him.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.', 'The swirled lollipop had issues with the pop rock candy.', 'The two walked down the slot canyon oblivious to the sound of thunder in the distance.', 'Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'He is no James Bond; his name is Roger Moore.', 'The tumbleweed refused to tumble but was more than willing to prance.', 'She was disgusted he couldn’t tell the difference between lemonade and limeade.', 'He didn’t want to go to the dentist, yet he went anyway.']

target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(target)
# Counter({'a': 10, 'b': 10, 'c': 10, 'd': 10, 'e': 10, 'f': 10, 'g': 10, 'h': 10, 'i': 10, 'j': 10, 'k': 10, 'l': 10, 'm': 10, 'n': 10, 'o': 10, 'p': 10, 'q': 10, 'r': 10, 's': 10, 't': 10, 'u': 10, 'v': 10, 'w': 10, 'x': 10, 'y': 10, 'z': 10})

counts = [Counter(''.join(filter(str.isalpha, s)).lower()) for s in sentences]  # remove punctuation, spaces, uncapitalize, then count frequencies

def get_score(sentence_count, target):
    return sum((sentence_count & target).values()) - sum((sentence_count - target).values())

candidates = []
for sentence, count in zip(sentences, counts):
    score = get_score(count, target)
    candidates.append((-score, sentence, count))

heapq.heapify(candidates)    # order candidates by score
                             # python's heapq only handles min-heap
                             # but we need a max-heap
                             # so I added a minus sign in front of every score

selection = []
while candidates and candidates[0][0] < 0:  # while there is a candidate with positive score
    score, sentence, count = heapq.heappop(candidates)  # greedily selecting best candidate
    selection.append(sentence)
    target = target - count                             # update target by removing characters already accounted for
    candidates = [(-get_score(c,target), s, c) for _,s,c in candidates]  # update scores of remaining candidates
    heapq.heapify(candidates)                       # reorder candidates according to new scores

# HERE ARE THE SELECTED SENTENCES:
print(selection)
# ['Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.', 'There have been days when I wished to be separated from my body, but today wasn’t one of those days.']

# HERE ARE THE TOTAL FREQUENCIES FOR THE SELECTED SENTENCES:
final_frequencies = Counter(filter(str.isalpha, ''.join(selection).lower()))
print(final_frequencies)
# Counter({'e': 22, 't': 15, 'a': 12, 'h': 11, 's': 10, 'o': 10, 'n': 10, 'd': 10, 'i': 9, 'r': 8, 'y': 7, 'm': 5, 'w': 5, 'c': 4, 'b': 4, 'f': 3, 'l': 3, 'g': 2, 'p': 2, 'v': 2, 'u': 2, 'z': 1})

# CHARACTERS IN EXCESS:
target = Counter('abcdefghijklmnopqrstuvwxyz' * 10)
print(final_frequencies - target)
# Counter({'e': 12, 't': 5, 'a': 2, 'h': 1})

# CHARACTERS IN DEFICIT:
print(target - final_frequencies)
# Counter({'j': 10, 'k': 10, 'q': 10, 'x': 10, 'z': 9, 'g': 8, 'p': 8, 'u': 8, 'v': 8, 'f': 7, 'l': 7, 'b': 6, 'c': 6, 'm': 5, 'w': 5, 'y': 3, 'r': 2, 'i': 1})

Explanations:

  • Python's Counter() transforms a sentence into a map character -> frequency;
  • For two Counters a and b, a & b is multiset-intersection, and a - b is multiset-difference;
  • For a Counter a, sum(a.values()) is the total count (the sum of all frequencies);
  • heapq.heapify transforms a list into a min-heap, which is a data structure that allows easy access to the element with minimum score. We actually want the sentence with maximum score, not minimum, so I replaced all the scores with negative numbers.
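A quick illustration of those Counter operations:

from collections import Counter

a, b = Counter('aab'), Counter('abb')  # a = {'a': 2, 'b': 1}, b = {'a': 1, 'b': 2}
print(a & b)                           # Counter({'a': 1, 'b': 1}) -- min of the two counts
print(a - b)                           # Counter({'a': 1})         -- negative counts are dropped
print(sum(a.values()))                 # 3                         -- total count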

Non-optimality of the greedy algorithm

I should mention that this greedy algorithm is an approximation algorithm. At every iteration, it chooses the sentence with the highest score; but there is no guarantee that the optimal solution actually contains that sentence.

It is easy to build an example where the greedy algorithm fails to find the optimal solution:

target = Counter('abcdefghijklmnopqrstuvwxyz')
print(target)
# Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1, 'g': 1, 'h': 1, 'i': 1, 'j': 1, 'k': 1, 'l': 1, 'm': 1, 'n': 1, 'o': 1, 'p': 1, 'q': 1, 'r': 1, 's': 1, 't': 1, 'u': 1, 'v': 1, 'w': 1, 'x': 1, 'y': 1, 'z': 1})

sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'abcdefghijklm',
    'nopqrstuvwxyz'
]

With this target, the scores are as follows:

[
    (17, 'The quick brown fox jumps over the lazy dog.'),
    (13, 'abcdefghijklm'),
    (13, 'nopqrstuvwxyz')
]

The two "half-alphabets" have a score of 13 each, because they contain 13 letters of the alphabet. The sentence "The quick brown fox..." has a score of 17 = 26 - 9, because it contains the 26 letters of the alphabets, plus 9 excess letters (for instance, there are 3 excess 'o' and 2 excess 'e').

The optimal solution, obviously, is to cover the target perfectly with the two halves of the alphabet. But our greedy algorithm will select the "quick brown fox" sentence first, because it has a higher score.
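You can reproduce those scores with the same get_score function as in the code above:

from collections import Counter

def get_score(sentence_count, target):
    return sum((sentence_count & target).values()) - sum((sentence_count - target).values())

target = Counter('abcdefghijklmnopqrstuvwxyz')
sentences = ['The quick brown fox jumps over the lazy dog.', 'abcdefghijklm', 'nopqrstuvwxyz']
for s in sentences:
    count = Counter(''.join(filter(str.isalpha, s)).lower())
    print(get_score(count, target), s)
# 17 The quick brown fox jumps over the lazy dog.
# 13 abcdefghijklm
# 13 nopqrstuvwxyz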

Stef
  • Hi Stef, thank you for your optimality addition. If one removes the 'g' from the full-alphabet sentence, the algorithm should select the two half-alphabets; not sure if that is currently covered by the algorithm, but it should be... I also converted your Python code to C++, as you can see below – BigChief Dec 04 '21 at 22:12
  • The bounty would be assigned to an optimal solution – BigChief Dec 06 '21 at 12:06
3
#include <algorithm>
#include <climits>
#include <cstring>
#include <string>
#include <vector>
using namespace std;

typedef struct
{
    wstring text{ L"" };            
    vector<int> encoded_text;
    int counter[26] = {}; // frequency table, zero-initialised

    int score = INT_MIN;

} Sentence;  

 
int m_target[26]
{
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10,10,10,10,10,
    10
};

bool orderByScore(const Sentence &a, const Sentence &b)
{
    return b.score < a.score;
}

int SentencesCounter::GetScore(Sentence sentence, int* target)
{
    int sum1 = 0;
    int sum2 = 0;

    for (size_t i = 0; i < 26; i++)
    {
        int sentenceFreq = sentence.counter[i];
        int targetFreq = target[i];

        sum1 += min(sentenceFreq, targetFreq);
        sum2 += max(0, sentenceFreq - targetFreq);
    }

    return sum1 - sum2;
}

vector<Sentence> SentencesCounter::SolveSO(vector<Sentence> &sentences)
{
    vector<Sentence> candidates{ sentences };

    for (size_t i = 0; i < candidates.size(); i++)
    {
        candidates[i].score = GetScore(candidates[i], m_target);
    }

    sort(candidates.begin(), candidates.end(), orderByScore);

    int target[26];
    memcpy(target, m_target, 26 * sizeof(int));

    vector<Sentence> selection;
    while (candidates.front().score > 0) // while there is a candidate with positive score
    {
        Sentence s = candidates.front();
        if(s.encoded_text.size() > 0) selection.push_back(s);
        candidates.front().score = INT_MIN;

        for (size_t i = 0; i < 26; i++) { target[i] -= s.counter[i]; } // update target

        size_t i;
        for (i = 0; i < candidates.size(); i++)
        {
            if (candidates[i].score > INT_MIN) // int min means already added to selection
                candidates[i].score = GetScore(candidates[i], target);
            else if (i != 0) break; // int min found at other index than top
        }

        partial_sort(candidates.begin(), candidates.begin() + i, candidates.end(), orderByScore);
    }
    return selection;
}

Attempt at replicating the Python code from Stef in pseudo C++.

BigChief
2

This can be reduced to the problem of finding the subsequence sum with the least absolute difference from a target.

The problem is as follows: You have an array A with integer values, say [1,5,3,2,6], and an integer value T, the target. You want to find the subsequence A' of elements from A such that abs(target - sum(A')) is minimized.

In your case, the individual values of A are multi-dimensional: each contains a sentence's frequency table for its characters. The target is also multi-dimensional, as it contains counts of characters. You want to minimize the sum of the absolute differences.

This is clearly a dynamic programming problem. Without optimization, the time complexity would be exponential: we need to check 2^n possibilities (for each element we have 2 choices: either take it or leave it). I think that's what you referred to in your question by creating all combinations.

But with optimization we can achieve O(n * T), where n is the number of elements in A and T is the value of the target. This is, of course, if we only want the closest number itself, not the elements that sum to that number.

To get the elements of the subsequence itself that leads to the optimal solution you have 2 options:

  1. Backtracking, which has the exponential time complexity explained earlier.
  2. DP with path reconstruction where the time complexity remains manageable as explained above.

These problems and algorithms are well known and I don't think they need explaining.

How your specific problem maps to this problem, as far as I understand, is also evident. There are of course some complexities in how you want to implement it. But if the relation between your problem and the subsequence sum problem as described above is not clear, please let me know so I can explain further.
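To illustrate the one-dimensional case described above, a minimal sketch (closest_subset_sum is a made-up name; the multi-dimensional variant would track frequency vectors instead of integers):

def closest_subset_sum(values, target):
    # dynamic programming over the set of achievable subsequence sums
    achievable = {0}
    for v in values:
        achievable |= {s + v for s in achievable}  # for each element: leave it or take it
    return min(achievable, key=lambda s: abs(target - s))

print(closest_subset_sum([1, 5, 3, 2, 6], 10))  # 10 (e.g. 5 + 3 + 2)

Path reconstruction would additionally record, for each newly reached sum, which element produced it.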

Here are a few links I found that may help you to solve this problem. Please note that they are not a straightforward answer, as this problem is relatively complex.

user1984
  • Hi user1984, thank you for your analysis. Do you perhaps have sample code somewhere that applies DP/backtracking to a similar problem? Unfortunately I don't have too much experience constructing dynamic programming solutions myself. – BigChief Dec 04 '21 at 12:24
  • You are welcome. Let me check if I find something. @BigChief – user1984 Dec 04 '21 at 12:25
  • Unfortunately I don't have something readily available but I added some resources to the bottom of my answer. Most of them are lengthy and need some study but that's the nature of this type of problem, IMHO. @BigChief – user1984 Dec 04 '21 at 12:49
  • So your latest edit would suggest that this could be a knapsack problem? Like here? https://www.geeksforgeeks.org/0-1-knapsack-problem-dp-10/ – BigChief Dec 06 '21 at 08:04
  • @BigChief I didn't make any new edits after your last comment yesterday. I still think that this is a dp with path reconstruction problem where the dp part is conceptually similar to the Closest Subsequence Sum problem, as linked in the first bullet point. – user1984 Dec 06 '21 at 08:47
  • I built solutions for the LeetCode problem; not sure yet how to convert them to a solution with frequency tables. CPP: https://onlinegdb.com/8TnTR0ygo Python: https://onlinegdb.com/Y9eGPwfq2 – BigChief Dec 08 '21 at 19:45
  • https://www.geeksforgeeks.org/meet-in-the-middle/ I guess this methodology is used? – BigChief Dec 09 '21 at 10:14
  • I agree, this is NP-hard. In fact, if you go to https://en.wikipedia.org/wiki/NP-hardness, the example given is the subset-sum problem. – Tiger4Hire Dec 10 '21 at 14:48
2

We tried to find a solution, described in this article, but I think the solution isn't good: https://www.codeproject.com/Articles/5320281/A-problem-finding-optimal-number-of-sentences-and

Michael Haephrati
2

This looks to me like an advanced knapsack problem. The upper limit on the input size (1000) also helps; it seems O(n^2) complexity should be acceptable here.

In a standard knapsack problem, you have 2 inputs, value and weight, and a limit up to which you can carry total weight, such that total value is maximised.
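For reference, a minimal sketch of that standard 0/1 knapsack DP (in Python; values, weights and capacity are the classic textbook inputs):

def knapsack(values, weights, capacity):
    # dp[w] = best total value achievable with total weight <= w
    dp = [0] * (capacity + 1)
    for v, wt in zip(values, weights):
        for w in range(capacity, wt - 1, -1):  # iterate backwards so each item is taken at most once
            dp[w] = max(dp[w], dp[w - wt] + v)
    return dp[capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # 220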

Here, your limit will be your target frequency table, eg.

[a:100, b:80, c:90, d:150, e:100, f:100, g:47, h:10 ..... z:900]

and the input weights will be the individual sentences' frequency tables. E.g., in the 10-sentence example you have given, instead of seeing the input as the sentences, look at the input as follows:

More RVs were seen in the storage lot than at the campground ->
{'m': 2, 'o': 4, 'r': 5, 'e': 8, 'v': 1, 's': 3, 'w': 1, 'n': 4, 'i': 1, 't': 6, 'h': 3, 'a': 4, 'g': 2, 'l': 1, 'c': 1, 'p': 1, 'u': 1, 'd': 1}
She did her best to help him. There have been days when I wished to be separated from my body, but today wasn’t one of those days ->
{'s': 8, 'h': 9, 'e': 16, 'd': 8, 'i': 4, 'r': 4, 'b': 5, 't': 9, 'o': 8, 'l': 1, 'p': 2, 'm': 3, 'a': 7, 'v': 1, 'n': 4, 'y': 5, 'w': 3, 'f': 2, 'u': 1}
The swirled lollipop had issues with the pop rock candy ->
{'t': 3, 'h': 4, 'e': 4, 's': 4, 'w': 2, 'i': 4, 'r': 2, 'l': 4, 'd': 3, 'o': 4, 'p': 4, 'a': 2, 'u': 1, 'c': 2, 'k': 1, 'n': 1, 'y': 1}
...
...
...
He didn’t want to go to the dentist, yet he went anyway ->
{'h': 3, 'e': 6, 'd': 3, 'i': 2, 'n': 5, 't': 9, 'w': 3, 'a': 3, 'o': 3, 'g': 1, 's': 1, 'y': 3}
and so on...

Now, in this case, we do not have the values list that we would maximise in a standard knapsack. Our value will be derived from the combined frequency table only, as our maximisation condition is the minimal difference between the target frequency table and the combined frequency table. Instead of normal addition to maximise, we need a function that caters to this maximisation condition, for instance along the lines of the sketch below.
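One possible shape for such a function, as a sketch (assuming Counter-style frequency tables; value is a made-up name):

from collections import Counter

def value(combined, target):
    # higher is better; 0 means the combined table matches the target exactly
    letters = set(combined) | set(target)
    return -sum(abs(combined[c] - target[c]) for c in letters)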

NOTE: While writing this answer, I assume you have prior knowledge of DP and standard knapsack algorithm. If not, you really need to study that first as that forms the basis of this solution.

NOTE-2: There are certainly some things in the answer where it is possible for me to elaborate further. If any bit is unclear or need explicit explanation, please feel free to ask in the comments and I'll be happy to edit the answer in reply to that.

vish4071
  • I implemented this before, but performance is too bad for large input. For example, the number of states for letter target 100 is (100^36). However, for a small input of sentences it's OK. – Majid Hajibaba Dec 20 '21 at 08:42
  • @vish4071 do you have sample code that would demonstrate your findings? – BigChief Dec 20 '21 at 15:36
  • @MajidHajibaba Number of states shouldn't go that big. It should only be equal to `n^2` where `n=number of input sentences`. – vish4071 Dec 20 '21 at 17:21
  • @BigChief I don't have the code for now. Will try and update the answer with something that would help you at least to write the full code. Would pseudocode or Python work? It's been long since I coded in C++, so... – vish4071 Dec 20 '21 at 17:23
  • Also, @MajidHajibaba I don't think you need to "reach" the "letter target" – vish4071 Dec 20 '21 at 17:52
  • @BigChief I would also want some constraints / upper limit on total number of sentences, size of sentences and values in the target array. – vish4071 Dec 20 '21 at 18:03
  • @vish4071 keep in mind though that going over the target can result in better solutions (simple case: sentences: `aab`, `bccc` - target: `{a: 1, b: 2, c: 3}` - the best solution here is actually `{a: 2, b: 2, c: 3}`, even though it's over the target for a) so you can't stop once you reach the limit, because there might be better solutions on the other side :) – Turtlefight Dec 20 '21 at 18:44
  • @vish4071 n = 1500; just generate sentences with https://randomwordgenerator.com/sentence.php. The target can range from 10 to 10000 per letter – BigChief Dec 20 '21 at 22:30
  • @BigChief I really hoped the upper limit per letter on the target wasn't that big. That makes the DP array almost impossible to build on a normal 8GB machine. However, if there is an upper limit on the sum of values in the target distribution, and it is of the order of 10^4, only then is it possible to try the knapsack method. E.g., I can try and build a solution if the target's sum of values is < 20,000. – vish4071 Dec 21 '21 at 07:31
  • @Turtlefight It can be taken care of if we double the target freq of each letter for capacity of knapsack. – vish4071 Dec 21 '21 at 07:32
  • @vish4071 I know, and I did it before with knapsack. However, it isn't efficient for many millions of sentences. See my sample program ( https://onlinegdb.com/XkoYQjvSW ) – Majid Hajibaba Dec 21 '21 at 08:14
  • @vish4071 consider `aabbb`, `cc` with target `{a: 0, b: 3, c: 2}` - you can't put an upper limit on it (in this case you need to take the 2 a's for the optimal solution even though they aren't in the target) – Turtlefight Dec 21 '21 at 10:49
  • @Turtlefight The way I am thinking of would have yielded the correct result here. There are a couple hacks as well along with the standard knapsack calculations. – vish4071 Dec 21 '21 at 11:55