Create cartesian product of strings by replacing substrings from a list

Question

I have a dictionary that maps placeholders to lists of possible values, as shown below:

{
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

I want to create all possible combinations of strings by replacing the placeholders (i.e. ~GPE~ and ~PERSON~) from the template:

"My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

Expected output is:

"My name is John Davies. I travel to UK with Tom Banton every year."
"My name is John Davies. I travel to UK with Joe Morgan every year."
"My name is John Davies. I travel to USA with Tom Banton every year."
"My name is John Davies. I travel to USA with Joe Morgan every year."
"My name is Tom Banton. I travel to UK with John Davies every year."
"My name is Tom Banton. I travel to UK with Joe Morgan every year."
"My name is Tom Banton. I travel to USA with John Davies every year."
"My name is Tom Banton. I travel to USA with Joe Morgan every year."
"My name is Joe Morgan. I travel to UK with Tom Banton every year."
"My name is Joe Morgan. I travel to UK with John Davies every year."
"My name is Joe Morgan. I travel to USA with Tom Banton every year."
"My name is Joe Morgan. I travel to USA with John Davies every year."

Note that a value may not be used more than once for the same placeholder within a sentence. For example, I do not want: "My name is Joe Morgan. I travel to USA with Joe Morgan every year." (So this is not exactly a cartesian product, but it is close.)

I am new to Python and have been experimenting with the re module, but I could not find a solution to this problem.

EDIT

The main problem I am facing is that replacing a placeholder changes the length of the string, which makes subsequent modifications difficult. This matters especially because the same placeholder can occur several times in the string. The snippet below illustrates the issue:

import re

label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}


template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

for label in label_dict.keys():
    modified_string = template
    offset = 0
    for match in re.finditer(r'{}'.format(label), template):
        for label_text in label_dict.get(label, []):
            start, end = match.start() + offset, match.end() + offset
            offset += (len(label_text) - (end - start))
#             print ("Match was found at {start}-{end}: {match}".format(start = start, end = end, match = match.group()))
            modified_string = modified_string[: start] + label_text + modified_string[end: ]
            print(modified_string)

This gives the following incorrect output:

My name is ~PERSON~. I travel to UK with ~PERSON~ every year.
My name is ~PERSON~. I travel USA with ~PERSON~ every year.
My name is John Davies. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohTom Banton. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with John Davies every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohTom Banton every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohToJoe Morgan every year.

Does this answer your question? [Permutations between two lists of unequal length](https://stackoverflow.com/questions/12935194/permutations-between-two-lists-of-unequal-length) — Tomalak, Jul 24 '21 at 09:17
@Tomalak No, because I do not know a way to replace multiple instances of the same placeholder in a given string. E.g. "~PERSON~" appears twice in the template. Using re.finditer I can handle the multiple instances of a placeholder, but the length of the string changes when a placeholder is replaced with one of its possible values. Added a snippet to elaborate further. — Japun Japun, Jul 24 '21 at 10:16
Regular expressions are for replacing *patterns*. You don't have a pattern. You have a fixed string. You can use the good old `str.replace()`, and you can instruct that function to replace only N (e.g. 1) instances. https://docs.python.org/3/library/stdtypes.html#str.replace — Tomalak, Jul 24 '21 at 10:17
I do have patterns. The keys in the dictionary are the patterns. The challenge is to replace all instances of the pattern/placeholder with unique value from its possible values. — Japun Japun, Jul 24 '21 at 11:45
No, you don't. You have `"~GPE~"` and `"~PERSON~"` and those are fixed strings. — Tomalak, Jul 24 '21 at 11:47
Does this answer your question? [String Replacement Combinations](https://stackoverflow.com/questions/14841652/string-replacement-combinations) — Karl Knechtel, Mar 02 '23 at 03:15
@Tomalak no, that question is completely different, and also not suitable for use as a duplicate (because most people who tried to answer it, misunderstood it completely). The canonical you want is https://stackoverflow.com/questions/533905/, but that only partially addresses the problem. — Karl Knechtel, Mar 02 '23 at 03:16
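
The `str.replace()` suggestion from the comments above can be sketched as follows. This is only an illustration of the `count` argument, with the values picked by hand; it does not enumerate the combinations. Because each call replaces just the first remaining occurrence, no offsets need to be tracked when the string length changes.

```python
template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

# The third argument limits the number of replacements,
# so each call fills only the first remaining occurrence.
step1 = template.replace("~PERSON~", "John Davies", 1)
step2 = step1.replace("~PERSON~", "Tom Banton", 1)
step3 = step2.replace("~GPE~", "UK", 1)

print(step3)
# My name is John Davies. I travel to UK with Tom Banton every year.
```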

norie · Accepted Answer · 2021-07-24T16:23:52.107

Here are three ways you could do it (the third was added in a later edit). They all produce the desired output.

Nested Loops

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = []
for gpe in data_in['~GPE~']:
    for person1 in data_in['~PERSON~']:
        for person2 in data_in['~PERSON~']:
            if person1 != person2: 
                data_out.append(f'My name is {person1}. I travel to {gpe} with {person2} every year.')

print('\n'.join(data_out))

List Comprehension

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = [
    f'My name is {person1}. I travel to {gpe} with {person2} every year.'
    for gpe in data_in['~GPE~']
    for person1 in data_in['~PERSON~']
    for person2 in data_in['~PERSON~']
    if person1 != person2
]

print('\n'.join(data_out))

Using merge from Pandas

Note: this code requires pandas version 1.2 or above.

import pandas as pd

data = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

country = pd.DataFrame({'country':data['~GPE~']})
person = pd.DataFrame({'person':data['~PERSON~']})

cart = country.merge(person, how='cross').merge(person, how='cross')

cart.columns = ['country', 'person1', 'person2']

# drop=True discards the old index instead of adding it as an 'index' column
cart = cart.query('person1 != person2').reset_index(drop=True)

cart['sentence'] = cart.apply(lambda row: f"My name is {row['person1']}. I travel to {row['country']} with {row['person2']} every year." , axis=1)

sentences = cart['sentence'].to_list()

print('\n'.join(sentences))

This is very close to what I want. Can this be extended to cover all keys of the dictionary? This solution is perfect if it can avoid assumptions. I can of course use another loop to iterate over the keys to get "~GPE~" and "~PERSON~", but the solution assumes I know how many times each placeholder needs to be replaced and adds as many "for loops" for it. e.g. '~PERSON~' is present twice hence needs two loops, but cannot determine this at runtime. — Japun Japun, Jul 24 '21 at 11:41
How will python know how many times you want to replace a placeholder? I'll try to add something later to my answer that shows how to do a cartesian product, without looping, using pandas, but if you do it that way you would need to process the data further to get the exact results you want. — norie, Jul 24 '21 at 13:33
The only way I know to ensure all instances of a placeholder get replaced is using re.finditer. If only the count is needed, len(re.findall(r'~PERSON~', template)) could work. Thanks for your efforts! — Japun Japun, Jul 24 '21 at 14:40
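
The generalisation asked for in the comments above could look roughly like this. It is a minimal sketch, not part of the original answer: the function name `expand_template` is hypothetical, and it assumes the placeholders are fixed strings that do not overlap one another. Occurrences are counted at runtime, `itertools.permutations` picks distinct values for each placeholder, and `itertools.product` combines the placeholders.

```python
import re
from collections import Counter
from itertools import permutations, product


def expand_template(template, label_dict):
    """Yield each filled-in string; a value is never reused for the same placeholder."""
    if not label_dict:
        yield template
        return
    # One regex that matches any placeholder literally.
    pattern = re.compile("|".join(re.escape(label) for label in label_dict))
    # Count how often each placeholder occurs, in order of first appearance.
    counts = Counter(pattern.findall(template))
    labels = list(counts)
    # For each placeholder: every ordered selection of distinct values.
    choices = [permutations(label_dict[label], counts[label]) for label in labels]
    for combo in product(*choices):
        # One iterator per placeholder, consumed left to right by the callback.
        supply = {label: iter(values) for label, values in zip(labels, combo)}
        yield pattern.sub(lambda m: next(supply[m.group(0)]), template)


label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
}
template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

results = list(expand_template(template, label_dict))
print('\n'.join(results))
```

For the template from the question this yields 12 sentences, though not necessarily in the order listed there. If a placeholder occurs more often than it has values, `permutations` produces nothing and the result is empty. Letting `re.sub` call a function for each match also sidesteps the offset bookkeeping, since the replacement happens in a single pass over the original template.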