Regular Expressions: How to check if a text contains at least all the letters in a character set?

Question

I have a string (essentially an abbreviation - e.g. USA, with all letters in capitals) and a list of texts. I want to select those texts containing all the letters in the string (case-sensitive match). For example,

string = "USA"

texts = ["United States of America", "United States", "United States of America and Iraq"]

# Result should be:

results = ["United States of America", "United States of America and Iraq"]

I have tried (?=U)(?=S)(?=A) (which is what the answers to the duplicate question suggest), but this doesn't seem to work, as the regex appears to expect the letters to occur in exact sequence. I also do not want to match the lowercase letters and spaces following each of the capitals, i.e., [?=U]([a-zA-Z]*[\s]+)*[?=S]([a-zA-Z]*[\s]+)*[?=A][a-zA-Z]*, as that would be redundant (and it does not work perfectly either).

What I am looking for is an expression analogous to [USA], which performs an OR operation and selects texts containing at least one letter of the string. Is there an equally elegant expression that carries out an 'AND' operation in regex?

How do you tell `USA` means `United States of America`? and why should the result have `United States of America and Iraq`? — DirtyBit, Feb 20 '19 at 08:59
`(?=U)(?=S)(?=A)` actually means that the char at the current location must be `U`, `S` and `A` all at once, so it will never match. You need to add `.*` before each of these letters, or `\w*` if you need them to appear in one word. — Wiktor Stribiżew, Feb 20 '19 at 08:59
@user5173426 because the text contains all three letters (doesn't matter if any repetition or any additional capital letter is present). — Saurav--, Feb 20 '19 at 09:10
There is no elegant solution other than the one I linked to. `[USA]` is a character class that matches one char out of the specified set/ranges. `USA` matches a *sequence* of chars, `USA`. That is all. If you need to match words having `U`, `S` and `A` in any order and in any quantity but at least once, you need `\b(?=\w*U)(?=\w*S)(?=\w*A)\w+` or its variations. There is no other way. — Wiktor Stribiżew, May 06 '19 at 18:02
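
For checking a whole text rather than a single word, the lookahead idea from the comments above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper name `build_pattern` is hypothetical, and `re.DOTALL` is an added assumption so that `.` also crosses line breaks.

```python
import re

def build_pattern(letters):
    # Hypothetical helper: one lookahead per required character.
    # Each (?=.*X) asserts that X occurs somewhere at or after the
    # current position, without consuming any input.
    lookaheads = "".join(f"(?=.*{re.escape(c)})" for c in letters)
    return re.compile(lookaheads, re.DOTALL)

string = "USA"
texts = ["United States of America", "United States", "United States of America and Iraq"]

pattern = build_pattern(string)

# pattern.match() tries only position 0, so every lookahead scans the
# text from its beginning; order and repetition of letters do not matter.
results = [text for text in texts if pattern.match(text)]
print(results)
```

Since all lookaheads are evaluated from the same position, this behaves like an 'AND' over the characters, and the match stays case-sensitive unless `re.IGNORECASE` is passed.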

score 0 · Answer 1 · answered Feb 20 '19 at 09:08

You might be looking for all() in combination with in:

string = "USA"

texts = ["United States of America", "United States", "United States of America and Iraq", "Germany"]
# For each text, check that every character of `string` occurs in it
vector = [all(c in text for c in string) for text in texts]

This yields

[True, False, True, False]

So, in combination with filter() you don't need any regular expression:

new_text = list(
    filter(
        # keep a text only if it contains every character of `string`
        lambda text: all(c in text for c in string),
        texts
    )
)
print(new_text)

The latter yields

['United States of America', 'United States of America and Iraq']
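
The same membership test can also be written with sets. The snippet below is only a sketch of that alternative (the names `required` and `results` are illustrative, not part of the answer above); whether it is faster than the `all()` version on a long list is something to time on real data rather than assume.

```python
string = "USA"
texts = ["United States of America", "United States", "United States of America and Iraq", "Germany"]

# The characters that every selected text must contain (case-sensitive).
required = set(string)

# set.issubset() accepts any iterable, so the text itself can be passed in.
results = [text for text in texts if required.issubset(text)]
print(results)
```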

Jan


Thanks. However, I have a really really long list and I fear that list comprehension might not be an efficient way. I am therefore opting for a regex match. – Saurav-- Feb 20 '19 at 09:12
@Saurav--: Then please look at the answer in the linked duplicate. – Jan Feb 20 '19 at 09:14
There is a reason I had to ask this. I have already mentioned that the solution of the duplicate question (?=) doesn't seem to handle corner cases here. – Saurav-- Feb 20 '19 at 09:17
