I want to convert a set of strings to a regular expression using Java.

I searched extensively but found no satisfying answer online that resolves my issue, so I am asking here.

First, is this possible at all? If so, please suggest a way to do it.

Let's suppose I have a set of strings

abb
abababb
babb
aabb
bbbbabb
...

and I want to make a regular expression for it, such as

(a+b)*abb

How can this be done?

Sabaoon Bedar
  • What would you gain from converting a string to a "regular expression"? A regex is a rule that cannot be defined by one given string, unless you are planning on having a regex that accepts that very string alone. – Avi Meltser Jun 07 '19 at 18:37
  • Let's suppose I have a set of strings {abb, abababb, babb, aabb, bbbbabb, ...} and I want to make a regular expression for it such as "(a+b)*abb". How can this be done? – Sabaoon Bedar Jun 07 '19 at 18:41
  • How about abb|abababb|babb|aabb|bbbbabb – David Zimmerman Jun 07 '19 at 18:54
  • I only want to convert this specific set of strings to a regular expression. I am not looking for an automatic generator that takes strings and produces a regular expression; my focus is on the regular expression mentioned above. – Sabaoon Bedar Jun 07 '19 at 19:05
  • I don't understand the question. Just make a regular expression, like you write in your question? Are you looking for an automated way to create such a regex? – Robert Jun 07 '19 at 19:20
  • At this point, it is more a software engineering problem. You could create an automaton that accepts this language and then minimize it using regular minimization algorithms. Read [Regular language](https://en.wikipedia.org/wiki/Regular_language) and [Automata theory](https://en.wikipedia.org/wiki/Automata_theory) if you want to go such a route. – Zabuzard Jun 07 '19 at 19:21

3 Answers

If you have a collection of strings, and want to build a regex that matches any of those strings, you should build a regex that uses the | OR pattern.

Since the strings could contain regex special characters, they need to be quoted.

To make sure the best string matches, you need to try the longest string first. E.g. if aba and abax are both on the list, and the text to scan contains abax, we'd want to match the second string, not the first one.

So, you can do it like this:

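import java.util.Comparator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Longest strings first, each quoted as a literal, joined with the | operator.
public static String toRegex(Iterable<String> strings) {
    return StreamSupport.stream(strings.spliterator(), false)
            .sorted(Comparator.comparingInt(String::length).reversed())
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
}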
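As a rough usage sketch of the method above (the class name `ToRegexDemo` and the test inputs are made up for illustration, and `toRegex` is repeated only so the example compiles on its own), the resulting alternation can be compiled and checked against the sample strings from the question:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

class ToRegexDemo {
    // Same approach as above: longest first, each string quoted, joined with |.
    static String toRegex(Iterable<String> strings) {
        return StreamSupport.stream(strings.spliterator(), false)
                .sorted(Comparator.comparingInt(String::length).reversed())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        // The five sample strings from the question.
        List<String> samples = Arrays.asList("abb", "abababb", "babb", "aabb", "bbbbabb");
        Pattern pattern = Pattern.compile(toRegex(samples));

        System.out.println(pattern.pattern());
        // matches() requires the whole input to match one alternative.
        System.out.println(pattern.matcher("babb").matches());  // true: "babb" is in the list
        System.out.println(pattern.matcher("bbabb").matches()); // false: "bbabb" is not in the list
    }
}
```

Note that a pattern built this way accepts exactly the listed strings and nothing else; it does not generalize to a rule such as "any string of a's and b's ending in abb".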
Andreas

You can use the Pattern.compile method described here.

double-beep
Anthony
  • This answer is wrong. Pattern, in conjunction with Matcher, is for something completely different. Pattern will compile a regular expression as long as it is valid; if not, a PatternSyntaxException is thrown. Then you can create Matchers for a given Pattern. With the Matcher you can do things like count the number of occurrences of the given Pattern. Unless you provide a regexp like "aaa|bbb|ccc|..." for all the strings he is talking about, which to me is totally crazy. – Perimosh Jun 07 '19 at 19:14
  • Correct, this answer is no longer relevant once the question was updated. – Anthony Jun 07 '19 at 19:15
  • Ok then I was writing while the editor was editing :D – Perimosh Jun 07 '19 at 19:17

I don't believe you can.

The problem is that you want to provide only some of the total collection of valid strings and the algorithm has no way of inferring the exact complete set from the given subset. If you do provide the complete set of valid strings (and it doesn't seem like you can), then you can use David Zimmerman's answer in the comments. Or, perhaps more efficiently, just use a Set to hold the complete set of valid strings and to test candidate strings.
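To illustrate the ambiguity, here is a small sketch comparing an exact Set of the five sample strings with a hand-written pattern. It assumes the `+` in the question's `(a+b)*abb` is the formal-language "or", which Java regex syntax writes as `|`; the class and method names are invented for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class SubsetAmbiguityDemo {
    // The five sample strings from the question, held as an exact set.
    static final Set<String> SAMPLES = new HashSet<>(
            Arrays.asList("abb", "abababb", "babb", "aabb", "bbbbabb"));

    // Hand-written Java form of (a+b)*abb, assuming '+' means "or" there.
    static final Pattern ENDS_WITH_ABB = Pattern.compile("(a|b)*abb");

    static boolean inSampleSet(String s) {
        return SAMPLES.contains(s);
    }

    static boolean matchesGeneralRule(String s) {
        return ENDS_WITH_ABB.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "bbabb" is where the two disagree: it is not a sample, but the rule accepts it.
        for (String s : new String[] {"babb", "bbabb", "abba"}) {
            System.out.println(s + " set=" + inSampleSet(s) + " rule=" + matchesGeneralRule(s));
        }
    }
}
```

Both the set and the pattern accept every sample string, yet they disagree on "bbabb", so the samples alone cannot tell a program which of the two the asker intended.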

Chris Gerken