How to make a regex to see if string contains a certain letter

Question

On a website I found some alternatives to "The quick brown fox jumps over the lazy dog" and I decided to write a little program to check if the alternatives were valid.

For those interested, I wrote the following program (using the filereader idea of this post) that checks a file with the sentences under each other:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TestClass {
    public static void main(String... aArgs)  {
        TestClass tc = new TestClass();
        try {
            String[] pieces = tc.splitFile("/home/user2609980/Desktop/text");
            for (String line : pieces) {
                if (line.contains("a") &&
                        line.contains("b") &&
                        line.contains("c") &&
                        line.contains("d") &&
                        line.contains("e") &&
                        line.contains("f") &&
                        line.contains("g") &&
                        line.contains("h") &&
                        line.contains("i") &&
                        line.contains("j") &&
                        line.contains("k") &&
                        line.contains("l") &&
                        line.contains("m") &&
                        line.contains("n") &&
                        line.contains("o") &&
                        line.contains("p") &&
                        line.contains("q") &&
                        line.contains("r") &&
                        line.contains("s") &&
                        line.contains("t") &&
                        line.contains("u") &&
                        line.contains("v") &&
                        line.contains("w") &&
                        line.contains("x") &&
                        line.contains("y") &&
                        line.contains("z")) {
                    System.out.println("Matches: " + line);
                } else {
                    System.out.println("Does not match: " + line);
                }
            }

        } catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
    }

    public String[] splitFile(String file) throws IOException {

        BufferedReader br = new BufferedReader(new FileReader(file));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();

            while (line != null) {
                sb.append(line);
                sb.append('\n');
                line = br.readLine();
            }
            String everything = sb.toString();
            String[] pieces = everything.split("\n");
            return pieces;
        } finally {
            br.close();
        }

    }
}

And this was the output:

Matches: The quick brown fox jumps over the lazy dog
Does not match: Pack my box with five dozen liquor jugs.
Matches: Several fabulous dixieland jazz groups played with quick tempo.
Does not match: Back in my quaint garden, jaunty zinnias vie with flaunting phlox.
Does not match: Five or six big jet planes zoomed quickly by the new tower.
Matches: Exploring the zoo, we saw every kangaroo jump and quite a few carried babies.
Matches: I quickly explained that many big jobs involve few hazards.
Does not match: Jay Wolf is quite an expert on the bass violin, guitar, dulcimer, ukulele and zither.
Matches: Expect skilled signwriters to use many jazzy, quaint old alphabets effectively.
Matches: The wizard quickly jinxed the gnomes before they vaporized.

I want to improve this program in two ways. One, and that is my question, is how to make a more efficient piece of code instead of checking for each letter of the alphabet seperately. How can I make something like:

line.Contains([regex])

if that is possible?

The bonus question is how I can make this program so that it prints out exactly where it does not match. Of course I can do an if-else for every letter, but I hope there is a prettier way.

Thanks for your attention and I look forward to read your possible response.

Please no bonus question, it's better to ask separate questions so it's more useful to to others — Ruan Mendes, Dec 05 '13 at 22:21
Why does `Pack my box with five dozen liquor jugs.` not match? What's the missing letter(s)? — Ruan Mendes, Dec 05 '13 at 22:58
It does... So there is something wrong with the code. And of course, I did not match uppercase... — user2609980, Dec 06 '13 at 07:02

score 3 · Answer 1 · answered Dec 05 '13 at 22:06

3

Simplest in my opinion is to use a loop like this:

boolean allChars = true;
String uline = line.toUpperCase();
for (char c='A'; c<='Z'; c++) {
    if (uline.indexOf(c) < 0) {
        allChars = false;
        break;
    }
}

i.e. run a loop from 65 (A) to 90 (Z) and check existence of each character in the input string.

answered Dec 05 '13 at 22:06

anubhava

761,203
64
569
643

3

Avoid magic numbers. You can use `for (char c='A'; c<='Z'; c++)`. – Pshemo Dec 05 '13 at 22:09
What about lower case? – user2609980 Dec 05 '13 at 22:09
@Pshemo: Thanks, Good idea, edited. OP: I have already converted input to all uppercase. – anubhava Dec 05 '13 at 22:10
1

@user2609980: I have added `String uline = line.toUpperCase();` to convert input to all uppercase first. – anubhava Dec 05 '13 at 22:11
Okay nice :-). Judging by your response, I guess it is not possible to use a regex in the `contains` method or use another similar one? And I of course also want to find out *which* letter does not match. – user2609980 Dec 05 '13 at 22:14
regex is not impossible for this but will be very long & ugly. It would still require you to check for each and every letter using lookahead. – anubhava Dec 05 '13 at 22:17
Do you want to know all the letters that didn't match? Or just the first letter that didn't match? – anubhava Dec 05 '13 at 22:18
This is `O(n2)`, and what the OP posted was `O(n)`, fewer lines of code, a lot more work? – Ruan Mendes Dec 05 '13 at 22:22
1

@JuanMendes entire condition in OP code can be counted as `n` so it is also `O(n2)` – Pshemo Dec 05 '13 at 22:25
@JuanMendes: I believe both are `O(n2)`. Only difference between OP's code and this code is `contains` vs `indexOf` method calls but # of comparisons are going to be same (this one is definitely less code). – anubhava Dec 05 '13 at 22:28
@anubhava please explain? And I am looking for all the letter that do not match. But if you have an easy method to find the first I will be happy to hear! – user2609980 Dec 05 '13 at 23:00
@anubhava I see, the loop here is not `n x n` it's `26 X n`, since it will scan for all the letters in the alphabet over the whole string, so it's still a constant. In my answer, the constant is only two, which may give the impression that it would be faster. However, the overhead of the map seems to make it not worth it. I created a performance test, it's in JS , but probably holds true for other languages. One thing that is not a surprise is that the regex approach is the slowest of all. http://jsperf.com/look-for-all-letters – Ruan Mendes Dec 05 '13 at 23:29
@user2609980: Your question was not about order of complexity. It was all about doing writing the code cleanly and reducing those multiple if conditions. Had you posted a question on `O(n) vs O(n2)` my answer would have been different. – anubhava Dec 06 '13 at 07:51

score 0 · Accepted Answer · edited May 23 '17 at 12:13

0

Here's a solution that is O(n) that attempts to be faster by avoiding the loop within a loop. However, performance tests indicate that this is not really worth it in this case.

Note that I was mistaken, when I assumed your code was O(n2). It is not, even though you do have a loop within a loop. That's because the outer loop iterates over a constant number (26 letters)

Map<char, boolean> letters = new HashMap<String,boolean>
String uline = line.toUpperCase();

for (int i=0, i < uline.length; i++) {
    letters.put(uline.charAt(i), true );
}


boolean allChars = true;
for (char c='A'; c<='Z'; c++) {
if (letters.get(c) == null) {
   allChars = false;
   break;
}

If you want a regex that represents an AND operation, you can mimic it with positive lookahead assertions, but I have a feeling it's going to be slow. See https://stackoverflow.com/a/470602/227299

(?=.*a)(?=.*b)(?=.*c)(?=.*d)(?=.*e)(?=.*f)(?=.*g)(?=.*h)(?=.*i)(?=.*j)(?=.*k)(?=.*l)(?=.*m)(?=.*n)(?=.*o)(?=.*p)(?=.*q)(?=.*r)(?=.*s)(?=.*t)(?=.*u)(?=.*v)(?=.*w)(?=.*x)(?=.*y)(?=.*z)

Be sure to use the case insensitive modifier

See it in action http://regex101.com/r/yJ4cU6

I created some performance tests to see if it made sense to use my suggested approach and it doesn't. I would stick to what anubhava suggested. Hopefully, the answer helps you think about performance (and premature optimization).

edited May 23 '17 at 12:13

Community

1
1

answered Dec 05 '13 at 22:31

Ruan Mendes

90,375
31
153
217

Can you explain wny one is `O(n)` and some others (we have to avoid) are `O(n^2)`? What does big O-notation mean and how do you calculate it (quickly)? – user2609980 Dec 05 '13 at 22:58
1

Usually `O(n2)` means you have a loop within a loop, So if your line is 100 characters long (your `n`), it will take 10,000 comparisons. This apprach will only run as many comparisons as there are letters, it will do that twice, once to create the map, and once to see if all the letters are in there. So for the same 100 character line, it runs 200 comparisons, instead of 100,000. Here's a nice explanation http://www.javacodegeeks.com/2011/04/simple-big-o-notation-post.html – Ruan Mendes Dec 05 '13 at 23:02
Ah. And why would you say my original code was O(n)? I check every character in every line against the 26 others which will run 26 times roughly times amount of letters (the `O`)? Why is the method of @anubavha O(n^2)? He basically does the same thing right? / edit Thanks for the link! – user2609980 Dec 05 '13 at 23:05
Oh and how do I write the same thing in a functional language, for example Erlang? That seems like a nice exercise and I think things like this will be prettier there. – user2609980 Dec 05 '13 at 23:09
@downvoter Please explain why you downvoted the answer. I'd like to improve it. – Ruan Mendes Dec 05 '13 at 23:35
1

@user2609980 Yes, I was mistaken, both yours and anubhava's code are O(n), my code is also O(n) but the constant factor is lower (2 as opposed to 26). Note that the lower constant didn't end up making it faster, probably because of the map overhead. See the performance tests I linked to – Ruan Mendes Dec 05 '13 at 23:39

How to make a regex to see if string contains a certain letter

2 Answers2