I am trying to fetch a page with Apache's HttpGet, but it throws this exception:

Illegal character in path at index 65: http://doctorat.tuiasi.ro/Htm/Proiecte_POSDRU_17.02.2013/Proiecte europene.html

I know the space in the URL is probably the cause, but I am already trying to filter the URL like this:

String url = everyUrl.getUrl().replaceAll(" ", "%20");
if (url.contains("http://")) {
    Pattern allowedUrlCharacters = Pattern
            .compile("([A-Za-z0-9_.~:/?\\#\\[\\]@!$&'()*+,;=-]|%[0-9a-fA-F]{2})+");
    Matcher matcher = allowedUrlCharacters.matcher(url);
    if (matcher.find()) {
        pushInFrontQueues(url);
    }
    // System.out.println(this.frontQueues.get(0).size());
}

What am I doing wrong? Can anyone help me, please?

1 Answer

The thing is, your regex is finding a valid string, because matcher.find() succeeds as soon as any part of the input matches. In fact, it's finding two valid strings: the part of the URL before the space and the part after it.

You need to make sure it only matches if the entire string is valid. You can do that by surrounding your regex with ^ and $ (or by calling matcher.matches() instead of matcher.find()), like so:

"^([A-Za-z0-9_.~:/?\\#\\[\\]@!$&'()*+,;" + "=-]|%[0-9a-fA-F]{2})+$"

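To make the difference concrete, here is a minimal sketch that runs the question's character set against the failing URL, both with and without anchors. The class name UrlCharCheck and its helper methods are hypothetical and exist only for illustration; the pattern itself is copied from the question.

```java
import java.util.regex.Pattern;

class UrlCharCheck {
    // Character set copied from the question (hypothetical constant name).
    static final String ALLOWED =
            "([A-Za-z0-9_.~:/?\\#\\[\\]@!$&'()*+,;=-]|%[0-9a-fA-F]{2})+";

    static final Pattern UNANCHORED = Pattern.compile(ALLOWED);
    static final Pattern ANCHORED = Pattern.compile("^" + ALLOWED + "$");

    // True if ANY substring of the input consists of allowed characters.
    static boolean findsSomething(String url) {
        return UNANCHORED.matcher(url).find();
    }

    // True only if the WHOLE input consists of allowed characters.
    static boolean isWhollyValid(String url) {
        return ANCHORED.matcher(url).find();
    }

    // Same whole-input check, using matches() instead of anchors.
    static boolean matchesWhole(String url) {
        return UNANCHORED.matcher(url).matches();
    }

    public static void main(String[] args) {
        String raw = "http://doctorat.tuiasi.ro/Htm/Proiecte_POSDRU_17.02.2013/Proiecte europene.html";
        String encoded = raw.replace(" ", "%20");

        System.out.println(findsSomething(raw));    // true: the text before the space matches
        System.out.println(isWhollyValid(raw));     // false: the space is not allowed
        System.out.println(isWhollyValid(encoded)); // true: %20 is a valid escape
    }
}
```

With the unanchored pattern and find(), the URL containing a space still passes the check, which is why it can reach HttpGet unchanged.
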
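However, keep in mind how this pattern treats percent signs. Because % is not in the character class, a % only matches when two hex digits follow it: something%2else passes (%2e is a valid escape for "."), while something%2xlse is rejected. If you want the percent-encoding rule to stand out, you can put that alternative first:

"^(%[0-9a-fA-F]{2}|[A-Za-z0-9_.~:/?\\#\\[\\]@!$&'()*+,;=-])+$"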
dumptruckman