
I am processing a corpus of around 10 million files. Some files contain URLs with a backslash ('\') in them. I want to remove all URLs from those files. The following works fine until it encounters a URL containing a backslash.

public static String removeUrl(String str)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure):((//)|(\\\\))[\\w\\d:#@%/;$~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(str);
    while (str!=null && m.find()) {
        str = str.replaceAll(m.group(0)," ").trim(); // ERROR occurs here when m.group(0) is a URL containing '\'
    }
    return str;
}

Any help?

Jordi Castilla

2 Answers


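The problem is backslash escaping: replaceAll treats its first argument as a regular expression, so a matched URL that ends in a single backslash is not a valid pattern. removeUrl("http://go.com\\\\") doesn't throw the error, but removeUrl("http://go.com\\") does. You might have to manipulate the string before you call replaceAll, for example by stripping the backslashes with str.replaceAll("\\\\", "");.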

Also, the exception is thrown only by str.replaceAll("\\", ""); and not by str.replace("\\", "");, because replace treats its argument as a literal string rather than as a regex.
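As a rough sketch of how the method could avoid the exception without stripping backslashes first: the pattern below is the one from the question, while the class name UrlRemover and the method removeUrlLiteral are made up for illustration. The first variant lets the Matcher perform the replacement, so the matched URL is never parsed as a regex; the second keeps the original loop but uses the literal String.replace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the pattern itself comes from the question.
final class UrlRemover {
    // Compiled once so it can be reused across many files.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "((https?|ftp|gopher|telnet|file|Unsure):((//)|(\\\\))[\\w\\d:#@%/;$~_?\\+-=\\\\\\.&]*)",
            Pattern.CASE_INSENSITIVE);

    // Variant 1: the Matcher replaces every match directly,
    // so the matched text is never re-interpreted as a regex.
    static String removeUrl(String str) {
        if (str == null) {
            return null;
        }
        return URL_PATTERN.matcher(str).replaceAll(" ").trim();
    }

    // Variant 2: same loop as in the question, but String.replace
    // treats the matched URL as a literal string.
    static String removeUrlLiteral(String str) {
        if (str == null) {
            return null;
        }
        Matcher m = URL_PATTERN.matcher(str);
        while (m.find()) {
            str = str.replace(m.group(0), " ").trim();
        }
        return str;
    }
}
```

If the loop with replaceAll has to stay for some reason, wrapping the match in Pattern.quote(m.group(0)) should also prevent it from being read as a regex.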

Edit: Just saw this

Community

This regex works for me.

[a-zA-Z]+:\/\/([a-zA-Z0-9\.\-_])+(:[0-9]+)?([\/\\][a-zA-Z0-9\._\-]*)*(\?(&?[a-zA-Z0-9_\-\.]+=[a-zA-Z0-9_\-\.]+)+)?

It matches all of these:

http://test.test.test:123/test.test/test?blah=23&bluh=23
http://test.test.test/test.test/?blah=blah
http://ttes-test.comsa234/ase/ase
abc://test.test
abc://test.test:900
abc://test.test/
abc://test.test\
abc://test.test\test
abc://test.test:90/test\test/test
abc://wow/test?this=works&and=worksagain
cde://yay/what/yes.com/hi_there\?param=value&param=value
withdash://its-dash/another-dash\okay

You can test it with regex101.
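For use from Java, each backslash in the regex above has to be doubled inside the string literal. A minimal sketch, assuming the regex is used unchanged; the class name UrlRegexCheck and its two methods are invented for this example:

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from this answer.
final class UrlRegexCheck {
    // The regex above, with every backslash doubled for a Java string literal.
    static final Pattern URL = Pattern.compile(
            "[a-zA-Z]+:\\/\\/([a-zA-Z0-9\\.\\-_])+(:[0-9]+)?"
            + "([\\/\\\\][a-zA-Z0-9\\._\\-]*)*"
            + "(\\?(&?[a-zA-Z0-9_\\-\\.]+=[a-zA-Z0-9_\\-\\.]+)+)?");

    // True only if the whole string is a URL according to the regex.
    static boolean isUrl(String s) {
        return URL.matcher(s).matches();
    }

    // Replaces every URL found in the text with a single space.
    static String removeUrls(String text) {
        return URL.matcher(text).replaceAll(" ").trim();
    }
}
```

Because the replacement is done by the Matcher, a backslash inside a matched URL is never interpreted as part of a pattern.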

Ferdinand Neman