I am processing a corpus of around 10 million files. Some of the files contain URLs with a backslash ('\') in them. I want to strip all URLs from those files. The following method works fine until it encounters a URL containing a backslash:
    public static String removeUrl(String str)
    {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure):((//)|(\\\\))[\\w\\d:#@%/;$~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(str);
        while (str != null && m.find()) {
            // ERROR occurs here when m.group(0) is a URL containing '\'
            str = str.replaceAll(m.group(0), " ").trim();
        }
        return str;
    }
Any help would be appreciated.
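A likely cause: `String.replaceAll` treats its first argument as a regular expression, so when the matched URL is passed back in, any backslash in it is read as a regex escape and can trigger a `PatternSyntaxException`. One way around this, sketched below, is to let the `Matcher` do the replacement itself, so the matched text is never recompiled as a pattern. The class name `UrlRemover` is only a placeholder for illustration, and the pattern is copied unchanged from the question.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only removeUrl comes from the question.
final class UrlRemover {

    // Same pattern as in the question, compiled once so it can be reused
    // across many files instead of being recompiled on every call.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "((https?|ftp|gopher|telnet|file|Unsure):((//)|(\\\\))[\\w\\d:#@%/;$~_?\\+-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    public static String removeUrl(String str) {
        if (str == null) {
            return null;
        }
        // Matcher.replaceAll substitutes every match in a single pass.
        // The matched URL is never interpreted as a regex, so backslashes
        // inside it are harmless.
        return URL_PATTERN.matcher(str).replaceAll(" ").trim();
    }
}
```

If the loop from the question has to stay, two other options should also avoid the problem: `str.replace(m.group(0), " ")`, which does a literal (non-regex) replacement, or `str.replaceAll(Pattern.quote(m.group(0)), " ")`, which escapes the matched text before it is used as a pattern.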