In Android/Java, given a website's HTML source code, I would like to extract all XML and CSV file paths.
What I am doing (with RegEx) is this:
final HashSet<String> urls = new HashSet<String>();
final Pattern urlRegex = Pattern.compile(
"[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|].(xml|csv)");
final Matcher url = urlRegex.matcher(htmlString);
while (url.find()) {
urls.add(makeAbsoluteURL(url.group(0)));
}
public String makeAbsoluteURL(String url) {
if (url.startsWith("http://") || url.startsWith("http://")) {
return url;
}
else if (url.startsWith("/")) {
return mRootURL+url.substring(1);
}
else {
return mBaseURL+url;
}
}
Unfortunately, this runs for about 25 seconds for an average website with normal length. What is going wrong? Is my RegEx just bad? Or is RegEx just so slow?
Can I find the URLs faster without RegEx?
Edit:
The source for the valid characters was (roughly) this answer. However, I think the two character classes (square brackets) must be swapped so that you have a more limited character set for the first char of the URL and a broader character class for all remaining chars. This was the intention.