I am trying to crawl URLs in order to extract other URLs from each page. To do so, I read the HTML source of the page line by line, match each line against a pattern, and then extract the part I need, as shown below:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {

    static String pattern = "https://www\\.([^&]+)\\.(?:com|net|org)/([^&]+)";
    static Pattern UrlPattern = Pattern.compile(pattern);
    static Matcher UrlMatcher;

    public static void main(String[] args) {
        try {
            URL url = new URL("https://stackoverflow.com/");
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            while ((line = br.readLine()) != null) {
                UrlMatcher = UrlPattern.matcher(line);
                if (UrlMatcher.find()) {
                    // group(1): the part between "www." and the domain extension
                    // group(2): the part after the "/", without the leading slash
                    String extractedPath = UrlMatcher.group(1);
                    String extractedPath2 = UrlMatcher.group(2);
                    // ".com" is hard-coded here even though the pattern also accepts .net and .org
                    System.out.println("http://www." + extractedPath + ".com" + extractedPath2);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
However, there are a few issues with it that I would like to address:
- How is it possible to make http, www, or both of them optional? I have encountered many links that come without either or both of these parts, so the regex does not match them (see the sketch after this list).
- According to my code, I capture two groups: the first is the part between www. and the domain extension, and the second is whatever comes after it. This, however, causes two sub-problems:
  2.1 Since the input is HTML code, any HTML tags that come after the URL on the same line are extracted too.
  2.2 In System.out.println("http://www." + extractedPath + ".com" + extractedPath2); I cannot be sure it prints the right URL (regardless of the previous issues), because I do not know which domain extension was actually matched, yet I always append ".com".
- Last but not least, how can I match both http and https as well?
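Here is a rough sketch of the kind of pattern I am considering to cover these points, i.e. making the scheme and www. optional, accepting both http and https, capturing the extension instead of hard-coding ".com", and stopping the path before trailing HTML markup. The class name and the sample line are only for illustration, and I am not sure this is the right approach:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSketch {

    // Tentative pattern (my own guess, not verified against real pages):
    // - "(?:https?://)?" makes the scheme optional and accepts both http and https
    // - "(?:www\.)?"     makes the "www." part optional
    // - group(1) = domain name, group(2) = extension, group(3) = path (may be absent)
    // - the path stops at quotes, whitespace and angle brackets so trailing HTML tags are not swallowed
    static String pattern =
            "(?:https?://)?(?:www\\.)?([A-Za-z0-9.-]+)\\.(com|net|org)(/[^\"'\\s<>]*)?";
    static Pattern UrlPattern = Pattern.compile(pattern);

    public static void main(String[] args) {
        // A single HTML line used as a stand-in for br.readLine()
        String line = "<a href=\"https://stackoverflow.com/questions\">link</a>";
        Matcher m = UrlPattern.matcher(line);
        while (m.find()) {
            String domain = m.group(1);                          // e.g. "stackoverflow"
            String extension = m.group(2);                       // e.g. "com"
            String path = m.group(3) == null ? "" : m.group(3);  // e.g. "/questions"
            // Rebuild the URL with the extension that was actually matched
            System.out.println("http://www." + domain + "." + extension + path);
        }
    }
}

What I am mainly unsure about is whether bounding the path on quotes, whitespace and angle brackets is enough to keep the surrounding HTML out, and whether the optional scheme and www. will cause false matches on ordinary text in the page.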