I am using StormCrawler 1.10 and Elasticsearch 6.3.x. For example, I have a main website, https://www.abce.org, with subpages such as https://abce.org/def and https://abce.org/ghi. I want to crawl only the pages under https://www.abce.org/ghi.
My seed URL is https://www.abce.org/ghi/.
So far I have tried each of the following regex filters, one at a time:
+^https:\/\/www.abce.org\/ghi*
+^(?:https?:\/\/)www.abce.org\/ghi(.+)*$
+^(?:https?:\/\/)?(?:www\.)?abce\.[a-zA-Z0-9.\S]+$
I tested my regular expressions on RegExr and they show as valid. But when I check the status index, it shows only the seed URL as discovered and nothing else.
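For anyone who wants to reproduce the check outside RegExr, here is a minimal Java sketch that runs the three rules against the seed URL and one of the other subpages. It assumes the filter drops the leading `+` and applies the rest of the line with `java.util.regex` find-style matching; that is an assumption about the crawler's behaviour, not something confirmed here. The class `RegexFilterCheck` and the method `accepts` are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

public class RegexFilterCheck {

    // Hypothetical helper: takes a rule line as written in the filter file
    // (leading '+' followed by the pattern) and reports whether the URL matches.
    // Assumption: find()-style matching, i.e. the pattern may match a substring
    // unless it is anchored with ^ and $.
    public static boolean accepts(String rule, String url) {
        if (!rule.startsWith("+")) {
            throw new IllegalArgumentException("expected an include rule starting with '+'");
        }
        Pattern pattern = Pattern.compile(rule.substring(1));
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        // The three rules from the question, with backslashes doubled for Java strings.
        String[] rules = {
            "+^https:\\/\\/www.abce.org\\/ghi*",
            "+^(?:https?:\\/\\/)www.abce.org\\/ghi(.+)*$",
            "+^(?:https?:\\/\\/)?(?:www\\.)?abce\\.[a-zA-Z0-9.\\S]+$"
        };
        // The seed URL and one subpage that should stay out of the crawl.
        String[] urls = {
            "https://www.abce.org/ghi/",
            "https://abce.org/def"
        };
        for (String rule : rules) {
            for (String url : urls) {
                System.out.println(rule + "  " + url + "  -> " + accepts(rule, url));
            }
        }
    }
}
```

This only shows whether each pattern matches a given URL string. It does not cover the order in which the crawler evaluates rules or any other filters in the chain.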