How to efficiently check page text for a list of keywords in Selenium?

Question

I have a list of keywords (['a', 'b', 'c']) and I'd like to use Selenium to check which of them appear on a given page, ideally with the number of occurrences of each.

The naive way would be to look for each keyword separately using XPath (//*[contains(text(),'a')]), or in the body text, page source, etc., but it seems like overkill to go over the entire page again and again for each string.

I have quite a few sites to go over, so I'd like to do it efficiently. Should I just get all the text from the entire <html> (including the title and the description above the <body>) and then do all the searching myself outside of Selenium (e.g. with Rabin-Karp), or is there a reasonable out-of-the-box solution?

score 1 · Answer 1 · answered Dec 20 '21 at 11:09

You can search for elements containing any of the given strings with a single XPath expression such as:

//*[contains(text(),'a') or contains(text(),'b') or contains(text(),'c')]

After that, check which specific keyword is present in each matched element and update the counters accordingly.
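The two steps above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the class and method names (`KeywordXPath`, `buildXPath`, `countInTexts`) are hypothetical, keywords are assumed to contain no single quotes, and the element texts are passed in as plain strings so the counting logic runs without a browser. In a real run they would come from `driver.findElements(By.xpath(...))` and `WebElement.getText()`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KeywordXPath {

    // Builds //*[contains(text(),'a') or contains(text(),'b') ...].
    // Assumption: keywords contain no single quotes (not escaped here).
    static String buildXPath(List<String> keywords) {
        return keywords.stream()
                .map(k -> "contains(text(),'" + k + "')")
                .collect(Collectors.joining(" or ", "//*[", "]"));
    }

    // Counts non-overlapping, case-sensitive occurrences of each keyword
    // across the texts of the matched elements.
    static Map<String, Integer> countInTexts(List<String> elementTexts, List<String> keywords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            int total = 0;
            if (!keyword.isEmpty()) { // an empty keyword would never advance the search
                for (String text : elementTexts) {
                    int from = 0;
                    while ((from = text.indexOf(keyword, from)) >= 0) {
                        total++;
                        from += keyword.length();
                    }
                }
            }
            counts.put(keyword, total);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keywords = List.of("a", "b", "c");
        System.out.println(buildXPath(keywords));
        // Stand-ins for the texts of the elements returned by the XPath query.
        List<String> texts = List.of("a cab", "bbc");
        System.out.println(countInTexts(texts, keywords));
    }
}
```

Note that in XPath 1.0, `contains(text(), ...)` only inspects the first text node child of each element, so text that follows a nested tag inside the same element is not considered by the query itself.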

answered Dec 20 '21 at 11:09

Prophet


how would I then perform the actual counts - take all the resulting elements and concatenate their text values? Is this more efficient than just taking the entire text to begin with? Do I need to do any "waiting" in order to ensure the elements are actually "present"? – rudolfovic Dec 20 '21 at 11:12
1) Once you have a specific web element that you know contains one of the specified strings, you can check exactly which string is there. You asked about `perform all the searching on my own outside of the scope of Selenium` – Prophet Dec 20 '21 at 11:17
2) Will this be more efficient? I think so. Each pass over the entire web page with Selenium takes significant time, while working on an already retrieved element, without going through the entire page with Selenium again, takes much less. – Prophet Dec 20 '21 at 11:19
3) Selenium generally requires some waiting. The page should be completely loaded before you can reliably examine its content. – Prophet Dec 20 '21 at 11:21
1 - of course I'd prefer to just use some built-in Selenium function that will take a list of keywords from me and will return a dictionary with their respective counts but I imagine this doesn't exist so I'm assuming I need to perform the search myself but I don't actually know – rudolfovic Dec 20 '21 at 11:22
AFAIK there is no such tool. What exactly don't you know? – Prophet Dec 20 '21 at 11:23
I have tens of these words and also something I forgot to specify is that I need my search to be case-insensitive so I suspect the ideal way of doing this would involve somehow getting the entire text and then iterating through all the words (`split()`) to check if it's a keyword and then incrementing the count for that keyword if necessary – rudolfovic Dec 20 '21 at 11:33
so I could call `.lower()` on the entire text (or each word) before checking for equivalence – rudolfovic Dec 20 '21 at 11:33
You can use `translate` in your XPath expression to make the match case-insensitive, as described here: https://stackoverflow.com/questions/8474031/case-insensitive-xpath-contains-possible/23388974 – Prophet Dec 20 '21 at 11:36
Anyway, your use case is quite complicated, so I don't believe it has a simple, short and quick solution. – Prophet Dec 20 '21 at 11:38
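For reference, the `translate` approach mentioned in the comments could look roughly like the sketch below. It is an illustration under assumptions: the class and method names are hypothetical, only ASCII letters are folded to lower case, and keywords are assumed to contain no single quotes.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class CaseInsensitiveXPath {
    private static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";

    // XPath 1.0 has no lower-case() function, so translate() maps A-Z to a-z
    // on the element text; the keyword itself is lower-cased in Java.
    static String buildXPath(List<String> keywords) {
        String lowered = "translate(text(),'" + UPPER + "','" + LOWER + "')";
        return keywords.stream()
                .map(k -> "contains(" + lowered + ",'" + k.toLowerCase(Locale.ROOT) + "')")
                .collect(Collectors.joining(" or ", "//*[", "]"));
    }

    public static void main(String[] args) {
        // The resulting string would be passed to By.xpath(...) in Selenium.
        System.out.println(buildXPath(List.of("Selenium", "HTTP")));
    }
}
```

With tens of keywords the expression gets long, since the `translate(...)` call is repeated for every keyword, which is one reason fetching the text once and matching it outside Selenium may be simpler.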

score 0 · Answer 2 · answered Dec 20 '21 at 14:44

If you do not need to take the structure of the content into account, it is completely fine to take the entire text of the page and count the keyword occurrences in it.

Here is a short demo:

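import java.io.IOException;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

public class KeywordCounter {
    public static void main(String[] args) throws IOException {
        WebDriver driver = null;
        List<String> keyWords = Arrays.asList("selenium", "http", "something");
        try {
            driver = new RemoteWebDriver(
                    new URL("http://selenium-hub:4444"),
                    new ChromeOptions()
            );
            driver.get("https://www.webelement.click/en/welcome");
            // Fetch the visible text of the page once; all matching happens locally.
            String total = driver.findElement(By.tagName("body")).getText();
            for (String keyWord : keyWords) {
                // Pattern.quote treats the keyword as a literal, not as a regex.
                Pattern p = Pattern.compile(Pattern.quote(keyWord), Pattern.CASE_INSENSITIVE);
                Matcher m = p.matcher(total);
                int i = 0;
                while (m.find())
                    i++;
                System.out.println("Keyword [" + keyWord + "] has [" + i + "] occurrences");
            }
        } finally {
            if (driver != null) {
                driver.quit();
            }
        }
    }
}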

So this is O(mn) where m is the number of keywords and n is the length of the text. Can this not be improved to avoid endless passes? – rudolfovic, Dec 20 '21 at 15:35
I am not sure how to estimate the complexity of this because it involves regex matching. However, I believe this approach is close to optimal, because you query the DOM only once. Unless you are matching hundreds of thousands of keywords, the bottleneck is traversing the DOM tree. – Alexey R., Dec 20 '21 at 15:47
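If the per-keyword passes do become a concern, one possible variation, along the lines of the `split()` and `.lower()` idea from the comments on the first answer, is to tokenize the page text once and look each token up in a hash map. The sketch below is an assumption-laden illustration rather than part of the answer: the class and method names are hypothetical, the page text is passed in as a plain string (in practice it would come from `getText()` on the body element), and it counts whole-word matches only, so it will not find keywords inside longer words or keywords made of several words.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class SinglePassCount {

    // Counts whole-word, case-insensitive matches in one pass over the text.
    static Map<String, Integer> count(String pageText, List<String> keywords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            counts.put(keyword.toLowerCase(Locale.ROOT), 0);
        }
        // Split on runs of characters that are neither letters nor digits.
        for (String token : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // Only tokens that are keywords are present in the map.
            counts.computeIfPresent(token, (k, v) -> v + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Selenium drives browsers over HTTP; selenium is not http.";
        System.out.println(count(text, List.of("selenium", "http", "something")));
    }
}
```

Each token costs one expected constant-time map lookup, so the work grows with the length of the text rather than with the text length times the number of keywords. For substring (rather than whole-word) matching against many keywords, a multi-pattern algorithm such as Aho-Corasick would be the usual choice.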
