I'm trying to figure out a way to use regular expressions to find duplicate words on a webpage. I'm completely new to this, so I apologise in advance if I'm using incorrect terminology.

So far I've found the following regular expressions, which work well but only on consecutive duplicates (e.g. hello hello), not on words that appear in different parts of the webpage or are separated by another word (e.g. hello food hello):

\b(\w+)(\s+\1\b)*

\b(\w+(?:\s*\w*))\s+\1\b

I would be super grateful to anyone who can help. I realise I might not be in the right place since I'm basically a noob.

Cfreak
  • why do you want/need to use regex? What other logic have you considered? – Scary Wombat Sep 19 '18 at 00:24
  • Welcome to SO. You really shouldn't try to parse web pages with regex. HTML is not a regular language, so you'll be hard pressed to find something that would work with any web page. If you have a specific web page that has some other known elements it's easier, but unless you're in control of it there's no guarantee your code would continue to work, and if you are in control of it then you could store your data in a more convenient way. The best way to parse web pages is to use an HTML parser. There are several good libraries in various languages for doing this. – Cfreak Sep 19 '18 at 00:26
  • I'm going to pile on and agree, regex is not the right tool for this. – Daedalus Sep 19 '18 at 01:04
  • You should consider reading **[this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)** before committing too many resources to regex-based HTML parsers. Also, you should say what your purpose is. Maybe doing this server side (php) is not appropriate to your use case. – YvesLeBorg Sep 19 '18 at 01:06
  • Thanks for the welcome! I'll be 100% honest, I'm trying to do this through the webpage as it's for a web application which I work with. I believe the web application is also running on Ruby. It'll make my work so much better if we can get this working. Thank you and everyone for the help so far, it has been very useful. This is completely new territory for me. – John Wayne Sep 19 '18 at 01:14

2 Answers

Capture a word (delimited by word boundaries) in a group, then use a lookahead that skips over any characters in between and backreferences the group, so the match succeeds only if the same word appears again later:

\b(\w+)\b(?=.*\b\1\b)

https://regex101.com/r/TcS1UW/3
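
As a rough illustration of how the pattern behaves, here is a minimal Java sketch (Java only because the other answer in this thread uses it). The names `RepeatedWords` and `findRepeated` are made up for this example. It collects every word that appears again later in the same text:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWords {
    // A whole word, followed somewhere later (on the same line) by the same word.
    private static final Pattern REPEATED =
            Pattern.compile("\\b(\\w+)\\b(?=.*\\b\\1\\b)");

    // Returns each word that occurs again later in the text, in first-seen order.
    static Set<String> findRepeated(String text) {
        Set<String> repeated = new LinkedHashSet<>();
        Matcher m = REPEATED.matcher(text);
        while (m.find()) {
            repeated.add(m.group(1));
        }
        return repeated;
    }

    public static void main(String[] args) {
        // "hello" is separated from its duplicate by another word.
        System.out.println(findRepeated("hello food hello")); // prints [hello]
    }
}
```

Note that the comparison is case-sensitive as written, so `Hello` and `hello` count as different words unless a case-insensitive flag is added.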

CertainPerformance
  • Thank you very much for your help. I have tried your regular expression and it works well, but still not 100%. I'm not sure why this is the case and I'm completely out of my depth here. The example I'm trying it on is this BBC article – John Wayne Sep 19 '18 at 00:58
  • Many words which are duplicate are highlighted but EU isn't for example. Once again, I'm so grateful for the help. – John Wayne Sep 19 '18 at 00:59
  • Keep in mind that dot does not match newlines by default - if you don't want that, use `DOTALL`. EU on the same line looks to perform as intended: https://regex101.com/r/TcS1UW/4 – CertainPerformance Sep 19 '18 at 01:07
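
To make the newline point concrete, here is a small Java sketch. It is only an illustration: `DotallDemo` and `hasRepeat` are hypothetical names, the sample text is invented, and `Pattern.DOTALL` is Java's spelling of the flag (other regex engines name it differently).

```java
import java.util.regex.Pattern;

class DotallDemo {
    private static final String REGEX = "\\b(\\w+)\\b(?=.*\\b\\1\\b)";

    // True if some word in the text is repeated later, using the given flags.
    static boolean hasRepeat(String text, int flags) {
        return Pattern.compile(REGEX, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String twoLines = "EU leaders met\nthe EU summit";
        // Without DOTALL, '.' stops at the newline, so the second "EU" is never reached.
        System.out.println(hasRepeat(twoLines, 0));              // prints false
        // With DOTALL, '.' also matches '\n', so the lookahead spans both lines.
        System.out.println(hasRepeat(twoLines, Pattern.DOTALL)); // prints true
    }
}
```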

I would use Jsoup to get the text from the webpage. Then you could keep track of the counts using a HashMap, and then search the map for any number of occurrences you want:

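    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.jsoup.Jsoup;

    public class DuplicateWords
    {
        public static void main(String[] args) throws IOException
        {
            String url = "https://en.wikipedia.org/wiki/Jsoup";

            // Fetch the page and keep only the text of its <body>.
            String body = Jsoup.connect(url).get().body().text();

            // Count how often each whitespace-separated token occurs.
            Map<String, Integer> counts = new HashMap<>();
            for (String word : body.split("\\s+"))
            {
                counts.merge(word, 1, Integer::sum);
            }

            // Print every token that occurs at least twice.
            for (Map.Entry<String, Integer> entry : counts.entrySet())
            {
                if (entry.getValue() >= 2)
                {
                    System.out.println(entry.getKey() + " occurs " + entry.getValue() + " times.");
                }
            }
        }
    }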

You may need to clean up the map to get rid of some entries that aren't words, but this will get you most of the way.
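
One possible way to do that cleanup, sketched under assumptions that are not part of the code above: words are compared case-insensitively, and any run of characters that are not letters or digits is treated as a separator. `WordCounts` and `countWords` are hypothetical names, and the input is a plain string so the sketch runs without Jsoup:

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCounts {
    // Counts words case-insensitively, splitting on any run of characters
    // that are not letters or digits (an assumption made for this sketch).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // split can yield a leading empty string
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("Hello, food. hello again - HELLO!");
        counts.forEach((word, n) -> {
            if (n >= 2) {
                System.out.println(word + " occurs " + n + " times."); // hello occurs 3 times.
            }
        });
    }
}
```

This is deliberately crude: it splits contractions such as "don't" into "don" and "t", so the separator pattern would need adjusting for text where that matters.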

spectrum