Regex, removing duplicate non interrupted strings

Question

I recently tried writing a regex that deletes strings which directly follow each other without being interrupted by another string, so that only one copy remains. My work so far: https://regex101.com/r/Cs0bmY/7. It should work with all possible URLs, including ones that don't have www. in front of them or that have a different ending such as .com or .nl. The strings (a list of URLs) look like this:

operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
amazon.de
fonts.gstatic.com
fonts.gstatic.com
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com

The end result should look like this:

operator.livrareflori.md
amazon.de
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com

You can see that the duplicate strings which are not interrupted by another string are gone and only one copy of each remains.

What language/tool are you using? What did you try? What didn't work? What did you get? — Toto, Aug 28 '18 at 09:04
You may use [`^((?:https?://)?(?:www\.)?\S+\.\S+)\n(?=\1$)`](https://regex101.com/r/Cs0bmY/9) — Wiktor Stribiżew, Aug 28 '18 at 09:08
Possible duplicate of [Removing duplicate rows in Notepad++](https://stackoverflow.com/questions/3958350/removing-duplicate-rows-in-notepad) or [How do I find and remove duplicate lines from a file using Regular Expressions?](https://stackoverflow.com/questions/1573361/how-do-i-find-and-remove-duplicate-lines-from-a-file-using-regular-expressions) — bobble bubble, Aug 28 '18 at 10:13

score 1 · Answer 1 · answered Aug 28 '18 at 09:05

You can match

^(.+)$(?:\n\1)+

thus capturing the first line, and matching subsequent duplicate lines, and then replace everything matched with the first capture group:

\1

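(or the equivalent reference to the first group in whatever environment you're in, such as $1). Because \1 is not followed by $, a next line that merely starts with the captured text also matches; appending $ after \1 restricts the match to whole-line duplicates.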

https://regex101.com/r/Cs0bmY/8
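
The question does not name a language, so as a minimal sketch, assuming Python, the same replacement could be done with `re.sub` and the `re.MULTILINE` flag. The function name `collapse_runs` is made up for illustration, and the sketch adds a `$` after `\1` (not part of the pattern above) so that only whole-line repeats count as duplicates.

```python
import re

# The answer's pattern plus one extra "$" after the backreference, so a line
# that only starts with the captured text is not treated as a duplicate.
RUN_OF_DUPLICATES = re.compile(r"^(.+)$(?:\n\1$)+", re.MULTILINE)


def collapse_runs(text: str) -> str:
    """Replace each run of identical consecutive lines with a single copy."""
    return RUN_OF_DUPLICATES.sub(r"\1", text)


sample = "\n".join([
    "operator.livrareflori.md",
    "operator.livrareflori.md",
    "amazon.de",
    "fonts.gstatic.com",
    "fonts.gstatic.com",
    "fonts.gstatic.com",
    "erovoyeurism.net",
    "tugtechnologyandbusiness.com",
])

print(collapse_runs(sample))
```

Duplicates that are separated by a different line are left alone, which is what the question asks for.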

score 1 · Answer 2 · answered Aug 28 '18 at 09:11

Using Notepad++, you can do:

Ctrl+H
Find what: ^(.+)$(?:\R\1)+
Replace with: $1
check Wrap around
check Regular expression
DO NOT CHECK . matches newline
Replace all

Explanation:

^(.+)$      : group 1, a whole line
(?:         : non capture group
    \R      : any kind of line break
    \1      : backreference to group 1
)+          : group must appear 1 or more times

Replacement:

$1          : content of group 1

Result for given example:

operator.livrareflori.md
amazon.de
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com
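
Outside Notepad++, `\R` is not always available; Python's built-in `re` module, for instance, does not recognise it. As a rough alternative that avoids regex altogether (a swapped-in technique, not the method of this answer), the sketch below splits the text into lines and uses `itertools.groupby` to keep one line per run. The name `drop_adjacent_duplicates` is hypothetical.

```python
from itertools import groupby


def drop_adjacent_duplicates(text: str) -> str:
    """Keep the first line of every run of identical consecutive lines."""
    # splitlines() splits on \n, \r\n and \r (plus a few rarer separators).
    lines = text.splitlines()
    # groupby() yields one (key, run) pair per run of equal neighbours.
    return "\n".join(line for line, _run in groupby(lines))


windows_text = (
    "amazon.de\r\n"
    "fonts.gstatic.com\r\n"
    "fonts.gstatic.com\r\n"
    "fonts.gstatic.com\r\n"
    "erovoyeurism.net"
)
print(drop_adjacent_duplicates(windows_text))
```

Two differences from the regex are worth keeping in mind: the result always uses `\n` as the line separator, and consecutive empty lines are collapsed as well, whereas `(.+)` never matches an empty line.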

score 1 · Answer 3 · answered Aug 28 '18 at 09:12

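The trick is to capture the line and use a lookahead to verify that the same line occurs again later in the subject. The expression needs the multiline flag so that ^ and $ match at line boundaries. It matches every line that has a later duplicate, adjacent or not, so substituting with "" keeps only the last occurrence of each line: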

(?s)^((?:https?://)?(?:www\.)?\S+\.\S+)\n(?=.*^\1$)

https://regex101.com/r/Cs0bmY/10
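
To make the difference between "later anywhere" and "on the very next line" concrete, here is a small sketch, assuming Python. It compares this answer's pattern with the adjacent-only pattern from Wiktor Stribiżew's comment under the question. The constant and function names are invented for the example.

```python
import re

LINE = r"((?:https?://)?(?:www\.)?\S+\.\S+)"

# This answer: drop a line if the same line appears anywhere later.
# re.DOTALL plays the role of the inline (?s).
LATER_ANYWHERE = re.compile(r"^" + LINE + r"\n(?=.*^\1$)", re.MULTILINE | re.DOTALL)

# Comment under the question: drop a line only if the next line is identical.
NEXT_LINE_ONLY = re.compile(r"^" + LINE + r"\n(?=\1$)", re.MULTILINE)


def keep_last_occurrences(text: str) -> str:
    return LATER_ANYWHERE.sub("", text)


def collapse_adjacent(text: str) -> str:
    return NEXT_LINE_ONLY.sub("", text)


urls = (
    "amazon.de\n"
    "fonts.gstatic.com\n"
    "amazon.de\n"
    "fonts.gstatic.com\n"
    "fonts.gstatic.com\n"
)

print(keep_last_occurrences(urls))  # interrupted duplicates are removed too
print(collapse_adjacent(urls))      # only uninterrupted runs are collapsed
```

For the sample list in the question both variants give the same result, because every duplicate there is adjacent; they diverge only when a duplicate is interrupted by another line.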

score 1 · Answer 4 · answered Aug 28 '18 at 09:21

((?:https?://)?(?:www\.)?\S+\.\S+)\s(?=[\s\S]*\1)

You can try this. See the demo.

https://regex101.com/r/Cs0bmY/11
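
One thing to watch with this pattern is that neither the group nor the backreference is anchored to a line, so a line can be removed when its text merely appears inside a later, longer line. The sketch below, assuming Python and a made-up two-line input, shows that effect next to an anchored variant (my own adjustment, not part of this answer).

```python
import re

# Pattern from this answer, unchanged.
UNANCHORED = re.compile(r"((?:https?://)?(?:www\.)?\S+\.\S+)\s(?=[\s\S]*\1)")

# Hypothetical variant: require whole-line matches on both sides.
ANCHORED = re.compile(
    r"^((?:https?://)?(?:www\.)?\S+\.\S+)\n(?=[\s\S]*^\1$)",
    re.MULTILINE,
)

# Invented input: the first line is a substring of the second, not a duplicate.
text = "amazon.de\nshop.amazon.de.example\n"

print(repr(UNANCHORED.sub("", text)))  # first line is dropped
print(repr(ANCHORED.sub("", text)))    # text is left unchanged
```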

answered by vks
