1

I recently tried to write a regex that collapses runs of identical consecutive lines (duplicates that are not interrupted by a different line), so that only one copy of each run remains. My work so far: https://regex101.com/r/Cs0bmY/7 . It should work with all possible URLs, including ones that don't start with www. or that have a different ending such as .com or .nl. The strings (a list of URLs) look like this:

operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
operator.livrareflori.md
amazon.de
fonts.gstatic.com
fonts.gstatic.com
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com

The end result should look like this:

operator.livrareflori.md
amazon.de
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com

You can see that the duplicate strings which are not interrupted by another string are gone and only one copy of each remains.

birdTryingToCode
  • What language/tool are you using? What did you try? What didn't work? What did you get? – Toto Aug 28 '18 at 09:04
  • You may use [`^((?:https?://)?(?:www\.)?\S+\.\S+)\n(?=\1$)`](https://regex101.com/r/Cs0bmY/9) – Wiktor Stribiżew Aug 28 '18 at 09:08
  • Possible duplicate of [Removing duplicate rows in Notepad++](https://stackoverflow.com/questions/3958350/removing-duplicate-rows-in-notepad) or [How do I find and remove duplicate lines from a file using Regular Expressions?](https://stackoverflow.com/questions/1573361/how-do-i-find-and-remove-duplicate-lines-from-a-file-using-regular-expressions) – bobble bubble Aug 28 '18 at 10:13

4 Answers

1

You can match

^(.+)$(?:\n\1)+

thus capturing the first line, and matching subsequent duplicate lines, and then replace everything matched with the first capture group:

\1

(or the equivalent keyword for the first group in whatever environment you're in)

https://regex101.com/r/Cs0bmY/8
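
As a rough illustration of the match-and-replace idea above, here is a minimal Python sketch. The function name `collapse_runs` and the stricter pattern `RUNS_STRICT` are assumptions made for this example and are not part of the answer; the stricter variant only adds `$` after the backreference so that a repeat has to be a whole line rather than a prefix of the next line.

```python
import re

# Pattern from the answer; MULTILINE makes ^ and $ match at every line boundary.
RUNS = re.compile(r"^(.+)$(?:\n\1)+", re.MULTILINE)

# Hypothetical stricter variant (an assumption, not from the answer):
# the extra $ requires each repeated copy to end at a line end.
RUNS_STRICT = re.compile(r"^(.+)$(?:\n\1$)+", re.MULTILINE)


def collapse_runs(text, pattern=RUNS):
    """Replace each run of identical consecutive lines with a single copy."""
    return pattern.sub(r"\1", text)


sample = "\n".join([
    "operator.livrareflori.md",
    "operator.livrareflori.md",
    "operator.livrareflori.md",
    "amazon.de",
    "fonts.gstatic.com",
    "fonts.gstatic.com",
    "erovoyeurism.net",
    "tugtechnologyandbusiness.com",
])

print(collapse_runs(sample))
```

One thing to watch for: without the trailing `$`, a line such as `a.com` followed by `a.com.au` is treated as a repeat, and the first line is swallowed. Passing `pattern=RUNS_STRICT` leaves both lines untouched in that case.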

CertainPerformance
1

Using Notepad++, you can do:

  • Ctrl+H
  • Find what: ^(.+)$(?:\R\1)+
  • Replace with: $1
  • check Wrap around
  • check Regular expression
  • DO NOT CHECK . matches newline
  • Replace all

Explanation:

^(.+)$      : group 1, a whole line
(?:         : non capture group
    \R      : any kind of line break
    \1      : backreference to group 1
)+          : group must appear 1 or more times

Replacement:

$1          : content of group 1

Result for given example:

operator.livrareflori.md
amazon.de
fonts.gstatic.com
erovoyeurism.net
tugtechnologyandbusiness.com

Toto
1

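The trick is to capture the line and use a lookahead to verify that it appears again later in the subject. With the multiline flag enabled (so that ^ and $ match at line boundaries), this expression matches every line that has a later copy; replacing each match with an empty string keeps only the last occurrence. Note that this also removes duplicates that are separated by other lines, not just consecutive ones: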

(?s)^((?:https?://)?(?:www\.)?\S+\.\S+)\n(?=.*^\1$)

https://regex101.com/r/Cs0bmY/10
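
To make the keep-the-last-occurrence behaviour concrete, here is a small Python sketch that uses the expression above. The function name `drop_earlier_duplicates` is made up for this example, and `re.MULTILINE` is passed explicitly on the assumption that the regex101 demo has the multiline flag switched on.

```python
import re

# Expression from the answer. (?s) lets . cross newlines inside the lookahead;
# MULTILINE is needed so that ^ and $ match at each line boundary.
KEEP_LAST = re.compile(
    r"(?s)^((?:https?://)?(?:www\.)?\S+\.\S+)\n(?=.*^\1$)",
    re.MULTILINE,
)


def drop_earlier_duplicates(text):
    """Delete every line (plus its newline) that reappears later in the text."""
    return KEEP_LAST.sub("", text)


# Consecutive duplicates collapse to one copy.
print(drop_earlier_duplicates("a.com\na.com\nb.com"))

# A duplicate interrupted by another line is removed too; the last copy stays.
print(drop_earlier_duplicates("a.com\nb.com\na.com"))
```

In the second call the earlier `a.com` is dropped even though `b.com` sits between the two copies, so this approach does more than the question strictly asks for when the same URL recurs in separate runs.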

jaytea
1
((?:https?://)?(?:www\.)?\S+\.\S+)\s(?=[\s\S]*\1)

You can try this. See the demo:

https://regex101.com/r/Cs0bmY/11

vks