
Is it possible to remove duplicate lines in Notepad++ while ignoring punctuation marks and spaces? I would like to keep one of the matching lines (it doesn't matter which one).

My examples are from the txt file:

Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rough work, iconoclasm, but the only way to get the truth. Oliver Wendell Holmes

Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett

Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem 
Self-esteem isn't everything it's just that there's nothing without it. Gloria Steinem

You said she's a senior? Babe we're all crazy.
You said, she's a senior! Babe we're ALL crazy.
You said, she's a senior? Babe we're ALL crazy!

Result I need:

Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes

Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett

Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem 

You said, she's a senior! Babe we're ALL crazy.

I can delete exact duplicates with a regex, but I can't find a pattern that ignores spaces and punctuation marks.

3 Answers


I don't think regex is the best tool for this task, but it's a nice challenge. You can match single words using a nested structure like:

((\w+)\W+((\w+)\W+( ... ((\w+)\W+)? ... )?)?(\w*))

When matching this, capture groups 2 to n contain the words 1 to n-1 of a line. The nested structure is necessary to make it non-ambiguous - otherwise, running the regex takes too long.
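
As a small illustration of how the nested structure distributes words over capture groups, here is a sketch with only three levels of nesting, run in JavaScript rather than Notepad++. It assumes the inner optional groups are made non-capturing, so the group numbers differ slightly from the fully capturing form above: groups 2 to 4 are the word slots and group 5 is the final word. The sample string is made up for the demonstration.

```javascript
// Three levels of nesting: enough for lines of up to four words.
// Inner optional groups are non-capturing, so only words get group numbers.
const nested = /^((\w+)\W+(?:(\w+)\W+(?:(\w+)\W+)?)?(\w*))$/;

const m = "one, two three".match(nested);

// Group 1 is the whole line; the third word slot stays unset because the
// last word is always taken by the trailing (\w*).
console.log(m[2], m[3], m[4], m[5]); // one two undefined three
```

An unused word slot simply stays unset, which is why the same pattern can handle lines with fewer words than the nesting depth allows.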

To match the duplicate lines, we use a similar structure with back-references:

\1\W+(\2\W+( ... (\9\W+)? ... )?)?

This will also match lines that are substrings of the previous line, which is again helpful to improve performance.

Note that you have to use the \g{n} notation when using more than 9 references in Notepad++. Moreover, to avoid matching line breaks you should use [^\w\n\r] instead of \W. To further improve performance, groups that are not needed for back-references should be non-capturing, i.e., (?: ... ).

To generate the rather long regex that solves the problem for, e.g., up to 20 words per line, you can use the following script:

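// MAX_WORDS controls the nesting depth of the generated regex.
const MAX_WORDS = 20
// Separator: any character that is neither a word character nor a line break.
const punct = "[^\\w\\n\\r]"
// Notepad++ needs the \g{n} notation for more than 9 references.
const backref = (i) => `\\g{${i}}`
// Nested, optional word captures for the line that is kept.
const patternKeep = (i) => "(\\w+)[^\\w\\n\\r]+" + (i < 0 ? "" : `(?:${patternKeep(i-1)})?`)
// Matching nested back-references for the duplicate lines that follow.
const patternRemove = (i) => `${backref(MAX_WORDS-i + 2)}(?:${punct}+` + (i < 0 ? "" : patternRemove(i-1)) + ")?"
console.log("^(" + patternKeep(MAX_WORDS) + "(\\w*))(\\r?\\n" + patternRemove(MAX_WORDS)+ `${punct}*${backref(MAX_WORDS+4)}${punct}*)+$`)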

Pasting the generated regex into Notepad++ with "Wrap around" on and "Match case" off, and replacing with $1, removes all duplicate lines in your example.

Corylus
  • Thank you! Not sure, but looks like there [is an issue](https://regex101.com/r/HHE3PL/1). I marked those with ✖ (same start, but shorter / no duplicates imho). Btw. I also think regex is not well suited for this task but like you say it's challenging. – bobble bubble Jul 03 '22 at 18:46
  • Yes, that's what I meant with "it also matches substrings". From what I came up with, this simplified the regex and made it run faster and it seemed like a compromise that would be ok for most applications of such a regex. – Corylus Jul 04 '22 at 19:25

I doubt that it can be done purely with regular expressions. If it can then I imagine that the expression would be difficult to understand and difficult to maintain. Instead I would suggest a multi-step approach.

Step 1 - modify each line to be: original-line separator original-line.

Step 2 - convert it to be line-without-punctuation separator original-line.

Step 3 - sort the lines

Step 4 - remove duplicated lines

Step 5 - remove line-without-punctuation and separator leaving just the original line.

In more detail:

In all the replaces below: select "Wrap around", unselect "Dot matches newline", unselect "Match whole word only" and unselect "Match case".

Step 1 - choose a separator, some text that is not punctuation and does not occur in the file. Here I use qqq. Do a regular expression replace of ^(.+)$ with \1qqq\1.

Step 2 - remove any punctuation before the separator. Repeatedly do a regular expression replace of [!',-.:?]+(.*qqq) with \1 until no more replacements are made. This expression matches all the punctuation in the example, but you may need to add more for your full text. Multiple spaces also need to be reduced to single spaces, so repeatedly do a regular expression replace of [ ]{2,}(.*qqq) with a single space followed by \1 until no more replacements are made. One final step handles a space directly before the separator: do a regular expression replace of [ ]qqq with qqq (this could also be a non-regular-expression replace).

Step 3 - sort the lines lexicographically.

Step 4 - remove duplicated lines. Repeatedly do a regular expression replace of ^(.*qqq).*\R\1 with \1 until no more replacements are made.

Step 5 - Remove unwanted text leaving the original line. Do a regular expression replace of ^.*qqq with nothing (the empty string).
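
Outside Notepad++, the same idea (compare a normalised key, but keep the original line) can be sketched in a few lines of JavaScript. This is only a sketch under stated assumptions, not part of the steps above: the function names are hypothetical, the key drops every non-word character including spaces (slightly more aggressive than Step 2, so that "No. 1" and "No.1" compare equal), and a set of seen keys replaces the sort, so the first line of each group is kept and the original order is preserved.

```javascript
// Hypothetical helper: lowercase the line and drop every character that is
// not a letter, digit, or underscore (\w is ASCII-only in JavaScript).
function comparisonKey(line) {
  return line.toLowerCase().replace(/\W+/g, "");
}

// Keeps the first line for each comparison key, in the original order.
function dedupeIgnoringPunctuation(lines) {
  const seen = new Set();
  const kept = [];
  for (const line of lines) {
    const key = comparisonKey(line);
    if (key === "") {
      kept.push(line); // blank or punctuation-only lines pass through untouched
      continue;
    }
    if (seen.has(key)) continue; // same words seen before: drop this line
    seen.add(key);
    kept.push(line);
  }
  return kept;
}

const sample = [
  "Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett",
  "Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett",
  "You said she's a senior? Babe we're all crazy.",
  "You said, she's a senior! Babe we're ALL crazy.",
];
console.log(dedupeIgnoringPunctuation(sample).join("\n"));
```

Because the key ignores spaces entirely, two lines that differ only in where words are split (for example "a bc" and "ab c") would also be treated as duplicates; whether that is acceptable depends on the text.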


If it is acceptable to delete all punctuation, so that the result is lines without punctuation, then you could simply do a regular expression replace of [!',-.:? ]+ with a single space, then a sort, and finally remove duplicates.

AdrianHHH
  • Thanks for the interesting answer! I've just added a comment on top with the assumption that duplicate lines would be consecutive (as in OP's sample). Like you mentioned, certainly it won't be possible to sort and remove duplicate lines in only one step. – bobble bubble Jul 06 '22 at 17:30

Previously this question attracted an answer, but the author deleted it. To me it was so interesting because a special technique was illustrated. In a comment the answerer pointed me towards another thread to read more about it.

After experimenting a bit with that answer, I came up with the following pattern. In Notepad++, uncheck "Match case" and ". matches newline", and replace with the empty string.

^(?>[^\w\n]*(\w++)(?=.*\R(\2?+[^\w\n]*\1\b)))+[^\w\n]*\R(?=\2[^\w\n]*$)

Here is the demo on Regex101. The assumption is that duplicate lines are consecutive (as in the sample).

Most of the used regex-tokens can be looked up in the Stack Overflow Regex FAQ.


In short, the mechanism is to capture the words of one line, one at a time, into the first group (\w++), while inside the lookahead (?=.*\R(\2?+...\1\b)) a second group in the following line "grows" from itself plus each captured word, until \R(?=\2...$) either confirms that all words matched or the match fails.

Illustration of some steps from the regex101 debugger:

[Image: screenshot of steps from the regex101 debugger]

The second group holds the substring of the following line that matches the words, in order, of the previous line. At each repetition it expands from (optionally) itself plus one word from the previous line. The words may be separated by [^\w\n]*, i.e. any number of characters that are neither word characters nor newlines.

To make this work, the atomic group and the possessive quantifiers prevent backtracking at the crucial points.
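
The "growing" second group can be easier to follow when written out as a loop. The following JavaScript is a hedged procedural analogue of the mechanism, not a translation of the regex: the function names are hypothetical, the variable pos plays the role of the end of group 2, and a line is dropped when the line directly after it contains the same words in the same order, which mirrors the assumption that duplicates are consecutive.

```javascript
// True when lineB has the same words, in the same order, as lineA,
// ignoring case and any non-word characters between the words.
function sameWords(lineA, lineB) {
  const wordsA = lineA.match(/\w+/g) || [];
  if (wordsA.length === 0) return false; // lines without words are never duplicates
  const b = lineB.toLowerCase();
  let pos = 0; // end of the prefix of lineB matched so far
  for (const word of wordsA) {
    while (pos < b.length && !/\w/.test(b[pos])) pos++; // skip separators
    const w = word.toLowerCase();
    if (!b.startsWith(w, pos)) return false; // next word differs
    pos += w.length;
    if (pos < b.length && /\w/.test(b[pos])) return false; // word continues in lineB
  }
  return !/\w/.test(b.slice(pos)); // only separators may remain
}

// Drops every line whose direct successor has the same words,
// so the last line of each run of duplicates is kept.
function dropConsecutiveDuplicates(lines) {
  return lines.filter(
    (line, i) => i === lines.length - 1 || !sameWords(line, lines[i + 1])
  );
}

console.log(sameWords("Rule No. 1", "Rule No.1")); // true
console.log(sameWords("a b", "a b c"));            // false
```

Unlike a regex, the loop never needs to backtrack, because each word of the first line is consumed exactly once.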

halfer