I have 350 data files, each containing about 4,000 rows. About 3,000 of the rows are unique, but some rows are duplicated, e.g.:

"2021-02-02",20.1,99,0,3.4  
"2021-02-03",22.6,95,0,2.9  
"2021-02-04",18.8,90,0,5.2  
"2021-02-02",20.1,99,0,3.4  
"2021-02-03",22.6,95,0,2.9  
"2021-02-05",21.9,96,0.8,4.2  
"2021-02-06",20.8,95,0,3.3 

I would like to remove only the duplicate lines in each of the 350 files. However, the duplicate lines differ from file to file, i.e., some files may have other dates duplicated apart from the sample shown. The duplicates occur at random and in no particular order. I used Line Operations in Notepad++ to sort the lines in ascending order and then remove duplicates. That works fine for one file, but repeating the step 350 times will take a long time.

help-info.de
tingalee
  • You'd better write a script in your favorite scripting language. – Toto Aug 30 '22 at 17:19
  • There are several Notepad++ questions on this site about removing duplicated lines. Try searching for `[notepad++] remove duplicated lines` and similar phrases. As far as I know, none of them covers multiple files, although the Notepad++ "Find in Files" feature (see menu -> **Search** -> **Find in Files**), combined with the answers to those other questions, may do the trick. – AdrianHHH Aug 31 '22 at 10:09

1 Answer

As mentioned in the comments, a script in your favorite scripting language is the best way.
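A minimal sketch of such a script in Python, assuming all files sit in one directory and share a `.csv` extension (the extension and the function names `dedupe_lines` and `dedupe_directory` are illustrative and not taken from the question). It keeps the first occurrence of each line and drops later repeats, so the original order is preserved. Run it on a backup copy first.

```python
from pathlib import Path


def dedupe_lines(lines):
    """Return lines with later duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.rstrip()  # ignore trailing spaces and line endings when comparing
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


def dedupe_directory(directory, pattern="*.csv"):
    """Rewrite every matching file in place without duplicate lines."""
    for path in sorted(Path(directory).glob(pattern)):
        # newline="" leaves the file's original line endings untouched
        with path.open(newline="", encoding="utf-8") as handle:
            lines = handle.readlines()
        with path.open("w", newline="", encoding="utf-8") as handle:
            handle.writelines(dedupe_lines(lines))
```

Calling `dedupe_directory("data")` would process every `.csv` file in a folder named `data` (a hypothetical name); pass a different `pattern` for other extensions.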

That said, you can look at the screenshots below and adapt the Notepad++ approach to your needs.

I assume all of the files, or a batch of them, are in one directory. Make a backup copy before you test this.

  • Open one of the files in Notepad++
  • Open the Find dialog, e.g. with Ctrl+F
  • In Find what, enter: `^(.*?)$\s+?^(?=.*^\1$)`
  • Select Regular expression and tick ". matches newline"
  • Switch to the Find in Files tab, e.g. with Ctrl+Shift+F
  • Leave Replace with empty
  • Set Filters
  • Set Directory
  • Press Replace in Files (at your own risk!)
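To see which occurrences this expression removes, here is a rough reproduction with Python's `re` module on the sample from the question. Treat it as an approximation: Notepad++ uses a different regex engine, and the `re.MULTILINE` and `re.DOTALL` flags are assumed here as stand-ins for the "Regular expression" and ". matches newline" options.

```python
import re

# The Find what expression from the steps above
PATTERN = re.compile(r"^(.*?)$\s+?^(?=.*^\1$)", re.MULTILINE | re.DOTALL)

sample = "\n".join([
    '"2021-02-02",20.1,99,0,3.4',
    '"2021-02-03",22.6,95,0,2.9',
    '"2021-02-04",18.8,90,0,5.2',
    '"2021-02-02",20.1,99,0,3.4',
    '"2021-02-03",22.6,95,0,2.9',
    '"2021-02-05",21.9,96,0.8,4.2',
    '"2021-02-06",20.8,95,0,3.3',
]) + "\n"

# A line is deleted when an identical line appears further down,
# so the earlier copy is dropped and the later copy is kept.
result = PATTERN.sub("", sample)
print(result)
```

In this sketch the earlier copy of each repeated line is the one that gets deleted, so the output starts with the `2021-02-04` row and is no longer in date order.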

Before: [screenshot]

After: [screenshot]

help-info.de
  • Thank you for the suggestion; the regex worked as you've shown. It appears that the first two repeated lines were removed in your solution, so the first line is now "2021-02-04". This leaves the result out of chronological order. Is there a regex that deletes the subsequent repeated lines (e.g. lines 4 and 5) instead of lines 1 and 2 in my original post? I am grateful for your help. – tingalee Aug 31 '22 at 12:36
  • My answer is mainly meant to show how to process multiple files. The regex is explained in: https://stackoverflow.com/a/16293580/1981088. Maybe you can sort the files with [PowerShell](https://community.spiceworks.com/topic/2325686-sort-all-text-files-in-a-folder-with-powershell) after applying my answer. – help-info.de Aug 31 '22 at 15:56
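
As an alternative to the PowerShell route, the sorting step could also be sketched in Python. This assumes every line starts with a quoted ISO date (`"YYYY-MM-DD"`) as in the question's sample, so plain string comparison already gives chronological order. The function names and the `.csv` pattern are illustrative.

```python
from pathlib import Path


def sort_lines_by_date(lines):
    """Sort data lines by their first comma-separated field (a quoted ISO date)."""
    # Drop blank lines so they do not float to the top after sorting
    rows = [line.rstrip() for line in lines if line.strip()]
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings
    return sorted(rows, key=lambda row: row.split(",", 1)[0])


def sort_directory(directory, pattern="*.csv"):
    """Rewrite each matching file with its lines in date order."""
    for path in sorted(Path(directory).glob(pattern)):
        lines = path.read_text(encoding="utf-8").splitlines()
        sorted_rows = sort_lines_by_date(lines)
        path.write_text("".join(row + "\n" for row in sorted_rows), encoding="utf-8")
```

Because `sorted` is stable, rows that share a date keep their relative order.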