The following PowerShell command sequence does the trick:
$repeats = [Linq.Enumerable]::Count([System.IO.File]::ReadLines("<path to current dir>\\data.txt")) - 1; copy-item -path data.txt -destination work.txt; for ($i=1; $i -le $repeats; $i++) { (Get-Content -Raw work.txt) -replace '(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)\1', '$1$2' | Out-File result.txt; move-item -path result.txt -destination work.txt -force }; move-item -path work.txt -destination result.txt -force
Explanation
Scripting
For the discussion, the command line is split into one command per line. It is assumed that the original data is in `data.txt` and that a temp file `work.txt` can be used. `result.txt` will contain the result.
Basic idea:
- Design a regex using backreferences to express a repeated occurrence of a match.
- Repeatedly execute this regex. Each run removes one duplicate for each value in the first column.
- Conservatively estimate the max number of repetitions beforehand.
The solution is far from elegant and efficient (see the review section for some ideas).
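The basic idea can be sketched in Python, whose `re` engine supports the same `(?s)` and backreference constructs as .NET for this pattern (the sample data below is hypothetical):

```python
import re

# Hypothetical sample: three lines share the prefix '123.456.7890;'.
text = "123.456.7890;a\n123.456.7890;b\nx\n123.456.7890;c\n"

# Same pattern as in the answer: prefix, intervening lines, prefix clone.
pattern = re.compile(r'(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)\1')

# Conservative upper bound on the number of passes: no. of lines - 1.
repeats = text.count("\n") - 1
for _ in range(repeats):
    # Each pass drops one duplicate prefix per distinct prefix value.
    text = pattern.sub(r'\1\2', text)

print(text)  # 123.456.7890;a / b / x / c (one prefix left, remainders kept)
```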
Estimate the number of runs.
As we will see, each run removes one duplicate for each value in the first column. Thus, in the worst case (i.e. each line starting with the same prefix) this means `no. of lines - 1` runs. Determine that number and store it in the variable `$repeats`.
Credits: This line has been taken from another SO answer.
$repeats = [Linq.Enumerable]::Count([System.IO.File]::ReadLines("<path to current dir>\\data.txt")) - 1;
Clerical work: Copy original to work file
copy-item -path data.txt -destination work.txt;
Repeat the replacement `$repeats` times
for ($i=1; $i -le $repeats; $i++) {
Regex-based replacement.
- Match a line prefix + the remainder of the line + any number of lines without a prefix + the matched prefix occurring again.
- Clerical work: Rename the result file to the work file
Credits: Command to apply a regex to a text file taken from this SO answer
(Get-Content -Raw work.txt) -replace '(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)\1', '$1$2' | Out-File result.txt;
move-item -path result.txt -destination work.txt -force
};
Clerical work: move the last instance of the work file to the result file
move-item -path work.txt -destination result.txt -force
Regex
The regex dialect for PowerShell is .NET.
The challenge is the removal of each prefix copy while keeping the intervening material. One-time execution of a regex will not succeed as consecutive matches would overlap.
Step by step discussion:
a. Choose single-line matching.
Necessary since the matches will cross line boundaries.
(?s)
b. Prefix match pattern
Obviously this subpattern needs to be changed according to the actual prefix format. This form (3-3-4 decimal digit blocks separated with `.`) is derived from the example.
Note the trailing `;` and the parentheses, which define a capture group for matches of this subpattern. This capture group / match is referenced later.
(\d{3}\.\d{3}\.\d{4};)
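A quick sanity check of what this subpattern does and does not accept (Python `re`, same syntax for this fragment; the sample strings are made up):

```python
import re

prefix = re.compile(r'\d{3}\.\d{3}\.\d{4};')

assert prefix.fullmatch('123.456.7890;')       # 3-3-4 digit blocks + ';'
assert not prefix.fullmatch('123.456.7890')    # trailing ';' is required
assert not prefix.fullmatch('12.456.7890;')    # first block must have 3 digits
```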
c. Intervening text
Remainder of the line where the subexpression of b. matches + the line separator sequence + an arbitrary number of lines.
Due to the greedy ('match as much as you can') nature of repetition operators (`*`), this part alone would match the remainder of the file (assuming it ends with a line separator).
(([^\r\n]+[\r\n]+)*)
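This greedy behaviour can be observed in isolation; in the Python sketch below (hypothetical data), the group swallows everything up to the final line separator when no backreference follows:

```python
import re

text = "123.456.7890;a\nb\nc\n123.456.7890;d\n"

# Prefix + intervening-lines group, *without* the trailing backreference.
m = re.match(r'(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)', text)

# The greedy repetition consumes the remainder of the file,
# including the second prefix line.
print(repr(m.group(2)))  # 'a\nb\nc\n123.456.7890;d\n'
```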
d. Prefix clone
The prefix matched by the subexpression from b. must occur again for a replacement to take place. In fact, this matches the last clone of the prefix matched by b.
\1
As designed, the regex only detects clones at the beginning of a line.
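Both properties, backtracking to the last clone and the restriction to clones at a line start, can be checked with a small Python experiment (made-up data):

```python
import re

pattern = re.compile(r'(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)\1')

# The backreference forces backtracking: the clone at a line start is removed.
at_line_start = "123.456.7890;x\n123.456.7890;y\n"
assert pattern.sub(r'\1\2', at_line_start) == "123.456.7890;x\ny\n"

# A clone in the middle of a line is not detected; the text stays unchanged.
mid_line = "123.456.7890;x 123.456.7890;y\n"
assert pattern.sub(r'\1\2', mid_line) == mid_line
```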
Review
While it would be possible to match the whole set of prefix clones and their intervening strings in a pattern similar to the one given - basically opting for non-greedy ( 'match as little as you can' ) matching - I do not know of any way to drop precisely the prefix clones when specifying the replacement.
The number of repeats could be reduced by matching only consecutive lines with the same prefix, eliminating the second occurrence in each match. Thus there would be multiple matches / replacements per pass. Basically this reduces the number of iterations to log(no. of lines). It requires the modified regex to cater for one intervening line between two consecutive prefix occurrences. This modification should only be relevant for very large files.
The tabular form of the original file suggests that the data comes from a database or a spreadsheet. These work environments would be much better suited to fulfil the task at hand, so if there is any chance to modify the data before being dumped as a file that should be the preferred way to go.
More suitable tools allowing for some sort of column parsing and deduplication in the first column may be available in the form of appropriate powershell commands or command line tools.
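As a sketch of such a column-parsing approach (Python used as a stand-in; the function name and sample data are hypothetical), a single pass that splits off the first `;`-separated column and drops repeated prefixes is enough:

```python
def dedupe_first_column(lines):
    """Keep the first occurrence of each first-column prefix;
    strip the prefix from later occurrences and keep the rest."""
    seen = set()
    out = []
    for line in lines:
        prefix, sep, rest = line.partition(';')
        if sep and prefix in seen:
            out.append(rest)            # duplicate prefix: keep remainder only
        else:
            if sep:
                seen.add(prefix)
            out.append(line)            # first occurrence (or no prefix at all)
    return out

lines = ["123.456.7890;a", "123.456.7890;b", "x", "123.456.7890;c"]
print(dedupe_first_column(lines))  # ['123.456.7890;a', 'b', 'x', 'c']
```

Unlike the regex solution, this runs in a single pass regardless of how many duplicates a prefix has.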