The following PowerShell command sequence does the trick:
$repeats = [Linq.Enumerable]::Count([System.IO.File]::ReadLines("<path to current dir>\\data.txt")) - 1; copy-item -path data.txt -destination work.txt; for ($i=1; $i -le $repeats; $i++) { (Get-Content -Raw work.txt) -replace '(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)\1', '$1$2' | Out-File result.txt; move-item -path result.txt -destination work.txt -force }; move-item -path work.txt -destination result.txt -force
Explanation
Scripting
For the discussion, the command line is split into one command per line. It is assumed that the original data is in `data.txt` and that a temp file `work.txt` can be used. `result.txt` will contain the result.
Basic idea:
- Design a regex using backreferences to express a repeated occurrence of a match.
- Repeatedly execute this regex. Each run removes one duplicate for each value in the first column.
- Conservatively estimate the max number of repetitions beforehand.
The solution is far from elegant and efficient (see the review section for some ideas).
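The basic idea can be sketched in Python, whose `re` engine supports the same `(?s)` and backreference constructs as .NET for this pattern (the sample data below is hypothetical):

```python
import re

# Hypothetical sample: three lines share the prefix '123.456.7890;'.
text = "123.456.7890;a\n123.456.7890;b\nx\n123.456.7890;c\n"

# Same pattern as in the answer: prefix, intervening lines, prefix clone.
pattern = re.compile(r'(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)\1')

# Conservative upper bound on the number of passes: no. of lines - 1.
repeats = text.count("\n") - 1
for _ in range(repeats):
    # Each pass drops one duplicate prefix per distinct prefix value.
    text = pattern.sub(r'\1\2', text)

print(text)  # 123.456.7890;a / b / x / c (one prefix left, remainders kept)
```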
Estimate the number of runs.
As we will see, each run removes one duplicate for each value in the first column. Thus, in the worst case (i.e. each line starting with the same prefix) this means `no. of lines - 1` runs. Determine that number and store it in the variable `$repeats`.
Credits: This line has been taken from another SO answer.
$repeats = [Linq.Enumerable]::Count([System.IO.File]::ReadLines("<path to current dir>\\data.txt")) - 1;
Clerical work: Copy original to work file
copy-item -path data.txt -destination work.txt;
Repeat the replacement `$repeats` times
for ($i=1; $i -le $repeats; $i++) {
Regex-based replacement.
- Match a line prefix + the remainder of the line + any number of lines without a prefix + the matched prefix occurring again.
- Clerical work: Rename the result file to the work file
Credits: Command to apply a regex to a text file taken from this SO answer
(Get-Content -Raw work.txt) -replace '(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)\1', '$1$2' | Out-File result.txt;
move-item -path result.txt -destination work.txt -force
};
Clerical work: move the last instance of the work file to the result file
move-item -path work.txt -destination result.txt -force
Regex
The regex dialect for PowerShell is .NET.
The challenge is the removal of each prefix copy while keeping the intervening material. One-time execution of a regex will not succeed as consecutive matches would overlap.
Step by step discussion:
a. Choose single-line matching.
Necessary since the matches will cross line boundaries.
(?s)
b. Prefix match pattern
Obviously this subpattern needs to be changed according to the actual prefix format. This form (3-3-4 decimal digit blocks separated with `.`) is derived from the example.
Note the trailing `;` and the parentheses, which define a capture group for matches of this subpattern. This capture group / match is referenced later.
(\d{3}\.\d{3}\.\d{4};)
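A quick sanity check of what this subpattern does and does not accept (Python `re`, same syntax for this fragment; the sample strings are made up):

```python
import re

prefix = re.compile(r'\d{3}\.\d{3}\.\d{4};')

assert prefix.fullmatch('123.456.7890;')       # 3-3-4 digit blocks + ';'
assert not prefix.fullmatch('123.456.7890')    # trailing ';' is required
assert not prefix.fullmatch('12.456.7890;')    # first block must have 3 digits
```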
c. Intervening text
Remainder of the line where the subexpression of b. matches + the line separator sequence + an arbitrary number of lines.
Due to the greedy ('match as much as you can') nature of repetition operators (`*`), this part alone would match the remainder of the file (assuming it ends with a line separator).
(([^\r\n]+[\r\n]+)*)
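This greedy behaviour can be observed in isolation; in the Python sketch below (hypothetical data), the group swallows everything up to the final line separator when no backreference follows:

```python
import re

text = "123.456.7890;a\nb\nc\n123.456.7890;d\n"

# Prefix + intervening-lines group, *without* the trailing backreference.
m = re.match(r'(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)', text)

# The greedy repetition consumes the remainder of the file,
# including the second prefix line.
print(repr(m.group(2)))  # 'a\nb\nc\n123.456.7890;d\n'
```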
d. Prefix clone
The prefix matched by the subexpression from b. must occur again for a replacement to take place. In fact, this matches the last clone of the prefix matched by b.
\1
As designed, the regex only detects clones at the beginning of a line.
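Both properties, backtracking to the last clone and the restriction to clones at a line start, can be checked with a small Python experiment (made-up data):

```python
import re

pattern = re.compile(r'(?s)(\d{3}\.\d{3}\.\d{4};)(([^\r\n]+[\r\n]+)*)\1')

# The backreference forces backtracking: the clone at a line start is removed.
at_line_start = "123.456.7890;x\n123.456.7890;y\n"
assert pattern.sub(r'\1\2', at_line_start) == "123.456.7890;x\ny\n"

# A clone in the middle of a line is not detected; the text stays unchanged.
mid_line = "123.456.7890;x 123.456.7890;y\n"
assert pattern.sub(r'\1\2', mid_line) == mid_line
```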
Review
While it would be possible to match the whole set of prefix clones and their intervening strings in a pattern similar to the one given - basically opting for non-greedy ( 'match as little as you can' ) matching - I do not know of any way to drop precisely the prefix clones when specifying the replacement.
The number of repeats could be reduced by matching only consecutive lines with the same prefix, eliminating the second occurrence in each match. Thus there would be multiple matches / replacements per pass. Basically this reduces the number of iterations to log(no. of lines). It requires the modified regex to cater for one intervening line between two consecutive prefix occurrences. This modification should only be relevant for very large files.
The tabular form of the original file suggests that the data comes from a database or a spreadsheet. These work environments would be much better suited to fulfil the task at hand, so if there is any chance to modify the data before being dumped as a file that should be the preferred way to go.
More suitable tools allowing for some sort of column parsing and deduplication in the first column may be available in the form of appropriate powershell commands or command line tools.
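As a sketch of such a column-parsing approach (Python used as a stand-in; the function name and sample data are hypothetical), a single pass that splits off the first `;`-separated column and drops repeated prefixes is enough:

```python
def dedupe_first_column(lines):
    """Keep the first occurrence of each first-column prefix;
    strip the prefix from later occurrences and keep the rest."""
    seen = set()
    out = []
    for line in lines:
        prefix, sep, rest = line.partition(';')
        if sep and prefix in seen:
            out.append(rest)            # duplicate prefix: keep remainder only
        else:
            if sep:
                seen.add(prefix)
            out.append(line)            # first occurrence (or no prefix at all)
    return out

lines = ["123.456.7890;a", "123.456.7890;b", "x", "123.456.7890;c"]
print(dedupe_first_column(lines))  # ['123.456.7890;a', 'b', 'x', 'c']
```

Unlike the regex solution, this runs in a single pass regardless of how many duplicates a prefix has.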