I have two large files, each containing long strings separated by newlines. I need to find the similarities and differences between them, but the two files use different formats.

File a:

9217:NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE:dasda97sda9sdadfghgg789hfg87ghf8fgh87

File b:

NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE

So now I want to extract the whole line containing NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE from File a to a new file and also delete this line in File a.

I tried to do this with meld and got as far as having it show me only the similarities. Say File a has 3000 lines and File b has 120 lines: I want to find the lines with at least n consecutive identical characters and remove them from File a.

I found this and accordingly tried to use diff like this:

  diff  --unchanged-line-format='%L' --old-line-format='' \
  --new-line-format='' a.txt b.txt

This didn't do anything. I got no output whatsoever, so I guess it exited with 0 and didn't find anything.

How can I make this work? I have Linux and Windows available.

tamut
  • So, file `b` contains 120 strings, and you want to remove any line from file `a` that contains any of the 120 strings as substrings? – Mathias R. Jessen Aug 10 '19 at 20:13
  • That's exactly what I'm trying to do, yes – tamut Aug 10 '19 at 20:16
  • And the file `a` strings are always the same format? ie. `[some text or numbers]:[potential substring from b]:[more text or numbers]`? – Mathias R. Jessen Aug 10 '19 at 20:23
  • Yes, precisely. One thing to take into account, though, is that more `:` delimiters might follow: `[8549]:[NjA4NzMyedea212RELEVANTSUBSTRING5zz-JqDtvpAIkZq8oLiX2cBVI]:[irrelevantstring]:[irrelevantstring]:[irrelevantstring]` – tamut Aug 10 '19 at 20:30
  • I checked again: the additional `:` delimiters always follow, so the structure above is valid for all entries in the file – tamut Aug 10 '19 at 20:40

1 Answer

Given the format of the files, the most efficient implementation would be something like this:

  1. Load all b strings into a [hashtable] or [HashSet[string]]
  2. Filter the contents of a by:
    • Extracting the substring from each line with String.Split(':') or similar
    • Checking whether it exists in the set from step 1

  # Load every line of b into a set for fast exact-match lookups
  $FilterStrings = [System.Collections.Generic.HashSet[string]]::new(
      [string[]]@(
          Get-Content .\path\to\b
      )
  )

  Get-Content .\path\to\a |Where-Object {
      # Split the line into the prefix, middle, and suffix;
      # discard the prefix and suffix. The limit of 3 keeps any
      # additional ':' delimiters inside the discarded suffix.
      $null,$searchString,$null = $_.Split(":", 3)

      if($FilterStrings.Contains($searchString)){
          # we found a match, write the whole line to the new file
          $_ |Add-Content .\path\to\matchedStrings.txt

          # make sure it isn't passed through
          $false
      }
      else {
          # substring wasn't found to be in `b`, let's pass it through
          $true
      }
  } |Set-Content .\path\to\filteredStrings.txt
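
Since Linux is available too, the same set-lookup idea can be sketched with awk. This is only a minimal sketch under a few assumptions: the directory `demo_filter` and the file names `a.txt`, `b.txt`, `matched.txt` and `filtered.txt` are hypothetical, the second sample line of `a.txt` is made up for illustration, both files use Unix line endings, and `b.txt` is not empty (the `NR == FNR` idiom misbehaves on an empty first file).

```shell
#!/bin/sh
# Hypothetical scratch directory so the sample files stay out of the way.
mkdir -p demo_filter

# Sample data: the first line of a.txt and the line in b.txt come from the
# question; the second line of a.txt is invented for illustration.
cat > demo_filter/a.txt <<'EOF'
9217:NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE:dasda97sda9sdadfghgg789hfg87ghf8fgh87
8549:someOtherTokenThatIsNotInB:irrelevantstring:irrelevantstring
EOF
cat > demo_filter/b.txt <<'EOF'
NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE
EOF

# Make sure matched.txt exists even when nothing matches.
: > demo_filter/matched.txt

# While reading b.txt (NR == FNR), store each line as a key in an array.
# While reading a.txt, field 2 is the text between the first two ':'.
# Lines whose field 2 is a stored key go to matched.txt in full;
# every other line is printed to stdout, which is redirected to filtered.txt.
awk -F: '
    NR == FNR      { wanted[$0]; next }
    ($2 in wanted) { print > "demo_filter/matched.txt"; next }
                   { print }
' demo_filter/b.txt demo_filter/a.txt > demo_filter/filtered.txt
```

As with the PowerShell version, this compares the middle field for exact equality rather than searching for substrings, and it writes the filtered result to a new file instead of editing File a in place.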
Mathias R. Jessen