
I have 2 text files that are as follows (large numbers like 1466786391 being unique timestamps):

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

....

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391

and this:

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
PING 10.0.0.6 (10.0.0.6): 56 data bytes

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....

So the first file ends with the timestamp 1466786391, while the second file contains that same data block somewhere in the middle and has more data after it; everything before that timestamp is exactly the same as in the first file.

So the output I want is this:

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

....

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....

That is, concatenate the two files and create a third one, removing the duplicates from the second file (the text blocks that are already present in the first file). Here's my code:

public static void UnionFiles()
{ 

    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat");
    var union = Enumerable.Empty<string>();

    foreach (string filePath in Directory
                .EnumerateFiles(folderPath, "*.txt")
                .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        union = union.Union(File.ReadAllLines(filePath));
    }
    File.WriteAllLines(outputFilePath, union);
}

This is the wrong output I am getting (the file structure is destroyed):

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
round-trip min/avg/max = 5.475/40.986/96.964 ms
1466786492
round-trip min/avg/max = 5.276/61.309/112.530 ms

EDIT: This code was written to handle multiple files; however, I am happy even if just two can be handled correctly.

However, this doesn't remove the text blocks as it should; it removes several useful lines and makes the output utterly useless. I am stuck.

How can I achieve this? Thanks.

Jishan
  • `union = union.Union(File.ReadAllLines(filePath));` should this not create a boolean union, thereby removing the duplicate blocks? – Jishan Jun 28 '16 at 14:33
  • yes it should, I'm assuming a format (UTF8?) or white space problem? – Ouarzy Jun 28 '16 at 14:38
  • You need to actually _parse_ the files and extract the individual blocks for comparison as Ouarzy suggested. Everything else will lead to ugly, unmaintainable hacks. – Good Night Nerd Pride Jun 28 '16 at 15:55

3 Answers


I think you want to compare blocks, not individual lines.

Something like that should work:

public static void UnionFiles()
{
    var firstFilePath = "log1.txt";
    var secondFilePath = "log2.txt";

    var firstLogBlocks = ReadFileAsLogBlocks(firstFilePath);
    var secondLogBlocks = ReadFileAsLogBlocks(secondFilePath);

    var cleanLogBlock = firstLogBlocks.Union(secondLogBlocks);

    var cleanLog = new StringBuilder();
    foreach (var block in cleanLogBlock)
    {
        cleanLog.Append(block);
    }

    File.WriteAllText("cleanLog.txt", cleanLog.ToString());
}

private static List<LogBlock> ReadFileAsLogBlocks(string filePath)
{
    var allLinesLog = File.ReadAllLines(filePath);

    var logBlocks = new List<LogBlock>();
    var currentBlock = new List<string>();

    var i = 0;
    foreach (var line in allLinesLog)
    {
        if (!string.IsNullOrEmpty(line))
        {
            currentBlock.Add(line);
            if (i == 4)
            {
                logBlocks.Add(new LogBlock(currentBlock.ToArray()));
                currentBlock.Clear();
                i = 0;
            }
            else
            {
                i++;
            }
        }
    }

    return logBlocks;
}

With a log block defined as follows:

public class LogBlock
{
    private readonly string[] _logs;

    public LogBlock(string[] logs)
    {
        _logs = logs;
    }

    public override string ToString()
    {
        var logBlock = new StringBuilder();
        foreach (var log in _logs)
        {
            logBlock.AppendLine(log);
        }

        return logBlock.ToString();
    }

    public override bool Equals(object obj)
    {
        return obj is LogBlock && Equals((LogBlock)obj);
    }

    private bool Equals(LogBlock other)
    {
        return _logs.SequenceEqual(other._logs);
    }

    public override int GetHashCode()
    {
        var hashCode = 0;
        foreach (var log in _logs)
        {
            hashCode += log.GetHashCode();
        }
        return hashCode;
    }
}

Please be careful to override Equals in LogBlock and to have a consistent GetHashCode implementation, as Union uses both of them, as explained here.
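To see why both overrides matter, here is a minimal standalone sketch (using a trimmed-down Block class, not the full LogBlock above): Union hashes each element first and only calls Equals within a hash bucket, so an inconsistent GetHashCode makes equal blocks land in different buckets and the duplicates survive.

```csharp
using System;
using System.Linq;

public class Block
{
    public string[] Lines;
    public Block(string[] lines) { Lines = lines; }

    public override bool Equals(object obj) =>
        obj is Block other && Lines.SequenceEqual(other.Lines);

    // Must be consistent with Equals: equal blocks must produce
    // the same hash, or Union never even compares them.
    public override int GetHashCode() =>
        Lines.Aggregate(0, (acc, l) => unchecked(acc + l.GetHashCode()));
}

public static class UnionDemo
{
    public static void Main()
    {
        // Same content, two different instances.
        var x = new Block(new[] { "--- ping statistics ---", "1466786342" });
        var y = new Block(new[] { "--- ping statistics ---", "1466786342" });

        Console.WriteLine(new[] { x }.Union(new[] { y }).Count()); // prints 1: treated as duplicates
    }
}
```

Comment out the GetHashCode override and the same program prints 2, which is exactly the "Union keeps the duplicates" symptom discussed in the comments below.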

Ouarzy
  • No, I checked the MSDN sample app. It keeps the duplicates, a single copy of them. – Jishan Jun 28 '16 at 14:57
  • Thanks, I will test it now. Did you test it? – Jishan Jun 28 '16 at 15:17
  • 1
    Yes but I try to improve it thanks to your remark, still on it. – Ouarzy Jun 28 '16 at 15:21
  • Plus, it will work only if you have full blocks of 5 lines in each file. It won't work if you have only half a block, is it ok? – Ouarzy Jun 28 '16 at 15:22
  • Maybe not, will check and tell, brilliant code btw. – Jishan Jun 28 '16 at 15:25
  • 1
    Ok, fix little error in read as block. Also I agree with you according to the doc I should be able to remove the where filter and write: "var cleanLogBlock = firstLogBlocks.Union(secondLogBlocks);" instead. But it doesn't work and I still don't know why^^. Anyway you get the idea: organise your logs in blocks, and then compare them. – Ouarzy Jun 28 '16 at 15:43
  • 1
    Finally get it: my GetHashCode function was not consistent with the equals. Fixed and update the code. Hope it helps. – Ouarzy Jun 28 '16 at 15:57
  • OMG! thanks a million times. I wish I could give you more credits. Thanks again!!. – Jishan Jun 28 '16 at 16:00

A rather hacky solution using a regular expression:

var logBlockPattern = new Regex(@"(^---.*ping statistics ---$)\s+"
                              + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
                              + @"(^round-trip min/avg/max.+$)\s+"
                              + @"(^\d+$)\s*"
                              + @"(^PING.+$)?",
                                RegexOptions.Multiline);

var logBlocks1 = logBlockPattern.Matches(FileContent1).Cast<Match>().ToList();
var logBlocks2 = logBlockPattern.Matches(FileContent2).Cast<Match>().ToList();

var mergedLogBlocks = logBlocks1.Concat(logBlocks2.Where(lb2 => 
    logBlocks1.All(lb1 => lb1.Groups[4].Value != lb2.Groups[4].Value)));

var mergedLogContents = string.Join("\n\n", mergedLogBlocks);

The Groups collection of a regex Match contains each line of a log block (because in the pattern each line is wrapped in parentheses ()), with the complete match at index 0. Hence the group with index 4 is the timestamp, which we can use to compare log blocks.

Working example: https://dotnetfiddle.net/kAkGll
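A quick standalone illustration of the group indices (the sample string below is made-up data in the question's format): matching a single block and pulling out group 4 gives the timestamp line the merge compares on.

```csharp
using System;
using System.Text.RegularExpressions;

class GroupDemo
{
    public static void Main()
    {
        // Same pattern as in the answer above.
        var logBlockPattern = new Regex(@"(^---.*ping statistics ---$)\s+"
                                      + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
                                      + @"(^round-trip min/avg/max.+$)\s+"
                                      + @"(^\d+$)\s*"
                                      + @"(^PING.+$)?",
                                        RegexOptions.Multiline);

        var sample = "--- 10.0.0.6 ping statistics ---\n"
                   + "50 packets transmitted, 49 packets received, 2% packet loss\n"
                   + "round-trip min/avg/max = 20.917/70.216/147.258 ms\n"
                   + "1466786342\n";

        var m = logBlockPattern.Match(sample);
        Console.WriteLine(m.Groups[4].Value); // prints "1466786342"
    }
}
```

Note that the trailing PING group is optional (`?`), so a block at the end of a file that lacks the PING line still matches.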

Good Night Nerd Pride

There is an issue in concatenating unique records. Can you please check the code below?

public static void UnionFiles()
{
    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat");
    var union = new List<string>();

    foreach (string filePath in Directory
                .EnumerateFiles(folderPath, "*.txt")
                .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        var filter = File.ReadAllLines(filePath).Where(x => !union.Contains(x)).ToList();
        union.AddRange(filter);
    }
    File.WriteAllLines(outputFilePath, union);
}