0

As in .NET 6, 7 and so on, we have an Except method for lists:

List<int> A = new List<int>();
List<int> B = new List<int>();
List<int> C = A.Except(B).ToList();

My question is: how would one best go about a string version of the same method?

string A = "<div><p>One</p><p>Two</p></div>";
string B = "<div><p>One</p><p>Two</p><p>Three</p></div>";
string C = A.Except(B).ToString();

I expect the result to be <p>Three</p>.

Instead I get:

System.Linq.Enumerable+<ExceptIterator>d__73`1[System.Char]

What am I doing wrong?

EDIT:

Simply swapping the operands, so that the shorter string is excepted from the longer one:

var C = B.Except(A);

and then using Nick's new string(C.ToArray()) gives me:

hr

A slightly unexpected result after reversing the order.

Rusty Nail
  • What is .NET 6 and 7? I know the latest .NET Framework 4.7 and .NET Core 2.2. – dymanoid Jan 10 '19 at 14:43
  • I wonder which algorithm would produce `<p>Three</p>` from the two strings. You will need to think about this first, perhaps with pencil and paper. – Klaus Gütter Jan 10 '19 at 14:47
  • It seems that we have an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) with this question. Please tell us what you are **actually** trying to achieve? – dymanoid Jan 10 '19 at 14:50
  • Looking for the difference in the strings A and B. Thus the Except statement that works with Lists. – Rusty Nail Jan 10 '19 at 14:51
  • If you want to achieve an EXCEPT-like effect, you'll have to convert your strings to lists of substrings, and then apply EXCEPT. So you'd need to parse your HTML fragment into a list of `"<p>One</p>"`, `"<p>Two</p>"`, and so on. **Guillermo Gutiérrez** gives you the basics for that. – Ann L. Jan 10 '19 at 15:10
  • **Ciprian Vilcan** did a pretty good job of explaining why your original code doesn't work. – Ann L. Jan 10 '19 at 15:11

4 Answers

5

There are two issues with your solution.

Behavior of ToString()

Calling .ToString() on the IEnumerable returned by Except prints the name of its type. This is because the iterator object behind that IEnumerable does not override ToString(), so the default implementation is used, which returns the fully qualified type name. See ToString for more info on this.
If you'd like to convert an IEnumerable<char> (the return type of Except) to a string, you'll have to do

var C = new string(A.Except(B).ToArray());
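
As a minimal sketch of the point above (the class name CharSequenceDemo and the helper ToText are made up for illustration, and the sample strings are not from the question), the following shows two ways of turning the IEnumerable<char> into a string. The exact type name printed by ToString() depends on the runtime, so no output is shown for that line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CharSequenceDemo
{
    // Hypothetical helper: builds a string from the characters of a sequence.
    public static string ToText(IEnumerable<char> chars) => string.Concat(chars);

    public static void Main()
    {
        IEnumerable<char> diff = "hello".Except("lo");

        // The iterator does not override ToString(), so this prints a type name.
        Console.WriteLine(diff.ToString());

        // Either of these builds a string from the characters themselves.
        Console.WriteLine(new string(diff.ToArray())); // he
        Console.WriteLine(ToText(diff));               // he
    }
}
```

Both variants enumerate the sequence once; string.Concat simply avoids the intermediate array in the calling code.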


Behavior of A.Except(B)

The Except method doesn't quite work the way you think it does.

Take for example the following code:

var a = new List<int> { 1, 2, 3 };
var b = new List<int> { 2, 3, 4 };
var c = a.Except(b);

The result of this would be { 1 }. What the method effectively does is return a new enumeration of all ints that are present in a, but not in b.

Now, strings are just an enumeration of chars - more precisely, your

var A = "<div><p>One</p><p>Two</p></div>";

from LINQ's perspective, is equivalent to

var A = new List<char> { '<', 'd', 'i', 'v', '>', ..., '<', '/', 'd', 'i', 'v', '>' };

The same goes for B.

So, when you do A.Except(B), what LINQ actually does is go through each character of A and check whether it occurs in B. If it does, the character is left out of the result. Since every char in A also occurs in B, you get an empty string. To see that this is actually the case, slightly modify A so that it contains characters that are not in B:

string A = "<div><p>One</p><p>Two</p></div>ApplePie";

If you now do

string A = "<div><p>One</p><p>Two</p></div>ApplePie";
string B = "<div><p>One</p><p>Two</p><p>Three</p></div>";
string C = new string(A.Except(B).ToArray());

what you'll get is "AlP".
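
The character-level behavior can be checked with a small sketch like the one below. The class name CharExceptDemo and the helper ExceptChars are hypothetical; the strings are the ones from the question. The last call uses a made-up input to show that Except is a set operation, so repeated characters appear only once in the result.

```csharp
using System;
using System.Linq;

public static class CharExceptDemo
{
    // Hypothetical helper: character-level Except, materialized as a string.
    public static string ExceptChars(string first, string second) =>
        new string(first.Except(second).ToArray());

    public static void Main()
    {
        const string A = "<div><p>One</p><p>Two</p></div>";
        const string B = "<div><p>One</p><p>Two</p><p>Three</p></div>";

        string aMinusB = ExceptChars(A, B);                // every char of A occurs in B
        string bMinusA = ExceptChars(B, A);                // only 'h' and 'r' are new in B
        string withPie = ExceptChars(A + "ApplePie", B);
        string distinct = ExceptChars("aab", "b");         // duplicates collapse

        Console.WriteLine("[" + aMinusB + "]");  // []
        Console.WriteLine("[" + bMinusA + "]");  // [hr]
        Console.WriteLine("[" + withPie + "]");  // [AlP]
        Console.WriteLine("[" + distinct + "]"); // [a]
    }
}
```

The "hr" result is the same one reported in the question's edit: 'h' and 'r' are the only characters of B that never occur in A.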

Solution

In my opinion, the best way to do your except is to parse your strings, transform them into objects, and then do the Except. No built-in algorithm can tell that your strings are actually structured, or how to differentiate among their parts. As a working solution, using HtmlAgilityPack (a NuGet package):

using System;
using System.Linq;
using HtmlAgilityPack;

var docB = new HtmlDocument();
docB.LoadHtml(B);

var docA = new HtmlDocument();
docA.LoadHtml(A);

// Note that this computes whatIsInB.Except(whatIsInA). The reverse would yield nothing,
// because there is no <p> in A that is not also present in B.
var nodes = docB.DocumentNode.FirstChild.Descendants("p").Select(node => node.InnerHtml)
    .Except(docA.DocumentNode.FirstChild.Descendants("p").Select(node => node.InnerHtml));

var result = string.Join(Environment.NewLine, nodes); // results in "Three"
var otherResult = $"<p>{result}</p>"; // "<p>Three</p>"

I'll let you make a more general approach :)
But the idea is that if you want Except to work the way you expect, you'll have to ask it to work with strings, not chars.

Whether you do the parsing required to extract the components of your string (the <p> elements in this example) using HtmlAgilityPack or Regex, as suggested in other solutions, is entirely up to you.
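
One possible shape for a more general version, without an external package, is sketched below. It assumes the fragments are well-formed XML, which holds for the strings in the question but often not for real-world HTML (where HtmlAgilityPack remains the safer choice). The class name FragmentDiff and the helpers ChildElements and Added are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FragmentDiff
{
    // Hypothetical helper: serializes each direct child element of the root
    // (for example each <p>...</p>) back to a string, without reformatting.
    // Assumption: the markup is well-formed XML with a single root element.
    public static IEnumerable<string> ChildElements(string markup) =>
        XElement.Parse(markup)
                .Elements()
                .Select(e => e.ToString(SaveOptions.DisableFormatting));

    // Child elements of 'second' that do not occur in 'first'.
    public static List<string> Added(string first, string second) =>
        ChildElements(second).Except(ChildElements(first)).ToList();

    public static void Main()
    {
        const string A = "<div><p>One</p><p>Two</p></div>";
        const string B = "<div><p>One</p><p>Two</p><p>Three</p></div>";

        // Except now compares whole elements (strings), not single chars.
        Console.WriteLine(string.Join(Environment.NewLine, Added(A, B))); // <p>Three</p>
    }
}
```

Because the comparison is on the serialized element text, two elements count as equal only if their markup is identical, including attributes and whitespace.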

Ciprian Vilcan
1

When you use the Except() extension method on a string, the return type is IEnumerable<char>, not a string.

Documentation

Also, A.Except(B) will never produce what you want, because it treats each string as a sequence of chars. So, it will remove every char from A that is present in B.

You need to think of a different algorithm to do that.

andre
0

What you want is not Except. Except is a set operation, known as set difference or relative complement: it returns the elements of one set that are not present in another.

You can achieve the result that you expect with regular expression groups instead:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Input string.
        string input = "<div><p>One</p><p>Two</p><p>Three</p></div>";

        // Use named group in regular expression.
        Regex expression = new Regex(@"^<div><p>One</p><p>Two</p>(?<middle>[<>/\w]+)</div>$");

        // See if we matched.
        Match match = expression.Match(input);
        if (match.Success)
        {
            // Get group by name.
            string result = match.Groups["middle"].Value;
            Console.WriteLine("Middle: {0}", result);
        }

        // Done
        Console.ReadLine();
    }
}

With the regular expression ^<div><p>One</p><p>Two</p>(?<middle>[<>/\w]+)</div>$ you say that the string should start (^) with <div><p>One</p><p>Two</p> and end ($) with </div>, and that whatever lies in between, made up of <, >, / or word characters (\w) repeated one or more times (+), is captured in the group named middle.

However, I wouldn't recommend you to try to parse HTML with regex...
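
With that caveat in mind, the substring idea from the comments can still be sketched with a regex for the simple, non-nested fragments in the question: split each string into its <p> elements and apply Except to those strings. The class name ParagraphDiff and the helpers Paragraphs and Difference are hypothetical, and the pattern is only an assumption that fits this exact shape of input (no attributes, no nested <p>).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParagraphDiff
{
    // Matches one non-nested <p>...</p> element; the lazy quantifier keeps
    // each paragraph as its own match.
    private static readonly Regex Paragraph =
        new Regex(@"<p>.*?</p>", RegexOptions.Singleline);

    // Hypothetical helper: the <p> elements of the input, in document order.
    public static IEnumerable<string> Paragraphs(string html) =>
        Paragraph.Matches(html).Cast<Match>().Select(m => m.Value);

    // Paragraphs of 'second' that do not occur in 'first', concatenated.
    public static string Difference(string first, string second) =>
        string.Concat(Paragraphs(second).Except(Paragraphs(first)));

    public static void Main()
    {
        const string A = "<div><p>One</p><p>Two</p></div>";
        const string B = "<div><p>One</p><p>Two</p><p>Three</p></div>";

        Console.WriteLine(Difference(A, B)); // <p>Three</p>
    }
}
```

Unlike the anchored pattern above, this version does not hard-code the content of A, but it breaks as soon as the markup gains attributes or nesting.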

Guillermo Gutiérrez
-2

Use string C = new string (A.Except(B).ToArray());

Nick