1

I have incoming data that is in DataTable form. There are no static classes to fall back on. I have a 2 tables, customer and billing. There are 7000 customers, 1200 billing records.

All customer records have a "ResponsiblePartyID", multiple customers can have the same ID, which references the ID of the billing table.

DataTable customer= ETL.ParseTable("customer"); // 7000 records
DataTable billing= ETL.ParseTable("billing");   // 1200 records

var JoinedTables = (from c in customer.AsEnumerable()
            join p in billing.AsEnumerable() on (string) c["ResponsiblePartyID"] equals (string) p["ID"] into ps
            from p in ps.DefaultIfEmpty()
            select new {c, p}
        );

So this doesn't work as it should, even if it spit out the results in the wrong format I would be happy, but it only returns 2200 results rather than the 7000.

It seems like it would make sense if it returned only 1200, or if it returned all 7000, but 2200 is a weird place for it to stop.

I am manually parsing binary data as my datasource, I chose a DataTable as the destination because it seemed like the right way, but after dealing with Linq and trying to do joins I am wondering if I should rethink things.

It seems that Linq wasn't designed to query DataTables since I have to do the .AsEnumerable() on everything, and then .CopyToDataTable() as I get done with each step.

I don't have static classes defined for all my data because the properties of each value was defined already in the DataTable, so what is the "right" way of taking 2 DataTables, doing a LEFT JOIN (like in SQL) where the results on the left aren't excluded by the results on the right? If I start off with a table on the left with 7000 rows, I want to end up with 7000. If there are no matching records, fill it in with null.

I would like to not have to define every column, it should return a flattened Array / DataTable - something like this:

var JoinedTables = (from c in customer.AsEnumerable()
            join p in billing.AsEnumerable() on (string) c["ResponsiblePartyID"] equals (string) p["ID"] into ps
            from p in ps.DefaultIfEmpty()
            select ALL_COLUMNS
        );

UPDATE:

I used the sample from Jon Skeet's answer that was linked in the comments (Linq return all columns from all tables in the join) His solution really wasn't any different than my first attempt, it still doesn't address how to flatten the results into a single DataTable. Here is a sample of the data and the current output:

Customers
ID  Resp_ID Name
1   1   Fatafehi
2   2   Dan
3   1   Anthony
4   1   Sekona
5   1   Osotonu
6   6   Robert
7   1   Lafo
8   1   Sarai
9   9   Esteban
10  10  Ashley
11  11  Mitch
12  64  Mark
13  11  Shawn
14  53  Kathy
15  53  Jasmine
16  16  Aubrey
17  17  Peter
18  18  Eve
19  19  Brenna
20  20  Shanna
21  21  Andrea

Billing
ID  30_Day  60_Day
2   null    null
6   null    null
9   null    null
10  null    null
11  null    null
64  null    null
53  null    null
16  null    null
17  null    null
18  null    null
19  null    null
20  -36.52  null
21  1843.30 null

Output:
2   2   Dan 2      null   null  
6   6   Robert  6      null   null  
9   9   Esteban 9      null   null  
10  10  Ashley  10     null   null  
11  11  Mitch   11     null   null  
12  64  Mark    64  -131.20   null
13  11  Shawn   11     null   null  
14  53  Kathy   53     null   null  
15  53  Jasmine 53     null   null  
16  16  Aubrey  16     null   null  
17  17  Peter   17     null   null  
18  18  Eve 18     null   null  
19  19  Brenna  19     null   null  
20  20  Shanna  20   -36.52   null
21  21  Andrea  21  1843.30   null

Notice that anyone with Resp_ID of 1 is missing from the results. To show output I used the following and then inserted the null values for visualization:

foreach (var row in joinedRows)
{
    Console.WriteLine(row.r1["ID"] + " " + row.r1["Resp_ID"] + " " + row.r1["Name"] + " " + row.r2["ID"] + " " + row.r2["30_Day"] + " " + row.r2["60_Day"]);
}
Alan
  • 2,046
  • 2
  • 20
  • 43
  • One approach to solving a problem like this is to find a counterexample - that is, an example in the database of two records that you feel should be part of the join, but aren't. You can even create a simple set of test tables with only those two records in them and try to debug that situation, if closely looking at the counterexample doesn't answer your question alone. – Grismar Jan 31 '19 at 00:23
  • Check the [Jon's answer](https://stackoverflow.com/questions/3940304/linq-return-all-columns-from-all-tables-in-the-join) – NaDeR Star Jan 31 '19 at 00:42
  • Thanks guys, I updated my question with sample data – Alan Jan 31 '19 at 05:24

3 Answers3

2

So you have Customers and Billings. Every Customer has a primary key in Id and a foreign key to the Billing in RespId.

Several Customers can have the same value for this foreign key. Normally this would be a one-to-many relationship between Billings and Customers. However, some of your Customers have foreign key values that don't point to any Billing.

class Customer
{
    public int Id {get; set;}            // primary key
    ... // other properties

    // every Customer has exactly one Billing, using foreign key:
    public int RespId {get; set;}        // wouldn't BillingId be a better Name?
}
class Billing
{
    public int Id {get; set;}            // primary key
    ... // other properties
}

Now let's do some separation of concerns:

We separate the transformation of your DataTables to IEnumerable<...> from your LINQ handling. This will not only make your problem clearer to understand, but also make it better testable, re-usable and maintainable: if your DataTables change to for instance a Database, or a CSV file, you won't have to change your LINQ statements.

Create extension methods of DataTable to convert to IEnumerable and back. See extension methods Demystified

public static IEnumerable<Customer> ToCustomers(this DataTable table)
{
    ... // TODO: implement
}
public static IEnumerable<Billing> ToBillings(this DataTable table)
{
    ... // TODO: implement
}

public static DataTable ToDataTable(this IEnumerable<Customer> customers) {...}
public static DataTable ToDataTable(this IEnumerable<Billing> billings) {...}

You know DataTables better than I do, so I'll leave the coding to you. For more information: Convert DataTable to IEnumerable and Convert IEnumerable to DataTable

So now we have the following:

DataTable customersTable = ...
DataTable billingsTable = ...
IEnumerable<Customer> customers = customersTable.ToCustomers();
IEnumerable<Billing> billings = billingsTable.ToBillings();

We are ready to LINQ!

Your Linq Query

If there is a relationship between two sequences using a foreign key, and you do a full inner join, you won't get the Customers that have no matching Billing. If you do want them, you need a left-outer-join: Customers without a Billing will have some default value for Billing, usually null.

LINQ does not have a left-outer-join. You can find several solutions on Stackoverflow on how to mimic a left-outer-join. You can even write an Extension function for this.

public static IEnumerable<TResult> LeftOuterJoin<TLeft, TRight, TKey, TResult>(
    this IEnumerable<TLeft> leftCollection,     // the left collection
    IEnumerable<TRight> rightCollection,        // the right collection to join
    Func<TLeft, TKey> leftKeySelector,          // the function to select left key
    Func<TRight, TKey> rightKeySelector,        // the function to select right key  
    Func<TLeft, TRight, TResult> resultSelector // the function to create the result
    TRight defaultRight,                        // the value to use if there is no right key   
    IEqualityComparer<TKey> keyComparer)        // the equality comparer to use
{
    // TODO: exceptions if null input that can't be repaired
    if (keyComparer == null) keyComparer = EqualityComparer.Default<TKey>();
    if (defaultRight == null) defaultRight = default(TRight);

    // for fast Lookup: put all right elements in a Lookup using the right key and the keyComparer:
    var rightLookup = rightCollection
        .ToLookup(right => rightKeySelector(right), keyComparer);

    foreach (TLeft leftElement in leftCollection)
    {
         // get the left key to use:
         TKey leftKey = leftKeySelector(leftElement);
         // get all rights with this same key. Might be empty, in that case use defaultRight
         var matchingRightElements = rightLookup[leftKey]
             .DefaultIfEmtpy(defaultRight);
         foreach (TRight rightElement in matchingRightElements)
         {
             TResult result = ResultSelector(leftElement, rightElement);
             yield result;
         }
    }
}

To make this function more re-usable, create an overload without the keyComparer and defaultRight parameters:

public static IEnumerable<TResult> LeftOuterJoin<TLeft, TRight, TKey, TResult>(
    this IEnumerable<TLeft> leftCollection,     // the left collection
    IEnumerable<TRight> rightCollection,        // the right collection to join
    Func<TLeft, TKey> leftKeySelector,          // the function to select left key
    Func<TRight, TKey> rightKeySelector,        // the function to select right key    
    Func<TLeft, TRight, TResult> resultSelector)// the function to create the result

{    // call the other overload with null for keyComparer and defaultRight
     return LeftOuterJoin(leftCollection, rightCollection,
        leftKeySelector, rightKeySelector, restultSelector, 
        null, null);
}

Now that you've got this very re-usable function, let's create a function to left-outer-join your Customers and Billings:

public static IEnumerable<TResult> LeftOuterJoin<TResult>(
    this IEnumerable<Customer> customers,
    IEnumerable<Billing> billings,
    Func<Customer, Billing, TResult> resultSelector)
{
    return customers.LeftOuterJoin(billings,  // left outer join Customer and Billings
       customer => customer.RespId,           // from every Customer take the foreign key
       billing => billing.Id                  // from every Billing take the primary key
       // from every customer with matching (or default) billings
       // create one result:
       (customer, billing) => resultSelector(customer, billing));                                
}

You didn't specify what you want in the result, you'll have to write that function yourself:

 public static IEnumerable<CustomerBilling> LeftOuterJoinCustomerBilling(
    this IEnumerable<Customer> customers,
    IEnumerable<Billing> billings)
 {
      // call the LeftOuterJoin with the correct function to create a CustomerBilling, something like:
      return customers.LeftOuterJoin(billings,
    (customer, billing) => new CustomerBilling()
    {    // select the columns you want to use:
         CustomerId = customer.Id,
         CustomerName = customer.Name,
         ...

         BillingId = billing.Id,
         BillingTotal = billing.Total,
         ...
    });

Put everything together in a LINQ way

DataTable customersTable = ...
DataTable billingsTable = ...
IEnumerable<Customer> customers = customersTable.ToCustomers();
IEnumerable<Billing> billings = billingsTable.ToBillings();
IEnumerable<CustomerBilling> customerBillings = customers.ToCustomerBillings(billing);
DataTable customerBillingTable = result.ToDataTable();

Note, all functions except the last use deferred execution: nothing is enumerated until you call ToDataTable.

If desired, you can put everything together into one big LINQ statement. This won't speed up your process very much, however it will deteriorate readability, testability and maintainability.

Note that because we separated the way your data is saved from the way your data is processed, your changes will be minimal if you decide to save your data in CSV files, or a database, or if you want different values in a CustomerBilling, or if your Customer gets some extra fields.

Harald Coppoolse
  • 28,834
  • 7
  • 67
  • 116
  • If you say LINQ doesn't have a left-outer-join, what is the difference between a left-outer-join and the LINQ method `.GroupJoin()`? – Oliver Jan 31 '19 at 09:48
  • If you have one-to-many with Schools and Students, then GroupJoin will return "Schools with their Students", so [School 1 with every Student with a foreign key to School 1], [School 2 with every Student with a Foreign key to School 2]. Left outer join is the SelectMany after the GroupJoin: [School 1, Student A with SchoolId 1], [School 1, Student B with ShoolId 1], [School 2, Student C with SchoolId 2], [School 2, Student C with SchoolId 2], etc, – Harald Coppoolse Jan 31 '19 at 11:38
1

Given some example data is great, but given it in a format that can be copy/paste to use it, would be much better.

The relationship between your customers and the billings is one-to-many. Where many can be zero, one or multiple. Due to this fact, you have to use .GroupJoin() instead of .Join() (which is one-to-one relationship):

var customers = new[]
{
    new Customer{ Id = 1, Resp_Id = 1, Name = "Fatafehi" },
    new Customer{ Id = 2, Resp_Id = 2, Name = "Dan" },
    new Customer{ Id = 3, Resp_Id = 1, Name = "Anthony" },
    new Customer{ Id = 4, Resp_Id = 1, Name = "Sekona" },
    new Customer{ Id = 5, Resp_Id = 1, Name = "Osotonu" },
    new Customer{ Id = 6, Resp_Id = 6, Name = "Robert" },
    new Customer{ Id = 7, Resp_Id = 1, Name = "Lafo" },
    new Customer{ Id = 8, Resp_Id = 1, Name = "Sarai" },
    new Customer{ Id = 9, Resp_Id = 9, Name = "Esteban" },
    new Customer{ Id = 10, Resp_Id = 10, Name = "Ashley" },
    new Customer{ Id = 11, Resp_Id = 11, Name = "Mitch" },
    new Customer{ Id = 12, Resp_Id = 64, Name = "Mark" },
    new Customer{ Id = 13, Resp_Id = 11, Name = "Shawn" },
    new Customer{ Id = 14, Resp_Id = 53, Name = "Kathy" },
    new Customer{ Id = 15, Resp_Id = 53, Name = "Jasmine" },
    new Customer{ Id = 16, Resp_Id = 16, Name = "Aubrey" },
    new Customer{ Id = 17, Resp_Id = 17, Name = "Peter" },
    new Customer{ Id = 18, Resp_Id = 18, Name = "Eve" },
    new Customer{ Id = 19, Resp_Id = 19, Name = "Brenna" },
    new Customer{ Id = 20, Resp_Id = 20, Name = "Shanna" },
    new Customer{ Id = 21, Resp_Id = 21, Name = "Andrea" },
};

var billings = new[]
{
    new Billing{ Id = 2, Day30 = null, Day60 = null },
    new Billing{ Id = 6, Day30 = null, Day60 = null },
    new Billing{ Id = 9, Day30 = null, Day60 = null },
    new Billing{ Id = 10, Day30 = null, Day60 = null },
    new Billing{ Id = 11, Day30 = null, Day60 = null },
    new Billing{ Id = 64, Day30 = null, Day60 = null },
    new Billing{ Id = 53, Day30 = null, Day60 = null },
    new Billing{ Id = 16, Day30 = null, Day60 = null },
    new Billing{ Id = 17, Day30 = null, Day60 = null },
    new Billing{ Id = 18, Day30 = null, Day60 = null },
    new Billing{ Id = 19, Day30 = null, Day60 = null },
    new Billing{ Id = 20, Day30 = -36.52, Day60 = null },
    new Billing{ Id = 21, Day30 = 1843.30, Day60 = null },
};

var aggregate = customers.GroupJoin(
    billings, 
    customer => customer.Resp_Id, 
    billing => billing.Id, 
    (customer, AllBills) => new
    {
        customer.Id,
        customer.Resp_Id,
        customer.Name,
        AllBills
    });

foreach (var item in aggregate)
{
    Console.WriteLine($"{item.Id.ToString().PadLeft(2)}   {item.Resp_Id.ToString().PadLeft(2)}   {item.Name}");

    if(!item.AllBills.Any())
        Console.WriteLine("No bills found!");

    foreach (var bill in item.AllBills)
    {
        Console.WriteLine($"   {bill.Id.ToString().PadLeft(2)}   {bill.Day30}   {bill.Day60}");
    }

    Console.WriteLine();
}

Console.WriteLine("Finished");
Console.ReadKey();

Classes:

public class Customer
{
    public int Id { get; set; }
    public int Resp_Id { get; set; }
    public string Name { get; set; }
}

public class Billing
{
    public int Id { get; set; }
    public double? Day30 { get; set; }
    public double? Day60 { get; set; }
}
Oliver
  • 43,366
  • 8
  • 94
  • 151
0

Harald and Oliver have provided great answers, but I had discussed not having static classes. I am starting with a binary flat file database which is being parsed byte by byte into byte[] and after going through any binary conversions being added to DataRows using a JSON definition file along the way to determine datatypes. The result is an API that can query any flat file and return it into a DataTable that can then be queried without using static classes - which is then converted to JSON to be posted to a web API.

This way I can make rapid adjustments to my query and have the changes propagated without then also having to redefine static classes and complex relationships. I had originally planned on exporting to an SQLite database and then running queries on it before I found Linq last week.

Since I am still very new to Linq, I am learning a lot and having trouble identifying how to ask questions regarding data I am calling as .AsEnumerable() and then understanding how to modify answers that use static classes. While their answers are valuable, and may offer performance advantages, it didn't fit in with my use case because of flexibility requirements. Here is a stripped down version of what I used:

DataTable finalResults = ( from cus in customers.AsEnumerable()
    join bill in billing.AsEnumerable().DefaultIfEmpty() on  cus.Field<string>("Resp_ID")  equals age.Field<string>("ID")  into cs
    from c in cs.DefaultIfEmpty() 
    select new
    {
        reference_id = cus["CustomerId"],
        family_id = cus["Resp_ID"],
        last_name = cus["LastName"],
        first_name = cus["FirstName"],
        billing_31_60 = c == null ? "0" : c["billing_31_60"],
        billing_61_90 = c == null ? "0" : c["billing_61_90"],
        billing_over_90 = c == null ? "0" : c["billing_over_90"],
        billing_0_30 = c == null ? "0" : c["billing_0_30"]    
    }).CopyToDataTable();
Alan
  • 2,046
  • 2
  • 20
  • 43