I think there are several reasons why you (and the other answers and comments so far) are struggling with a solution. Primarily, the question as stated does not carry enough meta information to construct the relationships that the overall operation needs.
Absent Metadata
In looking at your inline LINQ example, specifically to quote:
from cust in customerList
join prod in productList on cust.ProductId equals prod.Id
join veh in vehicleList on prod.VehicleId equals veh.Id into v
from veh in v.DefaultIfEmpty()
select new {customerName = cust.Name, customerVehicle=veh.VehicleName}
... if we are to parse the knowledge that is inherently stated in the above code, we'll identify the following:
- There are 3 separate data sets (of non-homogeneous types, though this is more evident from your List<T> examples at the beginning of the question) that serve as sources of data. This meta information is available in the List<T> setups as sources to LINQ, and thus this part is not an issue.
- The join order and type of join (i.e. AND implies .Join() and OR implies .GroupJoin()). This meta information is more or less also available in the list setup.
- The relationship between the types, and the key used to compare one type to another. That is, that customer relates to product (as opposed to vehicle) and that the customer-product relationship is defined as Customer.ProductId = Product.Id; or that vehicle relates to product (as opposed to customer) and that this relationship is defined as Product.VehicleId = Vehicle.Id. This meta information is NOT available in the list setup presented in your question.
- Projection of the resulting (interim and final) data set members. The example does not specify whether each data set is represented by a unique model (i.e. whether each T across all the List<T>s is unique) or whether repeats are possible. Inline LINQ lets you reference a specific data set by name, so two data sets of the same type are not a problem when the query is defined statically: the relationship is clear. If a type can appear more than once, and type relationships are determined dynamically from metadata, you no longer know which of the several instances of that type to relate to. In other words, if it is possible to have Person join Friends join Person join Car, it is not clear whether Car should be matched to the first Person or the second. One possibility is to assume that in such cases the relationship resolves to the last instance of Person. Needless to say, your list setup doesn't have this meta information. For the purposes of this answer, I'll assume that all types are unique and do not repeat.
- Unlike the intersect example you referenced in the comments, where Intersect takes no parameters besides the other set to intersect with, the Join operator requires parameter(s) that identify the relationship by which to relate to the other data set. I.e. the parameter(s) are the meta information described in the third point above.
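To make the third point concrete, here is a rough sketch of the quoted query written in method syntax, where the key selectors that a dynamic solution would have to discover show up as explicit lambda arguments. The Customer, Product and Vehicle classes are hypothetical stand-ins for the question's models, and MethodSyntaxDemo.Run is a name made up for this illustration. Unlike the quoted query, the sketch uses veh?.VehicleName so that customers without a matching vehicle do not throw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model types; the question's real classes may differ.
public class Customer { public string Name; public int ProductId; }
public class Product  { public int Id; public int VehicleId; }
public class Vehicle  { public int Id; public string VehicleName; }

public static class MethodSyntaxDemo
{
    public static List<(string CustomerName, string CustomerVehicle)> Run(
        IEnumerable<Customer> customers,
        IEnumerable<Product> products,
        IEnumerable<Vehicle> vehicles)
    {
        return customers
            // Inner join: the two key selectors ARE the relationship metadata.
            .Join(products,
                  cust => cust.ProductId,
                  prod => prod.Id,
                  (cust, prod) => new { cust, prod })
            // Left outer join: GroupJoin, then flatten with DefaultIfEmpty.
            .GroupJoin(vehicles,
                       cp => cp.prod.VehicleId,
                       veh => veh.Id,
                       (cp, vehs) => new { cp.cust, vehs })
            .SelectMany(x => x.vehs.DefaultIfEmpty(),
                        (x, veh) => (CustomerName: x.cust.Name,
                                     CustomerVehicle: veh?.VehicleName))
            .ToList();
    }

    public static void Main()
    {
        var customers = new[] { new Customer { Name = "Ann", ProductId = 1 },
                                new Customer { Name = "Bob", ProductId = 2 } };
        var products  = new[] { new Product { Id = 1, VehicleId = 10 },
                                new Product { Id = 2, VehicleId = 99 } };
        var vehicles  = new[] { new Vehicle { Id = 10, VehicleName = "Truck" } };

        foreach (var row in Run(customers, products, vehicles))
        {
            var vehicle = row.CustomerVehicle ?? "(none)";
            Console.WriteLine(row.CustomerName + ": " + vehicle);
        }
    }
}
```

Nothing in the three lists themselves says that cust.ProductId pairs with prod.Id; that knowledge lives only in the lambdas, which is exactly what has to come from somewhere else in a dynamic setup.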
Metadata
To close the gaps identified above is not simple, but is not insurmountable either. One approach is to simply annotate the data model types with relationship meta data. Something along the lines of:
class Vehicle
{
public int Id;
}
// PrimaryKey="Id" - Id refers to Vehicle.Id, not Product.Id
[RelationshipLink(BelongsTo=typeof(Vehicle), PrimaryKey="Id", ForeignKey="VehicleId")]
class Product
{
public int Id;
public int VehicleId;
}
// PrimaryKey="Id" - Id refers to Product.Id, not Customer.Id
[RelationshipLink(BelongsTo=typeof(Product), PrimaryKey="Id", ForeignKey="ProductId")]
class Customer
{
public int Id;
public int ProductId;
}
This way, as you loop through the data sets while setting up the joins, you can use reflection to examine which type each data set is related to and how, look up the previous data sets for the matching data type, and, again using reflection, set up the key selectors of .Join or .GroupJoin to match the relationship between instances of data.
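As a minimal sketch of what that could look like, the block below declares a RelationshipLinkAttribute and reads it back with reflection. The attribute does not exist in any library; its shape is assumed from the annotation used above, and RelationshipReader.Read is a hypothetical helper. It also assumes that Product links to Vehicle (as the comment above the Product class indicates) and that keys are public fields rather than properties.

```csharp
using System;
using System.Reflection;

// Hypothetical attribute; the property names mirror the annotation above.
[AttributeUsage(AttributeTargets.Class, AllowMultiple = false)]
public sealed class RelationshipLinkAttribute : Attribute
{
    public Type BelongsTo { get; set; }
    public string PrimaryKey { get; set; }   // member name on BelongsTo
    public string ForeignKey { get; set; }   // member name on the annotated type
}

public class Vehicle { public int Id; }

[RelationshipLink(BelongsTo = typeof(Vehicle), PrimaryKey = "Id", ForeignKey = "VehicleId")]
public class Product { public int Id; public int VehicleId; }

public static class RelationshipReader
{
    // Returns the related type plus the key field on each side,
    // or null when the type carries no [RelationshipLink].
    public static (Type Related, FieldInfo PrimaryKey, FieldInfo ForeignKey)? Read(Type type)
    {
        var link = type.GetCustomAttribute<RelationshipLinkAttribute>();
        if (link == null) return null;

        return (link.BelongsTo,
                link.BelongsTo.GetField(link.PrimaryKey),  // e.g. Vehicle.Id
                type.GetField(link.ForeignKey));           // e.g. Product.VehicleId
    }

    public static void Main()
    {
        var rel = Read(typeof(Product)).Value;
        // prints: Product.VehicleId = Vehicle.Id
        Console.WriteLine("Product." + rel.ForeignKey.Name + " = "
                          + rel.Related.Name + "." + rel.PrimaryKey.Name);
    }
}
```

If your models use properties instead of fields, GetProperty and PropertyInfo would take the place of GetField and FieldInfo.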
Interim Projections
In static definitions of LINQ statements (be it using the inline join or the extension method .Join) you control what the result of the join looks like and how data is merged and transformed into a shape (aka another model) convenient for subsequent operations (usually by use of anonymous objects). With a dynamic setup, this is very difficult if not altogether impossible, because you'd need to know what to keep, what to drop, how to resolve name collisions between the data models' properties, etc.
To solve this issue, you can probably propagate all interim results (aka projections) as a Dictionary<Type, object>, and simply carry the full models through, each tracked by its type. The reason to track each model by its type is that when you join the previous interim result with the next data set and need to build the primary/foreign key functions, you have an easy means of looking up the type that you discover from the [RelationshipLink] metadata.
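A small sketch of such a row, assuming (as stated earlier) that no model type repeats. Wrap and Extend are hypothetical helper names, and the Customer and Product classes are placeholders. Extend copies the dictionary rather than mutating it, so that an interim row matching several items of the next data set does not end up shared between result rows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Placeholder models for illustration only.
public class Customer { public int Id; public int ProductId; }
public class Product  { public int Id; }

public static class RowDemo
{
    // Wrap each item of the first data set in a one-entry row keyed by its runtime type.
    public static IEnumerable<Dictionary<Type, object>> Wrap(IEnumerable<object> items) =>
        items.Select(item => new Dictionary<Type, object> { [item.GetType()] = item });

    // Copy the row and add the newly joined model, so rows never share state.
    public static Dictionary<Type, object> Extend(Dictionary<Type, object> row, object joined) =>
        new Dictionary<Type, object>(row) { [joined.GetType()] = joined };

    public static void Main()
    {
        var customer = new Customer { Id = 7, ProductId = 1 };
        var product  = new Product { Id = 1 };

        var row = Extend(Wrap(new object[] { customer }).Single(), product);

        // Each model can be fetched back by its type when building key selectors.
        var c = (Customer)row[typeof(Customer)];
        Console.WriteLine(row.Count + " models in the row, customer product id " + c.ProductId);
    }
}
```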
The final projection of the result, again, is not really stated in your question, but you need some way of dynamically determining which part of the very wide result you want (or all of it), and how to transform its shape into whatever the function consuming the results of the giant join expects.
Algorithm
Finally, we can put the whole thing together. The code below is just a high-level outline of the algorithm in C#-like pseudocode, not full C#. See the footnotes.
var datasets = GetListsOfDatasets().ToArray(); // i.e. the function that returns customerList, productList, vehicleList, etc. as a set of List<T>s
var joins = datasets.First().Select(item => new Dictionary<Type, object> { [item.GetType()] = item });
var joinTypes = new Queue<string>(stringList); // the "AND"/"OR" values that tell how to join the next data set, queued so we can dequeue from the front. Better to make it an enum rather than a string.
foreach (var dataset in datasets.Skip(1))
{
    var outerKeyMember = GetPrimaryKeyMember(dataset.GetGenericEnumerableUnderlyingType());
    var innerKeyMember = GetForeignKeyMember(dataset.GetGenericEnumerableUnderlyingType());
    var joinType = joinTypes.Dequeue();
    joins = joinType == "AND"
        ? joins.Join(
            dataset,
            outerKey => ReflectionGetValue(outerKeyMember.Member, outerKey[outerKeyMember.Type]),
            innerKey => ReflectionGetValue(innerKeyMember.Member, innerKey),
            // Copy the row so that an outer row matching several inner items does not share one dictionary
            (outer, inner) => new Dictionary<Type, object>(outer) { [inner.GetType()] = inner })
        : joins.GroupJoin(/* similar key selection as above; the result selector returns one row per matched inner item */)
            .SelectMany(i => i); // Flatten the per-outer groups back into a single IEnumerable<T>
}
var finalResult = joins.Select(v => /* TODO: whatever you want to project out, and however you dynamically want to determine what you want out */);
/////////////////////////////////////
public static Type GetGenericEnumerableUnderlyingType<T>(this IEnumerable<T> source)
{
    return typeof(T);
}
public TypeAndMemberInfo GetPrimaryKeyMember(Type type)
{
    // TODO
    // Using reflection, examine type, look for RelationshipLinkAttribute, and read the PrimaryKey specified on the attribute.
    // Then reflect over the type declared in BelongsTo and find the member named by PrimaryKey.
    return new TypeAndMemberInfo { Type = __belongsToType, Member = __relationshipLinkAttribute.PrimaryKey.AsMemberInfo };
}
public TypeAndMemberInfo GetForeignKeyMember(Type type)
{
    // TODO Very similar to GetPrimaryKeyMember, but for this type and this type's foreign key annotation marker.
}
public object ReflectionGetValue(MemberInfo member, object instance)
{
    // TODO Using reflection, read the value of member from instance and return it.
}
So the high-level idea is that you take the first data set and wrap each member of the set in a dictionary keyed by the member's type and holding the member instance itself. Then, for each subsequent data set, you discover its underlying model type and use reflection to look up the relationship metadata that tells you how to relate it to another type. That other type must already have been exposed by a previously processed data set, or the code will blow up because the join won't have anything to get key values from. For the outer key, you look up the instance of the related type in the outer enumerable's dictionary and read the value of the discovered key member from that instance. For the inner key, you similarly reflect over the inner item and read the value of its foreign key member. .Join does the rest of the joining. Keep looping to the end, with each iteration's projection carrying the full instances of each model.
Once done with all data sets, define what you want out of the result using .Select with whatever definition you want, and execute the complex LINQ to pump the data.
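For those who want something to run, here is a compact sketch of the loop described above. It is one possible reading of the algorithm, not a complete implementation, and it rests on several assumptions: the RelationshipLinkAttribute, JoinKind enum and DynamicJoin.Run are all made up for this sketch; keys are public fields; every type appears once; and the data sets are passed in an order where each annotated type comes after the type it belongs to (Vehicle, then Product, then Customer). Each data set is passed together with its element type to sidestep the problem of discovering T from a heterogeneous collection of lists.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

[AttributeUsage(AttributeTargets.Class)]
public sealed class RelationshipLinkAttribute : Attribute   // hypothetical
{
    public Type BelongsTo { get; set; }
    public string PrimaryKey { get; set; }
    public string ForeignKey { get; set; }
}

public class Vehicle { public int Id; public string VehicleName; }

[RelationshipLink(BelongsTo = typeof(Vehicle), PrimaryKey = "Id", ForeignKey = "VehicleId")]
public class Product { public int Id; public int VehicleId; }

[RelationshipLink(BelongsTo = typeof(Product), PrimaryKey = "Id", ForeignKey = "ProductId")]
public class Customer { public string Name; public int ProductId; }

public enum JoinKind { Inner, LeftOuter }   // stands in for the "AND"/"OR" strings

public static class DynamicJoin
{
    public static List<Dictionary<Type, object>> Run(
        IList<(Type ElementType, IEnumerable<object> Items)> datasets, IList<JoinKind> kinds)
    {
        // Seed: wrap every item of the first data set in a row keyed by its type.
        IEnumerable<Dictionary<Type, object>> rows = datasets[0].Items
            .Select(item => new Dictionary<Type, object> { [datasets[0].ElementType] = item });

        for (int i = 1; i < datasets.Count; i++)
        {
            var (type, items) = datasets[i];
            var link = type.GetCustomAttribute<RelationshipLinkAttribute>()
                       ?? throw new InvalidOperationException(type.Name + " has no [RelationshipLink].");
            FieldInfo outerKey = link.BelongsTo.GetField(link.PrimaryKey);
            FieldInfo innerKey = type.GetField(link.ForeignKey);

            // Outer key: primary key of the related model already in the row.
            // A null key (related model absent) matches nothing in Join/GroupJoin.
            Func<Dictionary<Type, object>, object> outerSel = row =>
                (row.TryGetValue(link.BelongsTo, out var o) && o != null) ? outerKey.GetValue(o) : null;
            Func<object, object> innerSel = item => innerKey.GetValue(item);

            rows = kinds[i - 1] == JoinKind.Inner
                ? rows.Join(items, outerSel, innerSel,
                            (row, item) => new Dictionary<Type, object>(row) { [type] = item })
                : rows.GroupJoin(items, outerSel, innerSel,
                                 (row, matches) => matches.DefaultIfEmpty()   // null when nothing matched
                                     .Select(item => new Dictionary<Type, object>(row) { [type] = item }))
                      .SelectMany(r => r);
        }
        return rows.ToList();
    }

    public static void Main()
    {
        var datasets = new List<(Type, IEnumerable<object>)>
        {
            (typeof(Vehicle),  new object[] { new Vehicle { Id = 10, VehicleName = "Truck" } }),
            (typeof(Product),  new object[] { new Product { Id = 1, VehicleId = 10 } }),
            (typeof(Customer), new object[] { new Customer { Name = "Ann", ProductId = 1 } }),
        };
        foreach (var row in Run(datasets, new[] { JoinKind.Inner, JoinKind.LeftOuter }))
        {
            var customer = (Customer)row[typeof(Customer)];
            var vehicle  = (Vehicle)row[typeof(Vehicle)];
            Console.WriteLine((customer?.Name ?? "(nobody)") + " -> " + vehicle.VehicleName);
        }
    }
}
```

The key values travel as boxed objects, which works here because boxed integers compare by value. The final .Select is left to the caller, who receives the full rows and can pick models out of them by type.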
Performance Considerations
To perform a join, at least one of the data sets must be fully read so that its keys can be probed while the other data set is processed for matches.
Modern DB engines like SQL Server are able to join extremely large data sets because they go the extra step of persisting interim results to disk rather than building up everything in memory, and pull from disk as needed. As such, joining billions of items to billions of items does not blow up due to memory starvation: once memory pressure is identified, the interim data and matched results are temporarily persisted to tempdb (or whatever disk storage backs memory).
Here, the default LINQ .Join is an in-memory operator. A large enough data set will exhaust memory and cause an OutOfMemoryException. If you foresee processing many joins that result in very large data sets, you may need to write your own implementations of .Join and .GroupJoin that use some sort of disk paging to store one data set in a format that can be easily probed for membership when matching items from the other set, so as to relieve the memory pressure by using disk in place of memory.
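To show where the memory goes, here is a simplified sketch of an in-memory hash join in the spirit of what .Join does for LINQ to Objects. HashJoin is a hypothetical helper written for this answer, not the framework's actual implementation (for one, its handling of null keys differs). The lookup over the inner sequence is the part a custom implementation would have to replace with some disk-backed structure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashJoinSketch
{
    public static IEnumerable<TResult> HashJoin<TOuter, TInner, TKey, TResult>(
        IEnumerable<TOuter> outer, IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKey, Func<TInner, TKey> innerKey,
        Func<TOuter, TInner, TResult> result)
    {
        // The whole inner sequence is read and held in memory here.
        ILookup<TKey, TInner> lookup = inner.ToLookup(innerKey);

        // The outer sequence is streamed one element at a time and probed
        // against the lookup; a missing key yields an empty sequence.
        foreach (var o in outer)
            foreach (var i in lookup[outerKey(o)])
                yield return result(o, i);
    }

    public static void Main()
    {
        var orders = new[] { 1, 2, 3 };
        var lines  = new[] { 2, 3, 3, 4 };

        foreach (var pair in HashJoin(orders, lines, o => o, l => l, (o, l) => o + "-" + l))
            Console.WriteLine(pair);
    }
}
```

Only the inner side is buffered, so when you control the join order it can help to put the smaller data set on the inner side.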
Voila!
Footnotes
First, because your question (sans comments) is asked in the domain of simple LINQ (meaning IEnumerable, and not IQueryable, SQL or stored procs), I have limited the scope of the answer strictly to that domain to follow the spirit of the question. This is not to say that, at a higher level, this problem doesn't lend itself well to a solution in some other domain.
Second, even though SO rules call for good, compilable, working code in answers, the reality is that this solution is probably at least a few hundred lines of code, many of them doing reflection. How to do reflection in C# is, obviously, beyond the scope of the question. As such, the code presented is pseudocode that focuses on the algorithm, reducing the non-pertinent parts to comments describing what happens and leaving the implementation to the OP (or those finding this useful in the future).