2

Here is my class:

public class Record
{
   public string PersonName {get; set;}
   public string RequestID {get; set;}
}

I have a db table related to this class And I'm pulling all of it to memory at start. And I'm trying to find a relation between two people with the following algorithm :

  1. I pick two people(I mean records from that list, there can be multiple records of both properties)
  2. I list first person's RequestIDs
  3. For each RequestID, I list users with the same RequestID
  4. For each user, I check if that user has the same RequestID with the second user.
  5. If found, break and do stuff.

Here is my implementation of above algorithm :

foreach(var elem in listofFirstPerson)
{
  List<Record> listofRelatedPeople = RecordList.Where(r => r. RequestID == elem.RequestID).ToList(); //I actually get distinct records from actual list and the distinct version count is about 100k

  foreach(var relatedPerson in listofRelatedPeople )
  {
    List<Record> listofRecordsforRelatedPerson = RecordList.Where(r => r. PersonName ==  relatedPerson.PersonName).ToList();

    for(int i = 0; i < listofRecordsforRelatedPerson.Count; i++)
    {
      for(int j = 0; j < listofSecondPerson.Count; j++)
      {
       if(listofRecordsforRelatedPerson[i].RequestID ==listofSecondPerson[j].RequestID)
        //break all loops and do stuff
      }
    }
  }
}

This algorithm works. But it is incredibly slow. As I mentioned listofRelatedPeople is about 100k and it iterates only a few hundreds of records in about 20 seconds. How can I make this algorithm faster? Is there a faster approach? Thanks in advance.

EDIT :

In my list there are records like this :

  • Name : "Jason" RequestID : "1"
  • Name : "Larry" RequestID : "1"
  • Name : "Kevin" RequestID : "7"
  • Name : "Joshua" RequestID : "4"
  • Name : "Tom" RequestID : "1"
  • Name : "Tom" RequestID : "7"
  • Name : "Ben" RequestID :"7"

Suppose I pick Jason and Kevin, as you see their Request IDs are not same so I need to find a relation between them. So I list users with the same RequestID and they are Larry and Tom. Then I get all records with Larry and I see that he doesn't have a record with the same RequestID with Kevin. Therefore I go to Tom, I see that Tom has the same RequestID with Kevin, so I pick Tom and it is done.

jason
  • 6,962
  • 36
  • 117
  • 198
  • 1
    You can use Task Parallel Library to process the list. There will be multiple threads looping though your list of records at the same time – Varun Mehta May 18 '17 at 13:48
  • He can't do plinq or tasking because he needs to break all loops when certain criteria is met. Let's be honest though, you are basically creating 4 sub loops here. If you have a need for this than your overall data structure is fudged and there lies the real problem you need to address. – Travis Acton May 18 '17 at 13:54
  • @TravisActon My data structure is a list. What else would you suggest? I'm open to new algorithm suggestions as well. – jason May 18 '17 at 14:00
  • Could you please at least fix up your code? I'm pretty sure are a few syntax errors in there. – s.m. May 18 '17 at 14:04
  • @s.m. Sorry, I fixed it. – jason May 18 '17 at 14:08
  • Does it *really* compile for you? – s.m. May 18 '17 at 14:09
  • @s.m. What else is wrong? – jason May 18 '17 at 14:10
  • How can `listofRecordsforRelatedPerson.RequestID` compile? How can `public class RequestID;` compile? Please, come up with a complete example that one can cut and paste into VS and run and improve. – s.m. May 18 '17 at 14:12
  • @s.m. I'm sorry, I fixed them too. – jason May 18 '17 at 14:15
  • Converting the results of the Linq to a list unnecessarily iterates over the entire IEnumerable returned by it. I would skip using ToList and instead use the IEnumerable directly. So your for loops should be foreach loops. If you need i and j for later you could compute them yourself as you loop. – Mike May 18 '17 at 14:33
  • @Mike so converting all Lists to IEnumerable would solve the problem? – jason May 18 '17 at 14:38
  • Rather than plunging right into the algorithm you conceived, describe the goal you are trying to achieve - *all (unique?) pairs of names A and B where there is a name C and requests Ra and Rb such that there a records (A, Ra), (C, Ra), (C, Rb) and (B, Rb)*. related: [two-hop-friends](http://stackoverflow.com/a/41228509/3789665). – greybeard May 20 '17 at 16:12
  • @greybeard As I said in the question, I'm trying to find a relation between two people via another person. For example your request ID is 1 and mine is 2, so we are not directly related. But there is another guy with request ID equals to 1 and he has also another request ID with 2. So we can have a relation via that guy. I need to find that guy. Thanks. – jason May 20 '17 at 16:54
  • Can you provide more context? The code is far from perfect, but shouldn't be so slow except if it's called from some sort of a outer loop. Can you provide a more realistic use case? – Ivan Stoev May 22 '17 at 07:45
  • @IvanStoev Have you seen my edit? My aim is pretty much in the edit. I just need to find the middle person between two people. The middle person needs to have RequestIDs of both of the people. That's all. – jason May 22 '17 at 07:48
  • @IvanStoev There is no outer loop, it works that slow in this piece of code... – jason May 22 '17 at 07:49
  • Thanks, I think I see it now. – Ivan Stoev May 22 '17 at 11:47

5 Answers5

2

The way I understand it, your current algorithm can be expressed in LINQ as follows:

static Record FirstRelated(List<Record> records, string firstName, string secondName)
{
    var listofFirstPerson = records.Where(r => r.PersonName == firstName).ToList();
    var listofSecondPerson = records.Where(r => r.PersonName == secondName).ToList();
    var result = (
        from r1 in listofFirstPerson // (1)
        from r2 in records //(2)
        where r2.RequestID == r1.RequestID
        from r3 in records // (3)
        where r3.PersonName == r2.PersonName
        from r4 in listofSecondPerson // (4)
        where r4.RequestID == r3.RequestID
        select r2
    ).FirstOrDefault();
    return result;
}

So basically you have 4 nested loops. If we designate

N = records.Count
M1 = listofFirstPerson.Count
M2 = listofSecondPerson.Count

then the time complexity of the algorithm will be O(M1 * N * N * M2), which with big N is normal to cause performance issues.

Looking at the above implementation, it can be noticed that the same result can be achieved by consolidating (1) with (2), (3) with (4) and correlating the resulting sets by PersonName:

var firstRelated =
    from r1 in listofFirstPerson
    from r2 in records
    where r2.RequestID == r1.RequestID
    select r2;
var secondRelated =
    from r4 in listofSecondPerson
    from r3 in records
    where r3.RequestID == r4.RequestID
    select r3;
var result = (
    from r1 in firstRelated
    from r2 in secondRelated
    where r2.PersonName == r1.PersonName
    select r1
).FirstOrDefault();

So far we didn't improve anything - the algorithm is still the same quadratic time complexity. But it gives us the idea - since now the firstRelated and secondRelated are independent, there is no need to execute secondRelated for each record of the firstRelated, so instead we can prepare a fast hash lookup data structure from secondRelated in advance (with average O(1) lookup time complexity), and use it when iterating once the firstRelated, leading to much better O(M1 * N) time complexity (almost like eliminating the cost of the last two inner loops in your code which are causing the slowness).

Also note that we don't need to build the two initial lists anymore because we are going to process the firstRelated and secondRelated only once.

So the final solution would be something like this:

var firstRelated =
    from r1 in records
    where r1.PersonName == firstName
    from r2 in records
    where r2.RequestID == r1.RequestID
    select r2;
var secondRelated =
    from r4 in records
    where r4.PersonName == secondName
    from r3 in records
    where r3.RequestID == r4.RequestID
    select r3;

and now either use the LINQ join operator to do the efficient correlation for us:

var result = (
    from r1 in firstRelated
    from r2 in secondRelated
    where r2.PersonName == r1.PersonName
    select r1
).FirstOrDefault();

or do it manually by preparing and using a HashSet with PersonName from the secondRelated:

var secondRelatedNames = new HashSet<string>(secondRelated.Select(r => r.PersonName));
var result = firstRelated.FirstOrDefault(r => secondRelatedNames.Contains(r.PersonName));
Ivan Stoev
  • 195,425
  • 15
  • 312
  • 343
1

Both ".Where()" and ".ToList()" are quite slow operations.

You can map your "RecordList" to a two dictionaries one with "RequestID" as key another with "PersonName". Do it before forech. This should run much faster.

var dictionary1 = RecordList.GroupBy(f => f.RequestID).ToDictionary(f => f.Key, v => v.ToArray());
var dictionary2 = RecordList.GroupBy(f => f.PersonName).ToDictionary(f => f.Key, v => v.ToArray()); 

and then inside foreach you can use them as

var listofRelatedPeople = dictionary1[elem.RequestID];
var listofRecordsforRelatedPerson= dictionary2[relatedPerson.PersonName];

Of course, if there is a possibility that key will not exist, better to use dictionary1.TryGetValue()

UPDATE

If you need it C# way, one of the solution could be:

var recordList = new Record[]
{
    new Record() {RequestID = "1", PersonName = "User1"},
    new Record() {RequestID = "2", PersonName = "User1"},
    new Record() {RequestID = "3", PersonName = "User2"},
    new Record() {RequestID = "1", PersonName = "User2"},
    new Record() {RequestID = "4", PersonName = "User3"},
    new Record() {RequestID = "5", PersonName = "User3"},
    new Record() {RequestID = "1", PersonName = "User4"},
    new Record() {RequestID = "6", PersonName = "User4"},
    new Record() {RequestID = "7", PersonName = "User5"},
    new Record() {RequestID = "1", PersonName = "User5"},

};
var dictionary1 = recordList.GroupBy(f => f.RequestID).ToDictionary(f => f.Key, v => v.Select(z=>z.PersonName).ToArray());
var dictionary2 = recordList.GroupBy(f => f.PersonName).ToDictionary(f => f.Key, v => v.Select(z => z.RequestID).ToArray());

var rec1 = dictionary2["User1"]; //all requestsIds for User1
var rec2 = dictionary2["User2"]; //all requestsIds for User2

var ids = rec1.Intersect(rec2).Distinct();  //only request ids exists for both users in same time

foreach (var id in ids)
{
    var users = dictionary1[id];
    if (users.Length > 2)
        break;
    //users = User1, User2, User4, User5
}

UPDATE 2

SQL version (MSSQL) this will work a WAY much faster then C#

CREATE TABLE #tmp (ID varchar(max), Name varchar(max))
INSERT INTO #tmp (ID, Name) 
SELECT '1', 'User1' UNION ALL
SELECT '2', 'User1' UNION ALL
SELECT '3', 'User2' UNION ALL 
SELECT '1', 'User2' UNION ALL 
SELECT '4', 'User3' UNION ALL 
SELECT '5', 'User3' UNION ALL 
SELECT '1', 'User4' UNION ALL 
SELECT '6', 'User4' 


SELECT C.Name 
FROM #tmp A
INNER JOIN #tmp B ON A.ID = B.ID
INNER JOIN #tmp C ON A.ID = C.ID
WHERE A.Name = 'User1' and B.Name = 'User2' AND C.Name NOT IN ('User1', 'User2')

Response will be "User4"

Artyom
  • 1,065
  • 8
  • 12
  • Could you post an example? – jason May 18 '17 at 14:00
  • And i feel like this entire algorithm is not optimal as well. Can you post some examples of what are you trying to achieve. "listofFirstPerson", "listofRelatedPeople" , "listofSecondPerson" are they all have different source of data? – Artyom May 18 '17 at 14:21
  • They have the same source. – jason May 18 '17 at 14:22
  • Are you just trying to find two or more users with same RequestID? – Artyom May 18 '17 at 14:30
  • I'm trying to do this: User1 has connection with User2 via UserX by RequestID Y. I'm trying to find UserX and RequestID Y. – jason May 18 '17 at 14:33
  • Maybe you can try to go this way, get "user1" from "dictionary2", get "user2" from "dictionary2", this will give you two list of "RequestID", for both users, find intersection. Then find user from "dictionary1" by any ID from that intersection. Is it what you was looking for? – Artyom May 18 '17 at 14:40
  • Can you give an example? Thanks. – jason May 22 '17 at 06:42
  • Can you post an example data you have? Like for example 3-5 records where you actually see connection between user1 and user2 via userX – Artyom May 22 '17 at 07:05
  • Thanks for examples, is C# side mandatory to do this? Coz it really looks like DB side it will be much faster and easier. – Artyom May 22 '17 at 07:38
  • No, it's not mandatory. If you can show me how to do it with SQL and how I can use it in C#, it would be great. – jason May 22 '17 at 07:39
  • Probably this C# option is not the best, if you really need C# side, we could think out even better solution. Give me couple of minutes i will post SQL version too. – Artyom May 22 '17 at 07:48
  • Your C# approach helped me to find middle person. But it works slowly when I increase complexity a little bit. I mean I'm looking for another person who has the same RequestID with both middle person(the one you helped me to find) and the second person. Do you have any idea about how this could be done? Thank you very much. – jason May 22 '17 at 14:52
  • Actually C# code return you multiple middle persons (if i understand what do you need, you have not only "user4" but also "user5"). Regarding the SQL way, you just need one more same inner join. – Artyom May 22 '17 at 14:58
  • How does C# give me multiple middle persons? Could you please give with an example? – jason May 22 '17 at 15:00
  • it is already there, "var users = dictionary1[id];" this "users" array will contain all users with same requestId. So basically "user1" and "user2", connected via "user4" and/or "user5" – Artyom May 22 '17 at 15:02
1

I think you should let the database do the work, it will be much faster.

The query would look something like this:

SELECT
    r2.*
FROM
    record r1 INNER JOIN
    record r2 ON
        r2.requestId = r1.requestId AND
        r2.personName = 'NAME 2'
WHERE
    r1.personName = 'NAME 1'

We request all the requestIds of person 1 and see if it matches any of those from person 2.

maraca
  • 8,468
  • 3
  • 23
  • 45
  • Thank you for your answer. But I have only one table. It's the first table you wrote, "record" table. I'm doing all of the work in that table, I'm finding relations from that table too. – jason May 21 '17 at 05:17
  • Why should there be unique IDs? Wouldn't distinct parameter be enough? And can the query be somehow parametrized? – jason May 21 '17 at 10:20
  • And If you need any more clarification, I can write. The question seems pretty clear to me :/ – jason May 21 '17 at 10:21
  • And finally why joining tables is more efficient? I have only one table. Should I join the table with itself? – jason May 21 '17 at 10:22
  • @jason 1. if the combination of personName and requestId is unique then it is sufficient to uniquely identify a record yes. 2. Yes I'm sure C# has also a mechanism like prepared statements in Java, parametrizing should be no problem. 3. Yes this is what the query does. It is more efficient because database are designed to answer queries fast. Similar to asking why a chainsaw cuts wood faster than a knife, the chainsaw is specifically designed to cut wood fast, while the knife is a multi-purpose tool which can be used to cut wood (but not as fast). – maraca May 21 '17 at 12:46
  • It would seem to be [a bit more complicated](http://stackoverflow.com/questions/44049570/working-on-a-list-takes-too-much-time/44090425#comment75198846_44049570), but *two* INNER JOINs should do the trick. (And yes, given appropriate indexes, trying to better the DB engine sounds misguided.) – greybeard May 21 '17 at 20:16
1

The grouping could be done in one pass. The advantage of this is not only that it is faster because it's a single pass, but if you are doing LINQ to DB then it will be executed on the server by the DB, reducing the amount of data sent to your client and speeding up the process by using indexes etc.

        var source = new List<Record> { };
        var grouped = source
            .GroupBy(x => x.RequestID)
            //Only groups with more than one entry
            .Where(x => x.Count() > 1);

        //Loop through the data like so
        foreach(var group in grouped)
        {
            Console.WriteLine("Request: " + group.Key);
            foreach(Record record in group)
                Console.WriteLine("  " + record.PersonName);
        }

If you want the PersonName property to be some kind of unique identifier so that you can eliminate cases where the same person exists more than once per RequestID you can do this

        var source = new List<Record> { };
        var grouped = source
            .GroupBy(x => x.RequestID)
            //Select a key + only unique names
            .Select(x => new { Key = x.Key, Data = x.Select(r => r.PersonName).Distinct()})
            //Only groups with more than 1 entry
            .Where(x => x.Data.Count() > 1);

        //To loop through the data
        foreach(var group in grouped)
        {
            Console.WriteLine("Group: " + group.Key);
            foreach(var item in group.Data)
            {
                Console.WriteLine("  " + item.PersonName);
            }
        }
Peter Morris
  • 20,174
  • 9
  • 81
  • 146
  • Thank you for the answer. After grouping, what should I do? How can I iterate on that grouped list? – jason May 21 '17 at 10:40
  • Updated my answer to give examples. I typed them straight into the browser, so let me know if they don't compile so I can correct the source. – Peter Morris May 21 '17 at 11:14
  • The grouping is not applicable because in my `listofRelatedPeople` list the records are already unique and all of their `RequestID` is the same. – jason May 22 '17 at 06:37
  • @jason Have you tried it? It will let you work from a single list and produce a data set that tells you every person that has each RequestID. – Peter Morris May 22 '17 at 09:53
0

ums maybe a little thing but, u create use recordlist to much. you have created new objects but didn't use it. so you didn't do anything with your where clause.

List<Record> listofRecordsforRelatedPerson = RecordList.Where(r => r. PersonName  

--> change recordlist to relatedPeople and in your two for loops use:

for(int j = 0; j <= listofSecondPerson.Count; j++)
{...} 

otherwise for example if your count is 25 anj = 25 it will do nothing lol

Mong Zhu
  • 23,309
  • 10
  • 44
  • 76
Casper
  • 1