0

i have 2 datatables each datatable has 300,000 rows (datatables imported from 2 excel worksheets with OLEDB).

the first datatable is 'dtTosearch' and the second datatable is 'sourceDt'.

here is an example for the 2 tables:

datatables example

i need to find for each row in 'untagged' column (sourceDt) a match to each row in 'token' column (dtTosearch). the match conditions are:

  1. 'site' value = 'site' value
  2. 'cat' value = 'category' value
  3. untagged value must contain token value
  4. if all the conditions above exist for more then one match so the query must returen the match of the token with the maximum length. (and this is the tricky part that i didnt figure out how to perform with linq)
  5. this task must run in minimum process time - because its the requirement, because that's more professional and because i made a bet with friend-colleague-JAVA enthusiastic developer that .NET will run faster (plain as day (-:)

ive added the relevant parts of the code, it works fine but not in the way i want, i want to improve processing time, look at the foreach loop in linqQuery() function - i will be thankful if you help me to replace that loop by expanding my query to condition number 4, the loop operate for condition 4 because the result of the linq query ordered by 'token' length in descending order so it will exit and return the result with the maximum row length.

Private Sub startScanning()
    Dim siteNum As Double
    Dim categoryNum As Double
    Dim stringToSearchin As String

    For i = 0 To sourceDt.Rows.Count - 1
        siteNum = sourceDt.Rows(i).Item(0)
        categoryNum = sourceDt.Rows(i).Item(1)
        stringToSearchin = sourceDt.Rows(i).Item(3)
       Debug.WriteLine( linqQuery(siteNum, categoryNum, stringToSearchin) & " " &
           stringToSearchin)
    Next
End Sub

Private Function linqQuery(ByVal sitenum As Double, ByVal cat As Double,
                           ByVal stringToSearchIn As String) As String

    Dim query = From row In dtTosearch
                Where (row.Field(Of Double?)("site") = sitenum And
                row.Field(Of Double?)("category") = cat)
                Order By row.Field(Of String)("token").Length Descending
                Select New With {
                    .token = row.Field(Of String)("token")
         }


    For Each x In query
        If stringToSearchIn.Contains(x.token) Then
            Return x.token
            Exit Function
        End If
    Next

    Return ""
End Function

Thank you very much for your time and consideration!

Olivier Jacot-Descombes
  • 104,806
  • 13
  • 138
  • 188
Jonathan Applebaum
  • 5,738
  • 4
  • 33
  • 52
  • hi @gnicolas i tagged also C# because its dosent matter to me if an example will be C# or VB – Jonathan Applebaum Mar 27 '16 at 18:23
  • If pseudocode is ok: I think you need a group join here. You want to join every record from table1 with many records from table2. `var joinedTables = from r1 in table1 join r2 in table2 on new {r1.site, r1.cat , r1.untagged.Contains(r2.token)} equals new {r2.site, r2.category, true} into possibleMatches select new {row1 = r1, row2 = possibleMatches.MaxBy(p => p.token.Length)}` Then you have a flat ienumerable of your joined records. Sorry about the formatting I'm on mobile. – gnicholas Mar 27 '16 at 18:38
  • 1
    A LINQ-join will be much faster than a For loop. The For loop variant results in comparing each row of one table to each row of the other one O(m*n). A join on the other hand uses a hash table internally reducing the costs to approximately O(n). – Olivier Jacot-Descombes Mar 27 '16 at 18:43
  • tnx @gnicolas i understand the concept, the problem with linq to me is that i always struggling with the syntax, i will try to implement what you are suggesting. – Jonathan Applebaum Mar 27 '16 at 18:44

1 Answers1

5

In C# code you can find the records you want by

from r1 in sourceDt.AsEnumerable()
join r2 in dtToSearch.AsEnumerable() 
    on new { p1 = r1.Field<int>("cat"), p2 = r1.Field<int>("Site") }
    equals new { p1 = r2.Field<int>("category"), p2 = r2.Field<int>("site") }
    into r2g
select new
{
    r1,
    p2 = (from r2 in r2g
          let token = r2.Field<string>("token")
          where token.Contains(r1.Field<string>("untagged"))
          orderby token.Length descending, token
          select r2).FirstOrDefault()
}

That is, first group join the datatables on the fields they must have in common. The join - into syntax makes this a group join as explained here.

Then, within a group of matching rows, the rows containing the value of untagged in their tokens are filtered, and the one with the longest token is returned (also alphabetically the first one, in case there are ties on length).

Community
  • 1
  • 1
Gert Arnold
  • 105,341
  • 31
  • 202
  • 291
  • thank you very much @Gert, for the code, the explanation and the the link to your answer about groupjoin, exactly what i was looking for. – Jonathan Applebaum Mar 28 '16 at 16:26
  • This is the right answer but one small note: doing an orderby and then firstordefault is sorting an entire list to get a maximum (which probably doesn't matter for performance here but is worth noting). The "proper" solution is a generic extension method MaxBy like that found in moreLINQ. – gnicholas Mar 28 '16 at 17:56
  • @gnicholas I think that most of the times only a small list is sorted for each row, i.e. the list of joined items (4 items in the example). But as this is a speed challenge, tho OP may want to give it a try. – Gert Arnold Mar 28 '16 at 18:14