
The aim of my application is to extract text from documents and search for specific entries matching records in a database.

  1. My application extracts text from documents and populates a textbox with the extracted text.
  2. Each document can have anywhere from 200 to 600,000 words (including a large amount of normal plain text).
  3. Extracted text is compared against database entries for specific values and matches are pushed into an array.
  4. My database contains approximately 125,000 records.

My code below loops through the database records, comparing against the extracted text. If a match is found in the text, it is inserted into an array which I use later.

txtBoxExtraction.Text = "A whole load of text goes in here, " & _
       "including the database entries I am trying to match, " & _
       "i.e. AX55F8000AFXZ and PP-Q4681TX/AA up to 600,000 words"

Dim dv As New DataView(_DBASE_ConnectionDataSet.Tables(0))
dv.Sort = "UNIQUEID"

'There are 125,000 entries here in my sorted DataView dv e.g.
'AX40EH5300
'GB46ES6500
'PP-Q4681TX/AA

For i = 0 To maxFileCount

    Dim path As String = Filename(i)

    If File.Exists(path) Then
        Try
            Using sr As New StreamReader(path)
                txtBoxExtraction.Text = sr.ReadToEnd()
            End Using
        Catch e As Exception
            Console.WriteLine("The process failed: {0}", e.ToString())
        End Try
    End If

    For dvRow As Integer = 0 To dv.Table.Rows.Count - 1
        Dim strUniqueID As String = dv.Table.Rows(dvRow)("UNIQUEID").ToString()
        If txtBoxExtraction.Text.ToLower().Contains(strUniqueID.ToLower()) Then
            ' Add UniqueID to array and do some other stuff..
        End If
    Next dvRow

Next i

Whilst the code works, I am looking for a faster way of performing the database matching (the 'For dvRow' Loop).

If a document is small with around 200 words, the 'For dvRow..' Loop completes quickly, within a few seconds.

Where the document contains a large amount of text (600,000 words and upwards), it can take several hours or longer to complete.

I came across a couple of posts that are similar, but not close enough to my issue to implement any of the recommendations.

  • High performance "contains" search in list of strings in C#
  • https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa

Any help is most gratefully appreciated.

GoodJuJu
  • Do all the IDs you're looking for have a similar structure - eg within a certain length, all caps with numbers, hyphens and / etc? Rather than pulling back all the stuff from sql and doing a `Contains`, you could instead use a regular expression to pull out anything from the text that _might_ be an ID, then search SQL (which will be fast, that's what it's good at, assuming it's indexed correctly) for a match for each one. – James Thorpe Sep 06 '16 at 13:23
  • Hi James, thanks for your quick reply. If I replace all the alpha characters in my IDs with 'A' and all my numeric characters with '1' I end up with 139 unique possibilities. I had thought of regular expressions myself, but I think there are too many possibilities. Do you think the regular expression route would still be faster with the number of possibilities I have? The database grows frequently and I would need to check and create the expressions each time the application is opened. – GoodJuJu Sep 06 '16 at 13:43
  • If that's the actual code, I don't understand why you need to put the information in a textbox. You could save a bit of speed by not displaying the text on screen. If you have 125000 UNIQUEIDs, then it might be better to pull the id from your file and then search from that list. Instead of searching the whole text every time. Even just splitting your text by space and filtering by the "words" that are between a specific size could make it go faster. – the_lotus Sep 06 '16 at 13:59
  • Thanks @the_lotus , you are right, I don't need to put the text into a textbox, however that part of the process is very quick - less than 10 seconds. – GoodJuJu Sep 06 '16 at 14:32
  • How would I pull the ID from my file, without first searching it? Do you mean pull out text that matches a regex first? There would be 139 potential regex patterns. I could, like you say pull out all text that contains a hyphen and is longer than X chars. – GoodJuJu Sep 06 '16 at 14:37
  • I'd extract the Ids to a list(of string) and use ToLower there so it need not be done in the loop. Skip the textbox too. `Add UniqueID to array and do some other stuff..` may not be innocuous either depending on what that means – Ňɏssa Pøngjǣrdenlarp Sep 06 '16 at 14:40
  • If your database is a professional level product, then the fastest way is probably to add a full text index to the UNIQUEID column, stream/bulk insert the text file into a temporary table and then search for it using a SQL `CONTAINS` type function. Depends on the DBMS though. – RBarryYoung Sep 06 '16 at 15:46
  • Hi @RBarryYoung, thanks for your reply. Unfortunately the data is just pulled from an SQL query into a DataGridView - I don't really have a database engine available. I was thinking about integrating SQLite into the application and using a full text index as you mentioned, but SQLite would be new ground for me and whilst I think I may do this in the future, it isn't practical for me to achieve right now. Thanks again! – GoodJuJu Sep 07 '16 at 09:20
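James Thorpe's regex suggestion above could be sketched as follows. The pattern here is a hypothetical example that assumes IDs are 8 to 13 characters of capitals, digits, hyphens and slashes; it would need adjusting to the real ID formats, and each surviving candidate would then be looked up against an indexed UNIQUEID column instead of scanning the whole text per ID:

Imports System.Text.RegularExpressions

Module RegexSketch

    Sub Main()

        Dim text As String = "some text with AX55F8000AFXZ and PP-Q4681TX/AA in it"

        ' Collect each distinct substring that looks like it might be an ID
        Dim candidates As New HashSet(Of String)
        For Each m As Match In Regex.Matches(text, "\b[A-Z0-9/-]{8,13}\b")
            candidates.Add(m.Value)
        Next

        ' Each candidate would then be checked against the database,
        ' e.g. WHERE UNIQUEID = @candidate on an indexed column
        For Each candidate As String In candidates
            Console.WriteLine(candidate)
        Next

    End Sub

End Module

This turns the work around: instead of testing 125,000 IDs against the document, you test a handful of plausible candidates against the database.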

1 Answer


This is an example of the comment I wrote.

If that's the actual code, I don't understand why you need to put the information in a textbox. You could save a bit of speed by not displaying the text on screen. If you have 125000 UNIQUEIDs, then it might be better to pull the id from your file and then search from that list. Instead of searching the whole text every time. Even just splitting your text by space and filtering by the "words" that are between a specific size could make it go faster.

Since it seems you want a word-level check rather than a per-character check, and you only want to check for those IDs rather than every word, you should pull the candidate IDs out of each text before doing any searching. This greatly reduces the amount of searching that needs to be done. The list of IDs could also be cached if the text never changes.

Module Module1

    Private UNIQUEID_MIN_SIZE As Integer = 8
    Private UNIQUEID_MAX_SIZE As Integer = 12

    Sub Main()

        Dim text As String
        Dim startTime As DateTime
        Dim uniqueIds As List(Of String)

        text = GetText()
        uniqueIds = GetUniqueIds()

        '--- Very slow

        startTime = DateTime.Now

        ' Search
        For Each uniqueId As String In uniqueIds
            text.Contains(uniqueId)
        Next

        Console.WriteLine("Took {0}s", DateTime.Now.Subtract(startTime).TotalSeconds)

        '--- Very fast

        startTime = DateTime.Now

        ' Split the text by words
        Dim words As List(Of String) = text.Split(" "c).ToList()

        ' Get all the unique key, assuming keys are between a specific size
        Dim uniqueIdInText As New Dictionary(Of String, String)

        For Each word As String In words
            If word.Length >= UNIQUEID_MIN_SIZE AndAlso word.Length <= UNIQUEID_MAX_SIZE Then
                If Not uniqueIdInText.ContainsKey(word) Then
                    uniqueIdInText.Add(word, "")
                End If
            End If
        Next

        ' Search
        For Each uniqueId As String In uniqueIds
            uniqueIdInText.ContainsKey(uniqueId)
        Next

        Console.WriteLine("Took {0}s", DateTime.Now.Subtract(startTime).TotalSeconds)

        Console.ReadLine()

    End Sub

    ' This only randomly generates words for testing.
    ' You can ignore it.
    Function GetRandomWord(ByVal len As Integer) As String

        Dim builder As New System.Text.StringBuilder
        Dim alphabet As String = "abcdefghijklmnopqrstuvwxyz"
        Dim rnd As New Random()

        For i As Integer = 0 To len - 1
            builder.Append(alphabet.Substring(rnd.Next(0, alphabet.Length), 1))
        Next

        Return builder.ToString()
    End Function

    Function GetText() As String

        Dim builder As New System.Text.StringBuilder
        Dim rnd As New Random()

        For i As Integer = 0 To 600000
            builder.Append(GetRandomWord(rnd.Next(1, 15)))
            builder.Append(" ")
        Next

        Return builder.ToString()
    End Function

    Function GetUniqueIds() As List(Of String)

        Dim ids As New List(Of String)
        Dim rnd As New Random()

        For i As Integer = 0 To 125000
            ids.Add(GetRandomWord(rnd.Next(UNIQUEID_MIN_SIZE, UNIQUEID_MAX_SIZE + 1)))
        Next

        Return ids
    End Function

End Module
the_lotus
  • Wow, @the_lotus what a difference!! Now that's what I call speeding things up!! I was using a 'split' function myself further down in my code, after I had already carried out my original 'contains' search. I thought (incorrectly as it now turns out) that the split would have a negative performance impact. I have now moved it up to the top and it works brilliantly. Thanks also for taking the time to provide a great example in code. As well as helping me, I am sure it will benefit many others who stumble across this question. Thanks also to James Thorpe for his valuable input. – GoodJuJu Sep 07 '16 at 09:28
  • @GoodJuJu the split will have an impact if you only have a few uniqueid in your database. But you have a lot to check. Glad I could help! – the_lotus Sep 07 '16 at 11:14
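As a footnote to the answer: the dictionary of words can equally be a HashSet(Of String) built with a case-insensitive comparer, which makes each lookup O(1) and removes the need for any ToLower calls. This is a minimal sketch along the same lines as the answer, not code from the original post; the sample text and IDs are made up:

Module HashSetSketch

    Sub Main()

        ' IDs as they might come from the DataView (sketch data)
        Dim uniqueIds As String() = {"AX40EH5300", "GB46ES6500", "PP-Q4681TX/AA"}

        Dim text As String = "report mentions pp-q4681tx/aa among many other words"

        ' Build a case-insensitive set of the words in the document once
        Dim wordsInText As New HashSet(Of String)(
            text.Split(" "c), StringComparer.OrdinalIgnoreCase)

        ' Each lookup is now a hash probe instead of scanning the whole text
        For Each id As String In uniqueIds
            If wordsInText.Contains(id) Then
                Console.WriteLine("Matched: " & id)
            End If
        Next

    End Sub

End Module

Note this only matches IDs that appear as whole space-delimited words, the same assumption the answer's Split approach makes.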