The aim of my application is to extract text from documents and search for specific entries matching records in a database.
- My application extracts text from documents and populates a textbox with the extracted text.
- Each document can have anywhere from 200 to 600,000 words (including a large amount of normal plain text).
- Extracted text is compared against database entries for specific values and matches are pushed into an array.
- My database contains approximately 125,000 records.
My code below loops through the database records, comparing each one against the extracted text. If a match is found in the text, it is inserted into an array which I use later.

    ' Requires Imports System.IO and Imports System.Data at the top of the file.
    ' Example of the kind of text that ends up in the textbox:
    txtBoxExtraction.Text = "A whole load of text goes in here, " & _
        "including the database entries I am trying to match, " & _
        "i.e. AX55F8000AFXZ and PP-Q4681TX/AA up to 600,000 words"

    Dim dv As New DataView(_DBASE_ConnectionDataSet.Tables(0))
    dv.Sort = "UNIQUEID"
    ' There are 125,000 entries here in my sorted DataView dv e.g.
    ' AX40EH5300
    ' GB46ES6500
    ' PP-Q4681TX/AA

    For i = 0 To maxFileCount
        Dim path As String = Filename(i)

        ' Load the whole document into the textbox.
        If File.Exists(path) Then
            Try
                Using sr As New StreamReader(path)
                    txtBoxExtraction.Text = sr.ReadToEnd()
                End Using
            Catch ex As Exception
                Console.WriteLine("The process failed: {0}", ex.ToString())
            End Try
        End If

        ' Compare every database record against the extracted text.
        For dvRow As Integer = 0 To dv.Table.Rows.Count - 1
            strUniqueID = dv.Table.Rows(dvRow)("UNIQUEID").ToString()
            If txtBoxExtraction.Text.ToLower().Contains(strUniqueID.ToLower()) Then
                ' Add UniqueID to array and do some other stuff..
            End If
        Next dvRow
    Next i

Whilst the code works, I am looking for a faster way of performing the database matching (the 'For dvRow' Loop).
If a document is small (around 200 words), the 'For dvRow' loop completes within a few seconds.
Where the document contains a large amount of text (600,000 words and upwards), it can take several hours or longer to complete.
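One thing that stands out in the loop is that `txtBoxExtraction.Text.ToLower()` is evaluated once per database record, so a large document is fetched from the textbox and lowercased roughly 125,000 times per file. A minimal sketch of the same comparison with the text read once and a case-insensitive search is below. `FindMatches`, `MatchSketch` and the sample values are hypothetical names for illustration only, and `StringComparison.OrdinalIgnoreCase` is an assumption that culture-specific casing does not matter for these IDs.

```vb
Imports System
Imports System.Collections.Generic

Module MatchSketch
    ' Hypothetical helper: returns every ID that occurs in documentText,
    ' ignoring case. The caller reads the document text once, so the loop
    ' never touches the TextBox or lowercases the whole document per record.
    Function FindMatches(documentText As String,
                         ids As IEnumerable(Of String)) As List(Of String)
        Dim matches As New List(Of String)()
        For Each id As String In ids
            ' Skip empty IDs; an empty search string would match everything.
            If String.IsNullOrEmpty(id) Then Continue For
            If documentText.IndexOf(id, StringComparison.OrdinalIgnoreCase) >= 0 Then
                matches.Add(id)
            End If
        Next
        Return matches
    End Function

    Sub Main()
        ' Made-up sample data, not taken from the real database.
        Dim sampleText As String = "Some text with ax55f8000afxz and PP-Q4681TX/AA in it"
        Dim ids As String() = {"AX55F8000AFXZ", "GB46ES6500", "PP-Q4681TX/AA"}
        For Each hit As String In FindMatches(sampleText, ids)
            Console.WriteLine(hit)
        Next
    End Sub
End Module
```

In the original loop this would mean assigning `sr.ReadToEnd()` to a local string and passing that to the helper. It still scans the full text once per record, so it only removes the repeated lowercasing and textbox access rather than changing the overall approach.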
I came across a couple of posts that are similar, but not close enough to my issue to implement any of the recommendations.
- High performance "contains" search in list of strings in C#
- https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa
Any help is most gratefully appreciated.