I'm trying to find the location of a substring within a much larger string that contains question mark wildcard characters. The large string is the output of imprecise OCR software, and it contains wildcards wherever the software could tell there was a character but couldn't identify which one.

Here's an oversimplified example of what I'd like to accomplish.

    Dim resultIndex As Integer = -1
    Dim LargeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
    Dim searchText As String = "ABC"
    If searchText Like LargeOcrText Then resultIndex = LargeOcrText.IndexOf(searchText)

This should set resultIndex to 18, but it doesn't work, even if I use searchText = "*ABC*" instead. I'm almost certain regular expressions can do the Like-style comparison, but I'm not very practiced with them, and even then I'm at a complete loss for how to get the index of the substring.

Edit: To be clear, I'm aware that neither Like nor IndexOf supports what I'm trying to do. That's exactly my problem. I'm looking for some other way to code it that does work.

  • IndexOf does not support wildcards. You'll have to use some other matching tool. – Raymond Chen Oct 12 '21 at 17:37
  • @RaymondChen I'm aware, and I would love to use _some other matching tool_, I just haven't found one yet. I've edited my initial post to clarify. – Odin Sonnah Oct 12 '21 at 20:44
  • You don't have *wildcards* - wildcards belong to the text being searched for, and if you had that, you would [convert](https://stackoverflow.com/a/30300521/11683) your Like pattern to regex and examine the [`Index` property](https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.capture.index?view=net-5.0#System_Text_RegularExpressions_Capture_Index). – GSerg Oct 12 '21 at 20:52

2 Answers

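In your search pattern, replace every letter with [<that letter>?] and feed it to Regex. Inside a character class, ? is a literal question mark, so each position matches either the expected letter or the OCR placeholder: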

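    ' Requires: Imports System.Text.RegularExpressions
    Dim resultIndex As Integer = -1
    Dim LargeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
    ' Each class accepts the expected letter or a literal "?".
    Dim searchText As String = "[A?][B?][C?]"

    With Regex.Match(LargeOcrText, searchText)
        ' Index is the zero-based position of the first match.
        If .Success Then resultIndex = .Index
    End With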
GSerg

In addition to GSerg's answer, it is possible to automatically generate the pattern [A?][B?][C?] from ABC.

Here is a working code sample.

    Imports System.Linq
    Imports System.Text.RegularExpressions

    Module Module1

        Sub Main()
            Dim largeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
            Dim searchText As String = "ABC"
            Dim index As Integer = GetOcrIndex(largeOcrText, searchText)
            Debug.WriteLine($"Index = {index}")
        End Sub

        ' Returns the index of the first match of needle in haystack, or -1 if there is none.
        Private Function GetOcrIndex(haystack As String, needle As String) As Integer
            Dim pattern As String = BuildPattern(needle)
            Debug.WriteLine($"Pattern = {pattern}")
            ' IgnoreCase makes the letter comparison case-insensitive.
            Dim match As Match = Regex.Match(haystack, pattern, RegexOptions.IgnoreCase)
            Return If(match.Success, match.Index, -1)
        End Function

        ' Turns "ABC" into "[A?][B?][C?]".
        Private Function BuildPattern(needle As String) As String
            ' SelectMany flattens each bracketed piece back into characters; Concat joins them.
            Return String.Concat(needle.SelectMany(AddressOf AddWildcard))
        End Function

        ' Builds a character class that matches c or a literal "?".
        Private Function AddWildcard(c As Char) As String
            Return $"[{Regex.Escape(c)}?]"
        End Function

    End Module

Output:

    Pattern = [A?][B?][C?]
    Index = 18
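
Regex.Match stops at the first hit. If the OCR text may contain the phrase more than once, a minimal sketch along the same lines could collect every candidate index with Regex.Matches. The helper name GetOcrIndexes and the sample string are made up for illustration and are not part of the code above; the pattern is built the same way, one [<letter>?] class per character.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports System.Text.RegularExpressions

Module OcrSearchSketch

    ' Hypothetical helper: start index of every non-overlapping candidate match.
    Function GetOcrIndexes(haystack As String, needle As String) As List(Of Integer)
        ' Build one character class per needle character: the character itself or a literal "?".
        Dim pattern As New StringBuilder()
        For Each c As Char In needle
            pattern.Append("[").Append(Regex.Escape(c.ToString())).Append("?]")
        Next

        ' Regex.Matches scans left to right and returns non-overlapping matches.
        Dim indexes As New List(Of Integer)()
        For Each m As Match In Regex.Matches(haystack, pattern.ToString())
            indexes.Add(m.Index)
        Next
        Return indexes
    End Function

    Sub Main()
        ' Made-up sample text with two candidate hits for "ABC".
        Dim sampleText As String = "xA?C yy ?BC zz ABD"
        Console.WriteLine(String.Join(", ", GetOcrIndexes(sampleText, "ABC")))
    End Sub

End Module
```

For the sample string this prints `1, 8`. The hits are only candidates: a run such as `???` in the OCR text matches any three-character search term, so it may be worth checking each hit against its surrounding context.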
Ruud Helderman