
I'm trying to extract the portion of HTML between two comments.

Here is the test code:

Sub Main()

    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"

    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"

    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = start_comment & "some more html text" & end_comment 

    Dim match As Match = Regex.Match(input_text, regex_pattern)


    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If

    Console.ReadLine()

End Sub

The above works.

When I try to load the actual data from disk, the code below fails:

Sub Main()

    Dim base_dir As String = "D:\"
    Dim test_file As String = base_dir & "72.htm"

    Dim start_comment As String = "<!-- start of content -->"
    Dim end_comment As String = "<!-- end of content -->"

    Dim regex_pattern As String = start_comment & ".*" & end_comment
    Dim input_text As String = System.IO.File.ReadAllText(test_file).Replace(vbCrLf, "") 

    Dim match As Match = Regex.Match(input_text, regex_pattern)


    If match.Success Then
        Console.WriteLine("found {0}", match.Value)
    Else
        Console.WriteLine("not found")
    End If

    Console.ReadLine()

End Sub

The HTML file contains the start and end comments and a good amount of HTML in-between. Some content in the HTML file is in Arabic.

With thanks and regards.

Code Maverick
MoizNgp

2 Answers


Try passing RegexOptions.Singleline to Regex.Match(...), like this:

Dim match As Match = Regex.Match(input_text, regex_pattern, RegexOptions.Singleline)

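This makes the dot (.) match newline characters too. By default, . matches any character except \n, and Replace(vbCrLf, "") only removes CR+LF pairs, so any bare LF left in the file would stop .* from reaching the end comment.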
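As a minimal sketch of the suggestion above, the following console program applies RegexOptions.Singleline to an in-memory string instead of the file on disk. The sample text, the module name, and the use of Regex.Escape plus a lazy capture group (.*?) are illustrative assumptions, not part of the original question; the lazy group is only there to return the text between the comments and to stop at the first end comment.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineSketch ' hypothetical module name

    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"

        ' Hypothetical stand-in for the file contents; the lines are separated
        ' by bare LF characters, which Replace(vbCrLf, "") would not remove.
        Dim input_text As String = "<html>" & vbLf & start_comment & vbLf &
                                   "<p>some more html text</p>" & vbLf &
                                   end_comment & vbLf & "</html>"

        ' Escape the comment markers so they are matched literally, and
        ' capture everything between them lazily.
        Dim regex_pattern As String = Regex.Escape(start_comment) & "(.*?)" &
                                      Regex.Escape(end_comment)

        ' Singleline lets "." match the LF characters as well.
        Dim m As Match = Regex.Match(input_text, regex_pattern, RegexOptions.Singleline)

        If m.Success Then
            Console.WriteLine("found {0}", m.Groups(1).Value.Trim())
        Else
            Console.WriteLine("not found")
        End If
    End Sub

End Module
```

With this option in place, stripping line breaks before matching should no longer be needed, so the Replace(vbCrLf, "") call could probably be dropped.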

Robbie

I don't know VB.NET, but does . match newlines, or is there an option you have to set for that? Consider using [\s\S] instead of . so that newlines are included.
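A rough sketch of the [\s\S] approach in VB.NET might look like the following. The sample input and the module name are made up for illustration, and the lazy quantifier (*?) is an added assumption so that the match stops at the first end comment.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharClassSketch ' hypothetical module name

    Sub Main()
        Dim start_comment As String = "<!-- start of content -->"
        Dim end_comment As String = "<!-- end of content -->"

        ' Hypothetical multi-line input standing in for the real file.
        Dim input_text As String = start_comment & vbLf & "line one" & vbLf &
                                   "line two" & vbLf & end_comment

        ' [\s\S] matches whitespace or non-whitespace, i.e. any character
        ' including line breaks, so no RegexOptions value is required.
        Dim regex_pattern As String = Regex.Escape(start_comment) & "[\s\S]*?" &
                                      Regex.Escape(end_comment)

        Dim m As Match = Regex.Match(input_text, regex_pattern)

        If m.Success Then
            Console.WriteLine("found {0}", m.Value)
        Else
            Console.WriteLine("not found")
        End If
    End Sub

End Module
```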

Niet the Dark Absol