
I am new to this forum and hoping to get some help. I have an HTML string containing text and several base64 images. I need to loop through all the image tags, adding a slash / before the closing > so that each image tag ends with />, and return a new HTML string with the changes.

so each

<IMG src="...."> 

should then be

<IMG src="...."/>

I am not well versed in HTML and I am wondering how to do this (using regex?). Here is some pseudocode:

    Function GetSourceImges(Sourcehtml As String) As List(Of String)
        Dim listOfImgs As New List(Of String)()
        'use regex to find image tags
        'return list of base64 image tags
    End Function

    For each image in list
        insert a slash appropriately
    Next

Then reconstitute a new HTML string with the edited images. Thanks.

Gbhskk
  • SO is not a forum, it is a Q&A site. It seems you have access to a DOM structure, what package are you using? It looks like VB.NET. Please add relevant tags to the question so that the right users could see this question. – Wiktor Stribiżew Jun 10 '18 at 20:34
  • Thanks. I just subscribed and did not understand tags. As a newbie I am using VB.NET and partly C#, so should the tags be VB.NET and C#? – Gbhskk Jun 10 '18 at 20:48
  • I added VB.NET tag since you posted the code in VB.NET. However, what is the code you tried to modify the tags? The one you have only shows how you extract and set src attribute values, which seems irrelevant to the question. Please update, or the question will be closed as off-topic. – Wiktor Stribiżew Jun 10 '18 at 20:50
  • There are some amusing and some detailed answers regarding parsing HTML with regexes at [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/q/1732348/1115360). A more reliable way is mentioned in [How do I use HTML Agility Pack to edit an HTML snippet](https://stackoverflow.com/q/9520932/1115360). – Andrew Morton Jun 10 '18 at 20:57
  • OK just edited my question. I missed a section of it when copying from text editor. – Gbhskk Jun 10 '18 at 21:13

3 Answers


Map all "IMG" tags using LINQ and use their indexes as anchors to insert the missing "/" characters. Please see my comments inside the code.

Imports System.Linq
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim htmlstring As String = "<IMG src=""...."">" & vbCrLf _
            & "<img src=""...."">" & vbCrLf _
            & "<p>blahblah</p>" & vbCrLf _
            & "<IMG src=""...."">" & vbCrLf _
            & "<p>blahblah</p>"

        ' Find the indexes of all "IMG" occurrences using regex and a lambda expression.
        Dim indexofIMG() As Integer = Regex.Matches(htmlstring, "IMG", RegexOptions.IgnoreCase) _
            .Cast(Of Match)().Select(Function(x) x.Index).ToArray()

        ' Walk the matches from last to first so that inserting a character
        ' does not shift the indexes that are still to be processed.
        For i As Integer = indexofIMG.Length - 1 To 0 Step -1
            Dim counter As Integer = indexofIMG(i)
            ' Scan up to and including the last character of the string.
            While counter < htmlstring.Length
                If htmlstring(counter) = ">"c Then
                    If htmlstring(counter - 1) <> "/"c Then
                        ' Fix the missing "/" using the Insert() method.
                        htmlstring = htmlstring.Insert(counter, "/")
                    End If
                    Exit While
                End If
                counter += 1
            End While
        Next

        Console.WriteLine(htmlstring)
        Console.ReadLine()
    End Sub
End Module
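
A more compact variant of the same idea, offered only as a sketch: a single `Regex.Replace` that touches `img` tags and nothing else. The module and function names (`ImgTagFixer`, `CloseImgTags`) are hypothetical, and the pattern assumes no attribute value contains a literal `>` (base64 data does not). For arbitrary HTML an HTML parser remains the safer choice.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ImgTagFixer
    ' Hypothetical helper: rewrites <img ...> and <img .../> as <img ... />,
    ' leaving every other tag untouched.
    Function CloseImgTags(html As String) As String
        ' (<img\b[^>]*?) : the tag name (any case) plus its attributes, captured as $1
        ' \s*/?>         : optional whitespace, an optional existing slash, then ">"
        Return Regex.Replace(html, "(<img\b[^>]*?)\s*/?>", "$1 />", RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim html As String = "<P>John Doe</P><IMG src=""...."">&nbsp;Jackson5<img src=""....""/>"
        ' Both image tags end with " />" afterwards; the <P> tags are unchanged.
        Console.WriteLine(CloseImgTags(html))
    End Sub
End Module
```

Because the replacement is done in one pass, there are no stored indexes that could go stale when characters are inserted.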
Jonathan Applebaum
  • Like I said, the htmlstring has other tags, not only images (`<p>blahblah</p>` will be replaced wrongly), so there needs to be some looping to identify the image tags, do the edit, and reconstitute a new string. Any help? – Gbhskk Jun 11 '18 at 05:03
  • Pabdev [Here](https://stackoverflow.com/questions/39785600/iterate-through-an-html-string-to-find-all-img-tags-and-replace-the-src-attribut?rq=1) is closer to my requirement – Gbhskk Jun 11 '18 at 05:19
  • The reason for my question is that I'm using an iTextSharp CustomImageTagProcessor (with XMLWorker), which for some unknown reason will display a base64 image whose tag ends with " />" but not one ending with ">". Thanks. – Gbhskk Jun 11 '18 at 17:28
  • @Gbhskk welcome to Stack Overflow. If you found the answer helpful, please mark it as accepted by clicking the checkmark icon. – Jonathan Applebaum Jun 11 '18 at 17:46
  • Surprisingly, only the first image tag is modified. The logic seems correct. – Gbhskk Jun 12 '18 at 01:55
  • Maybe I forgot to ignore case (img is not IMG). I have made a small change in my answer: `Regex.Matches(htmlstring, "IMG", RegexOptions.IgnoreCase)`. Was that the problem? – Jonathan Applebaum Jun 12 '18 at 03:17

Surprisingly, it works in the console app but not when I view the result in a RichTextBox, as in the btnEditHTML_Click method below. The generated PDF has only one red dot instead of two. I can't say why. I must say you have been very helpful.

SetTable and CustomImageTagProcessor are borrowed from the question "iTextsharp base64 embedded image in header not parsing/showing".

Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.tool.xml
Imports iTextSharp.text.pdf
Imports iTextSharp.tool.xml.parser
Imports iTextSharp.tool.xml.pipeline.css
Imports iTextSharp.tool.xml.pipeline.html
Imports iTextSharp.tool.xml.pipeline.end
Imports iTextSharp.tool.xml.html
Imports System.Text.RegularExpressions

Public Class Form1

    Dim dsktop As String = My.Computer.FileSystem.SpecialDirectories.Desktop
    Public Function GetFormattedHTML(str As String) As String
        'format images by changing > to />
        ' find all indxes of img using regex and lambda exprations '
        Dim indexofIMG() As Integer = Regex.Matches(str.ToString, "IMG", RegexOptions.IgnoreCase) _
        .Cast(Of Match)().Select(Function(x) x.Index).ToArray()

        ' check from each index of "IMG" if "/" is missing '
        For Each itm As Integer In indexofIMG
            Dim counter As Integer = itm
            While counter < str.ToString.Length - 1
                If str(counter) = ">" Then
                    If str(counter - 1) <> "/" Then
                        ' fix the missing "/" using Insert() method '
                        str = str.ToString.Insert(counter, " /")
                    End If
                    Exit While
                End If
                counter += 1
            End While
        Next
        Return str.ToString
    End Function
    Private Sub btnEditHTML_Click(sender As Object, e As EventArgs) Handles btnEditHTML.Click
        Rtb.Text = String.Empty
        'the 2 base64 images in the html below are actually just small red dots
        Dim RawHTML As String = "<P>John Doe</P><IMG " &
        "src="""">&nbsp;Jackson5<IMG " &
        "src="""">"
        Rtb.Text = GetFormattedHTML(RawHTML)
        'notice that the 2nd base64 string is not edited as required. 
    End Sub

    Private Sub btnGenerate_Click(sender As Object, e As EventArgs) Handles btnGenerate.Click
        'here I create a 2 column itextsharp table to parse my html into the cells

        Dim doc As New iTextSharp.text.Document(iTextSharp.text.PageSize.A4, 25, 25, 25, 30)
        Dim wri As PdfWriter = PdfWriter.GetInstance(doc, New System.IO.FileStream(dsktop & "\testtable.pdf", System.IO.FileMode.Create))
        doc.Open()
        'set table columnwidths -------------------------------------------------------------
        Dim MainTable As New PdfPTable(2) '2 column table
        MainTable.WidthPercentage = 100
        Dim Wth(1) As Single
        Dim u As Integer = 2
        For i As Integer = 0 To 1
            Wth(i) = CInt(Math.Floor(2 * 500 / u))
        Next
        MainTable.SetWidths(Wth)

        Dim htmlstr As String = GetFormattedHTML("<P>John Doe</P><IMG " &
        "src="""">&nbsp;Jackson5<IMG " &
        "src="""">")

        Dim Elmts = New ElementList()
        Elmts = XMLWorkerHelper.ParseToElementList(htmlstr, Nothing)
        Dim MinorTable As New PdfPTable(1)
        MinorTable = SetTable(Elmts, htmlstr)

        For i = 1 To 2
            Dim Cell As New PdfPCell
            Cell.AddElement(MinorTable)
            MainTable.AddCell(Cell)
        Next
        doc.Add(MainTable)
        doc.Close()

        Process.Start(dsktop & "\testtable.pdf")

    End Sub
    Public Function SetTable(ByVal elements As ElementList, ByVal htmlcode As String) As PdfPTable

        Dim tagProcessors As DefaultTagProcessorFactory = CType(Tags.GetHtmlTagProcessorFactory(), DefaultTagProcessorFactory)
        tagProcessors.RemoveProcessor(HTML.Tag.IMG) ' remove the default processor
        tagProcessors.AddProcessor(HTML.Tag.IMG, New CustomImageTagProcessor()) ' use our new processor

        Dim cssResolver As ICSSResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(True)
        cssResolver.AddCssFile(Application.StartupPath & "\pdf.css", True)
        'see sample css file at https://learnwebcode.com/how-to-create-your-first-css-file/

        'Setup Fonts
        Dim xmlFontProvider As XMLWorkerFontProvider = New XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS)
        xmlFontProvider.RegisterDirectory(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "assets/fonts/"))

        Dim cssAppliers As CssAppliers = New CssAppliersImpl(xmlFontProvider)

        Dim htmlContext As HtmlPipelineContext = New HtmlPipelineContext(cssAppliers)
        htmlContext.SetAcceptUnknown(True)
        htmlContext.SetTagFactory(tagProcessors)

        Dim pdf As ElementHandlerPipeline = New ElementHandlerPipeline(elements, Nothing)
        Dim htmlp As HtmlPipeline = New HtmlPipeline(htmlContext, pdf)
        Dim css As CssResolverPipeline = New CssResolverPipeline(cssResolver, htmlp)

        Dim worker As XMLWorker = New XMLWorker(css, True)
        Dim p As XMLParser = New XMLParser(worker)

        'Dim holderTable As New PdfPTable({1})
        Dim holderTable As PdfPTable = New PdfPTable({1})
        holderTable.WidthPercentage = 100
        holderTable.HorizontalAlignment = Element.ALIGN_LEFT

        Dim holderCell As New PdfPCell()
        holderCell.Padding = 0
        holderCell.UseBorderPadding = False
        holderCell.Border = 0

        p.Parse(New MemoryStream(System.Text.Encoding.ASCII.GetBytes(htmlcode)))

        For Each el As IElement In elements
            holderCell.AddElement(el)
        Next
        holderTable.AddCell(holderCell)
        'Dim holderRow As New PdfPRow({holderCell})
        'holderTable.Rows.Add(holderRow)
        Return holderTable

    End Function

End Class

Public Class CustomImageTagProcessor
    Inherits iTextSharp.tool.xml.html.Image
    Public Overrides Function [End](ctx As IWorkerContext, tag As Tag, currentContent As IList(Of IElement)) As IList(Of IElement)
        Dim attributes As IDictionary(Of String, String) = tag.Attributes
        Dim src As String = String.Empty
        If Not attributes.TryGetValue(iTextSharp.tool.xml.html.HTML.Attribute.SRC, src) Then
            Return New List(Of IElement)(1)
        End If

        If String.IsNullOrEmpty(src) Then
            Return New List(Of IElement)(1)
        End If

        If src.StartsWith("data:image/", StringComparison.InvariantCultureIgnoreCase) Then
            ' data:[<MIME-type>][;charset=<encoding>][;base64],<data>
            Dim base64Data As String = src.Substring(src.IndexOf(",") + 1)
            Dim imagedata As Byte() = Convert.FromBase64String(base64Data)
            Dim image As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(imagedata)

            Dim list As List(Of IElement) = New List(Of IElement)()
            Dim htmlPipelineContext As pipeline.html.HtmlPipelineContext = GetHtmlPipelineContext(ctx)
            list.Add(GetCssAppliers().Apply(New Chunk(DirectCast(GetCssAppliers().Apply(image, tag, htmlPipelineContext), iTextSharp.text.Image), 0, 0, True), tag, htmlPipelineContext))
            Return list
        Else
            If File.Exists(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src)) Then
                Dim imagedata As Byte() = File.ReadAllBytes(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src))
                Dim image As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src))

                Dim list As List(Of IElement) = New List(Of IElement)()
                Dim htmlPipelineContext As pipeline.html.HtmlPipelineContext = GetHtmlPipelineContext(ctx)
                list.Add(GetCssAppliers().Apply(New Chunk(DirectCast(GetCssAppliers().Apply(image, tag, htmlPipelineContext), iTextSharp.text.Image), 0, 0, True), tag, htmlPipelineContext))
                Return list
            End If
            Return MyBase.[End](ctx, tag, currentContent)
        End If
    End Function
End Class
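
One possible explanation for the second tag being skipped, offered as a reading of the code above rather than something confirmed in the thread: the loop condition `counter < str.ToString.Length - 1` stops one character short, so a `>` that is the very last character of the string is never inspected. The console sample ends with `</p>`, whereas `RawHTML` ends with the second image's `>`. The sketch below keeps the same approach with the bound changed to `Length`; the names `LoopBoundDemo` and `FixImgTags` are hypothetical, and processing the matches from last to first is an added precaution so that inserts cannot shift indexes still to be used.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module LoopBoundDemo
    ' Same approach as GetFormattedHTML, with the loop bound changed to Length.
    Function FixImgTags(html As String) As String
        Dim indexes() As Integer = Regex.Matches(html, "<img\b", RegexOptions.IgnoreCase) _
            .Cast(Of Match)().Select(Function(m) m.Index).ToArray()

        ' Last match first, so an insert never shifts an index that is still needed.
        For i As Integer = indexes.Length - 1 To 0 Step -1
            Dim counter As Integer = indexes(i)
            ' Was "Length - 1", which skipped the final character of the string.
            While counter < html.Length
                If html(counter) = ">"c Then
                    If html(counter - 1) <> "/"c Then
                        html = html.Insert(counter, " /")
                    End If
                    Exit While
                End If
                counter += 1
            End While
        Next
        Return html
    End Function

    Sub Main()
        ' Same shape as RawHTML: the string ends with the second image's ">".
        Console.WriteLine(FixImgTags("<P>John Doe</P><IMG src="""">&nbsp;Jackson5<IMG src="""">"))
    End Sub
End Module
```

With this bound, both `IMG` tags in the sample end with ` />`.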
Gbhskk

I highly recommend just using AngleSharp to parse the HTML, edit the document if required, and save it again.

There are many posts on here about why trying to parse HTML with regular expressions is a bad idea.

var doc = new HtmlParser().Parse(html);

As you aren't actually changing the HTML content, just fixing up the tags, you should be able to just parse it and save it with no changes to fix the tags.

George Helyar