2

I'm working on a VBA routine to import stuff from Excel into Sparx Enterprise Architect.

One of the challenges is to get the formatted text (bold, italic and underline) translated. EA uses some kind of html like formatting tags for it's text. So this:

this text has bold and italic

has to be translated to this:

this text has <b>bold</b> and <i>italic</i>

If found this routine on in another question that I slightly modified to fit my need. It does exactly what I need, but it is excruciating slow

'-------------------------------------------------------------
' Author:   Geert Bellekens (copied from stackoverflow: https://stackoverflow.com/questions/29916992/extract-text-content-from-cell-with-bold-italic-etc)
' Date:     02/09/2019
' Description: Returns a html formatted string for the (formatted) text in a cell
'-------------------------------------------------------------
Public Function getHTMLFormattedString(r As range) As String
 Dim startTimeStamp As Double
 startTimeStamp = Timer
 Dim isBold As Boolean
 Dim isItalic As Boolean
 Dim isUnderlined As Boolean
 isBold = False
 isItalic = False
 isUnderlined = False
 Dim text As String
 text = ""
 Dim cCount As Integer
 cCount = 0
 Dim modifiers As New Collection
 On Error Resume Next
 cCount = r.Characters.Count
 On Error GoTo 0
 If cCount > 0 Then
    For i = 1 To cCount
        Set c = r.Characters(i, 1)
        If isBold And Not c.Font.Bold Then
            isBold = False
            text = removeModifier("b", text, modifiers)
        End If
        If isItalic And Not c.Font.Italic Then
            isItalic = False
            text = removeModifier("i", text, modifiers)
        End If
        If isUnderlined And c.Font.Underline = xlUnderlineStyleNone Then
            isUnderlined = False
             text = removeModifier("u", text, modifiers)
        End If
        If c.Font.Bold And Not isBold Then
            isBold = True
            text = addModifier("b", text, modifiers)
        End If
        If c.Font.Italic And Not isItalic Then
            isItalic = True
            text = addModifier("i", text, modifiers)
        End If
        If Not (c.Font.Underline = xlUnderlineStyleNone) And Not isUnderlined Then
            isUnderlined = True
            text = addModifier("u", text, modifiers)
        End If
        text = text & c.text
        If i = cCount Then
            text = closeAllModifiers(text, modifiers)
        End If
    Next i
 Else
    text = r.text
    If r.Font.Bold Then
        text = "<b>" & text & "</b>"
    End If
    If r.Font.Italic Then
        text = "<i>" & text & "</i>"
    End If
    If Not (r.Font.Underline = xlUnderlineStyleNone) Then
        text = "<u>" & text & "</u>"
    End If
 End If
 'replace newline with CRLF
 text = Replace(text, Chr(10), vbNewLine)
 'return
 getHTMLFormattedString = text
 'get processingtime
 MsgBox "processed " & Len(text) & " characters in " & Format(Timer - startTimeStamp, "00.00") & " seconds"
End Function

I tested this code with a lorem ipsum string of 1000 characters, without any formatting, and that processes in 4.89 seconds.

Question: What can I do to improve the performance?

  • Is there a better way to loop all characters?
  • Can I somehow detect if a cell has no formatting at all (and thus skip the whole routine)?
Geert Bellekens
  • 12,788
  • 2
  • 23
  • 50
  • Sorry but this code is idiotic. You are testing every character of every string for bold, italic and underline. You should first split each string into entirely bold, entirely not bold, mixed bold and not bold. Only for mixed bold and not bold do you need to examine every character. Ditto for italic and underline. Even with better code, examining every character of a string is slow but, if you only do so when necessary, the total runtime can be minimised. – Tony Dallimore Oct 09 '19 at 08:48
  • @TonyDallimore No sure how I should do that. How do I detect if the string is mixed bold/not bold? If I'm not mistaken the properties Font.Bold and Font.Italic are booleans, so for any range of characters it would return either True (I guess if the whole range is bold) or False (if not the whole range is Bold?) but never "mixed" – Geert Bellekens Oct 09 '19 at 08:53
  • 1
    `Range.Characters.Font.Bold` is `Null` when some of the characters are bold. If you right-click `.Bold` and click on Definition, you can see that it is Variant instead of Boolean – Slai Oct 09 '19 at 08:56
  • You are wrong about Booleans, those are Variant with Null when mixed. This way you can detect if any of the 3 formatting types were applied. Also there is no short circuiting in VBA - you can split those if's testing booleans first. – BrakNicku Oct 09 '19 at 08:56
  • @Slai @ BrakNicku That's good news. That will at least allow me to skip the whole thing if there is no formatting at all. And then it could still be optimized by splitting the string in parts, each time checking the formatting of each part. Thanks – Geert Bellekens Oct 09 '19 at 09:00
  • You can also test all cells on the sheet or used range at once with `Range.Font` – Slai Oct 09 '19 at 09:07
  • I have a VBA routine that performs this conversion but it has some weaknesses for you, (1) it handles all Excel formatting; font, font face, font size, interior and so on. I assume this is overload for you. (2) It handles overlapping formatting which again may be overload. If not, I doubt this routine would handle overlaps.(3) I wrote it years ago before I knew how to use a StringBuilder class within VBA. – Tony Dallimore Oct 09 '19 at 09:28
  • Have you had a look at [Range.Value(ValueType)](https://fastexcel.wordpress.com/2017/03/13/excel-range-valuevaluetype-what-is-this-parameter/). You could use `xlRangeValueXMLSpreadsheet` value type (which gives you the XML for your cell). You could potentially then extract what you need from the XML – Zac Oct 09 '19 at 09:39

3 Answers3

3

As I suggested in my comments, I used Range.Value(Value.Type) to extract XML and then created a parser to extract the text in HTML format

Sub ConvertCellTextToHTML()

    Dim oWS As Worksheet: Set oWS = ThisWorkbook.Worksheets("Sheet3")                   '< Change worksheet name
    Dim iLastRow As Long: iLastRow = oWS.Range("A" & oWS.Rows.Count).End(xlUp).Row
    Dim oRng As Range: Set oRng = oWS.Range("A1:A" & iLastRow)                          '< Change range as required
    Dim oCell As Range
    Dim oXml As MSXML2.DOMDocument                                                      '< Requires reference to Microsoft XML
    Dim sCellXML As String, sHTMLString As String

    ' Loop to go through all cells in the range
    For Each oCell In oRng.Cells

        ' Load XML for current cell
        Set oXml = New MSXML2.DOMDocument
        oXml.LoadXML (oCell.Value(xlRangeValueXMLSpreadsheet))

        ' Capture the XML just for the cell
        sCellXML = oXml.SelectSingleNode("/Workbook/Worksheet/Table/Row/Cell").XML

        ' Call the function to return HTML formated string
        sHTMLString = ExtractTextWithFont(sCellXML)
        Debug.Print sHTMLString
    Next

End Sub

Function ExtractTextWithFont(ByVal sXMLString As String) As String

    Dim sRetVal As String
    Dim aXML As Variant
    Dim iC As Long

    ' Split XML string
    aXML = Split(sXMLString, ">")

    ' Loop to go through all elements in the array - starting from third element because first 2 are just headers from what i can see
    For iC = LBound(aXML) + 2 To UBound(aXML)

        ' Building string - this is based on strings that i tested. You might need to amend this bit to meet your needs
        If Mid(Replace(Trim(LCase(aXML(iC))), "/", ""), 2, 4) <> "font" Then
            If Left(LCase(Trim(aXML(iC))), 4) <> "</ss" And Left(LCase(Trim(aXML(iC))), 4) <> "</ce" Then
                If Left(aXML(iC), 1) = "<" Then
                    sRetVal = sRetVal & Replace(aXML(iC), "</Font", "") & ">"
                Else
                    sRetVal = sRetVal & Replace(aXML(iC), "</Font", "")
                End If
                If LCase(Right(Trim(sRetVal), 6)) = "</data" Then
                    sRetVal = Mid(Trim(aXML(iC)), 1, Len(Trim(aXML(iC))) - 6)
                End If
            End If
        End If

    Next

    ' Set return value
    ExtractTextWithFont = sRetVal

End Function
Zac
  • 1,924
  • 1
  • 8
  • 21
2

For your second question: To check if text is unformatted (No bold, italic or underline), you may use:

If r.Font.FontStyle = "Regular" And r.Font.Underline = xlUnderlineStyleNone Then 
    getHTMLFormattedString = r.text
    Exit Function
End If

You may also check if text contains mix formatting using following code:

If IsNull(r.Font.Bold) Or IsNull(r.Font.Italic) Or IsNull(r.Font.Underline) Then 
'multiple format, check character-wise formatting here
Dhirendra Kumar
  • 333
  • 2
  • 5
  • 1
    I think this will handle 99% of cases and the 1% High-CPU won't be noticeable. To expand on the strategy of "Range-Wide" (instead of per character) formatting the OP can borrow some more formats from me: https://stackoverflow.com/questions/39210329/export-the-datagridview-to-excel-with-all-the-cells-format – Jeremy Thompson Oct 09 '19 at 09:24
  • This is the one I ended up using. Minimal change to the code with a major improvement of performance (since about 90% of the data doesn't have any formatting anyway) – Geert Bellekens Oct 09 '19 at 14:30
1

The Microsoft Office libraries is unusually one of the slowest ways to read and write Office documents (I am not aware of any slower way). Also, VBA is limited to single thread. Most of the delay is from the communication between the Office library and the language. In your case there are multiple calls to the Excel library per each character.
Much faster alternatives are libraries like OpenXML, ClosedXML, EPPlus, etc. Most of them are limited to xlsx file format, but the NPOI library seems to support xls too. I am not sure if there is a way to use any of them with VBA, as most of them are for .Net and Java.

If for some reason you are limited to VBA, Range.Value(11) can be used to retrieve formatting information in XML Spreadsheet format that can be processed with VBA and XML processing library.

Another alternative could be to save the file as html or mhtml and process that.

Slai
  • 22,144
  • 5
  • 45
  • 53
  • If you are going t suggest using a StringBuilder class, which is good, you ought to say where the OP can find one. – Tony Dallimore Oct 09 '19 at 09:19
  • For now I'm going to stay with VBA, but the suggestion to us the formatting info in xml format is certainly valuable. I'll look into that. – Geert Bellekens Oct 09 '19 at 09:44