Parsing the HTML string when reading a SharePoint Multiline field type using PowerShell PNP

Question

I've got a PowerShell script that's pretty good, it reads most field types of SharePoint lists and writes them to a SQL file so I can export a SP list and Import to SQL Server in 2 steps. But Multiline field types always export as HTML such as "<div class="ExternalClass1692C959C37B4AE2B2A5251BCEB89A3A"><p>?<span style="color:#444444;font-family:"segoe ui", "segoe ui", segoe, tahoma, helvetica, arial, sans-serif;">TEXT TO EXTRACT</span>?<br></p></div>"

Is there a way to just read the core text string and ignore all the HTML using PNP? I found a very poorly documented solution that looked like CSOM, most everything else I find on this is how to create a new multi-line column using code.

For other field types I can use logic to find the field type and handle accordingly for names, emails, urls, etc. It flattens arrays of values into delimited strings. But the HTML output is a pain.

#Override the default, if applicable
If($listItem[$title]){
$field = $listItem[$title].ToString()                            
Else {
If(($field -like "Microsoft.SharePoint.Client.*") -and ($field -notlike '*`[`]')){
    If($field.Split('.')[3] -eq 'FieldUserValue') {
        $item += $listItem[$title].LookupValue.ToString()
    } ElseIf($field.Split('.')[3] -eq 'FieldUrlValue') {
        $item += $listItem[$title].URL.ToString()
    } ElseIf ($listItem[$title].LookupValue) {
        $item += $listItem[$title].LookupValue.ToString()
    } Else {
        $item += ''
    }
} ElseIf(($field -like "Microsoft.SharePoint.Client.*") -and ($field -like '*`[`]')){
    If($field.Split('.')[3] -eq 'FieldUserValue') {
        #For each person in the array
        #Get the email value and convert to a name
        foreach($arr_value in $listItem[$title]) {
            $item += $arr_value.Email.ToString().Split("@")[0].Replace(".", " ") -Replace('[^a-zA-Z ]','')
            $item += "$delim"
        }
    } Else {
        #Get each Lookup value in the arracy and add it to a comma-delimited string
        foreach($arr_value in $listItem[$title]) {
            If ($arr_value.LookupValue) {
                $item += $arr_value.LookupValue.ToString()
            } Else {
                $item += ''
            }
            $item += "$delim"
        }
    $item = $item.Substring(0, $item.Length - $($delim).Length) 
    }
}

My current best bet is:

If($field.Substring(0,11) -eq '<div class='){
        $item += $($field.Split('>')[3]).Replace('</span', '')
}

But the output leaves html codes for rendering characters like   in place so I'd have to tack on those "replace" every time I see them and that's not ideal. Is there an HTML parsing logic for PowerShell PNP?

As an aside, [try to avoid the increase assingment operator (`+=`) for building strings](https://stackoverflow.com/a/70093215/1701026) — iRon, Jun 23 '23 at 05:13

iRon · Accepted Answer · 2023-06-24T07:24:01.333

It is generally a bad idea to peek and/or poke in structured text using text methods as replace. Instead, it is better to use a proper (html) parser to manipulate your data. See also: Extract text from html with powershell - bad pattern

function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    } 
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}

$Html = ParseHtml '<div class="ExternalClass1692C959C37B4AE2B2A5251BCEB89A3A"><p>?<span style="color&#58;#444444;font-family&#58;&quot;segoe ui&quot;, &quot;segoe ui&quot;, segoe, tahoma, helvetica, arial, sans-serif;">TEXT TO EXTRACT</span>?<br></p></div>'

To get all the text in the html string (div node):

@($Html.getElementsByTagName('div')).innerText

?TEXT TO EXTRACT?

To get the the text inside the inner span node:

$Div = @($Html.getElementsByTagName('div'))
@($Div.$Html.getElementsByTagName('span'))

Or just:

@($Html.getElementsByTagName('span')).innerText

TEXT TO EXTRACT

Sorry, I deleted that comment because I realized that the question marks in my example data might throw us off. Some "Zero Width Space" unicode characters come through and SQL engine reads this as question marks on import. I solve like this `IF($HTMLval) {$HTMLval = $HTMLval.Replace(([char]8203).ToString(),"")}`, if there's a preferable way I'm interested in knowing it. — n8., Jun 23 '23 at 20:22
I have updated my any to narrow the results to just the `innerText` in the `span` node. — iRon, Jun 24 '23 at 06:58

Parsing the HTML string when reading a SharePoint Multiline field type using PowerShell PNP

1 Answers1