
There is a news website I frequent that has a series of headlines on its main page. Clicking a headline takes you to the individual story. I am trying to write a PowerShell script that loops through all the headlines on the main page and writes each story to a text file.

The problem is that the stories are in Spanish, and the accented Spanish characters do not show up properly in my text file (the weird thing is that sometimes they do, but most of the time they don't). I've checked the headers of each story and the charset is set to UTF-8, so I think the web pages themselves are encoded correctly. I've tried every way I know of to write the output file as UTF-8 as well, but I can't get it fixed.

Anyone have any ideas? Here is the code:

$ie = New-Object -ComObject 'InternetExplorer.Application'
$url = "https://www3.nhk.or.jp/nhkworld/es/news/"

#$ie.Visible = $true
$ie.Navigate($url)
while($ie.busy) {Start-Sleep 1}

$file = "C:\temp\nhk.txt"
if(Test-Path $file) { Remove-Item $file }

$lastLink = $null
foreach($link in $ie.Document.getElementsByTagName("a")) {
    if($link.href -match "\d{6}") { #the links to the stories we want are numbered with 6 digits
        if(-not($link.href -eq $lastLink)) {
            $uri = $link.href
            $w = Invoke-WebRequest -Uri $uri

            ForEach($element in $w.AllElements | where tagname -eq "p") {
                $text = $element | select -expand innerText
                $text = $text + "`r`n"
       
                Add-Content -Path $file -Value $text
                }
           
            $lastLink = $link.href
            }
        }
       
    }
James_
  • [Add-Content](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/add-content) has an `-Encoding` parameter.. – Theo Feb 27 '21 at 16:40
  • @Theo Yup, I tried that parameter with UTF8 and it didn't help. I'm wondering if the problem is actually occurring before writing to the file? I've used the debugger to step through the code, and when I view the contents of $w.AllElements, for example, the characters don't show up correctly either. But I wasn't sure if that is just because of the PowerShell ISE? – James_ Feb 27 '21 at 16:55
  • Can you give the link to a specific article which is failing? – mclayton Feb 27 '21 at 18:05
  • @mclayton If you go to https://www3.nhk.or.jp/nhkworld/es/news/ there are 7 headlines on that page. They are all failing. Here is the first one: https://www3.nhk.or.jp/nhkworld/es/news/286214/ – James_ Feb 27 '21 at 18:13
  • What happens if you do `$w = Invoke-WebRequest -Uri $url -ContentType 'application/x-www-form-urlencoded; charset=utf-8'` (just tested the first url and that came out fine..) – Theo Feb 28 '21 at 12:07
  • @Theo I can give that a try too. What is weird, as I explained to mclayton, is that my code actually does output fine sometimes. I've noticed on the weekends it _usually_ works fine. Weekdays is when the output is generally guaranteed to not be correct. I have no idea why it would work some days and others no. The only difference I see is that weekends have fewer stories on the main page. – James_ Feb 28 '21 at 15:39

2 Answers


I think it's the same basic problem as this question:

PowerShell Invoke-RestMethod Umlauts issues with UTF-8 and Windows-1252

The issue is the server is sending a response which is encoded using UTF8, but it's not correctly setting the Content-Type header to tell the client it's doing that, so the client is assuming it's encoded with the default ISO-8859-1 encoding.

This means that, for example, the character ó is sent by the server as the UTF-8 byte sequence C3 B3, but the client decodes those two bytes as ISO-8859-1, which turns them into Ã³.
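The mismatch can be reproduced locally without touching the server. The following is a minimal sketch, not part of the original answer; the variable names are illustrative, and the character is built from its code point so the script does not depend on how the script file itself is saved.

```powershell
# Reproduce the mismatch locally; no network access is needed.
$original  = [string][char]0x00F3                        # o with acute accent (U+00F3)
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($original)

# Show the bytes the server would put on the wire.
$hex = ($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
Write-Host "UTF-8 bytes: $hex"                           # UTF-8 bytes: C3 B3

# Decode those same bytes with the wrong encoding, as the client does.
$latin1  = [System.Text.Encoding]::GetEncoding('ISO-8859-1')
$mangled = $latin1.GetString($utf8Bytes)
Write-Host "Decoded as ISO-8859-1: $mangled"
```

One character has become two (U+00C3 followed by U+00B3), which is the pattern that appears in the output file.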

Since you presumably can't control the server's behaviour, you might need to do some processing on the mangled text to recover the original version. I posted one way of doing that in an answer to the question above (see https://stackoverflow.com/a/58542493/3156906), but here it is again...

PS> $text = "Ã³"
PS> $bytes = [System.Text.Encoding]::GetEncoding("ISO-8859-1").GetBytes($text)
PS> $text = [System.Text.Encoding]::UTF8.GetString($bytes)
PS> write-host $text
ó
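The question's comments mention that the output is sometimes already correct, and re-decoding text that was decoded properly would damage it. Below is a hedged sketch of a guarded variant of the same round trip. `Repair-Utf8Text` is a hypothetical helper name, not a built-in cmdlet, and the fallback rules are assumptions: text is returned unchanged if it contains characters that ISO-8859-1 cannot produce, or if its bytes are not valid UTF-8.

```powershell
function Repair-Utf8Text {
    param([string]$Text)

    if ([string]::IsNullOrEmpty($Text)) { return $Text }

    # Characters above U+00FF cannot come from an ISO-8859-1 decode,
    # so such text was not mangled in this particular way.
    foreach ($ch in $Text.ToCharArray()) {
        if ([int]$ch -gt 0xFF) { return $Text }
    }

    $latin1 = [System.Text.Encoding]::GetEncoding('ISO-8859-1')
    # Strict decoder: throws on invalid UTF-8 instead of substituting U+FFFD.
    $strictUtf8 = New-Object System.Text.UTF8Encoding -ArgumentList $false, $true

    try {
        return $strictUtf8.GetString($latin1.GetBytes($Text))
    }
    catch {
        # The bytes are not valid UTF-8, so the text was probably decoded correctly already.
        return $Text
    }
}

# Usage: mangled text is repaired, correct text passes through untouched.
$mangled = [string][char]0xC3 + [string][char]0xB3
$fixed   = Repair-Utf8Text $mangled
$same    = Repair-Utf8Text ([string][char]0xF3)
```

In the question's loop this would be applied to each paragraph before writing, for example `$text = Repair-Utf8Text $element.innerText`. The guard is a heuristic: a correctly decoded string that happens to form valid UTF-8 when re-encoded would still be changed.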
mclayton
  • Thanks for this I will give it a try. I am wondering if the server response is sometimes setting Content-Type correctly because sometimes my code does work and outputs perfectly (I've noticed mostly on the weekends it seems to work but weekdays it typically fails). I wonder if NHK is not consistent in this. For example, I just tried with my unedited code this morning (Sunday) and it worked. I will add your code and try on Monday and see if it works as well. Thanks! – James_ Feb 28 '21 at 06:55
  • If you post a link that works as well we can compare the two side by side in a HTTP trace tool like Fiddler... – mclayton Feb 28 '21 at 15:43
  • This seems to have fixed it! I've tried it yesterday and today and now the characters are outputted correctly. Thanks! – James_ Mar 03 '21 at 02:08

Did some more experimenting, and apparently, if you send the result of the Invoke-WebRequest call straight to a file using the -OutFile parameter, this file gets written in UTF8.

This should then (hopefully) do it:

# create a temporary file
$tempFile = (New-TemporaryFile).FullName

if(-not($link.href -eq $lastLink)) {
    $uri = $link.href
    # download the raw bytes to the temp file (a single request is enough)
    Invoke-WebRequest -Uri $uri -OutFile $tempFile
    # read the file with encoding UTF8
    $content = Get-Content -Path $tempFile -Raw -Encoding UTF8
    # parse the html
    $html = New-Object -Com "HTMLFile"
    $html.IHTMLDocument2_write($content)
    # and append the innerText to your file "C:\temp\nhk.txt"
    Add-Content -Path $file -Value $html.body.innerText
    Add-Content -Path $file -Value '' # add extra empty line

    # clean up
    [System.Runtime.Interopservices.Marshal]::ReleaseComObject($html) | Out-Null
    [System.GC]::Collect()
    [System.GC]::WaitForPendingFinalizers()
    $html = $null

       
    $lastLink = $link.href
}

# remove the temp file
Remove-Item -Path $tempFile -Force
Theo
  • If you write to a file using `-OutFile` it just streams the bytes as-is without trying to decode the content, so you don't hit the mismatch where it's trying to decode a UTF8 byte stream using the ISO-8859-1 decoder. – mclayton Mar 03 '21 at 09:40
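
The effect described in that last comment can be checked locally. This is a minimal sketch under stated assumptions: `[System.IO.File]::WriteAllBytes` stands in for what `-OutFile` does (writing the response bytes without decoding them), and the sample word is illustrative rather than taken from the site.

```powershell
# Simulate a UTF-8 response body: the word "Informacion" with an accented o (U+00F3).
$bytes = [System.Text.Encoding]::UTF8.GetBytes('Informaci' + [char]0x00F3 + 'n')

$tempFile = [System.IO.Path]::GetTempFileName()
try {
    # Stand-in for -OutFile: the bytes go to disk untouched.
    [System.IO.File]::WriteAllBytes($tempFile, $bytes)

    # Decoding happens only here, with the encoding stated explicitly.
    $decoded = Get-Content -Path $tempFile -Raw -Encoding UTF8
}
finally {
    Remove-Item -Path $tempFile -Force
}

Write-Host $decoded
```

Because the only decoding step names UTF-8 explicitly, a missing or wrong charset in the response headers never gets a chance to influence the result.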