I need to get data from a page that has the win-1251 codepage.

$SiteAdress = "http://www.gisinfo.ru/download/download.htm"
$HttpContent = Invoke-WebRequest -URI $SiteAdress
echo $HttpContent

And it shows me:

> StatusCode        : 200
> StatusDescription : OK
> Content           : <!DOCTYPE html>
>                     <html><!-- #BeginTemplate "/Templates/panorama.dwt" --><!-- DW6 -->
>                     <head>
>                     <!-- #BeginEditable "doctitle" -->
>                     <title>ÃÈÑ ÏÀÍÎÐÀÌÀ - Ñêà÷àòü ïðîãðàììû</title>
>                     <meta name="keywords" con...
> RawContent        : HTTP/1.1 200 OK
>                     Transfer-Encoding: chunked
>                     Connection: keep-alive
>                     Keep-Alive: timeout=20
>                     Content-Type: text/html
>                     Date: Fri, 16 Oct 2015 12:40:45 GMT
>                     Server: nginx/1.5.7
>                     X-Powered-By: PHP/5.2.17...

The title should be in Cyrillic, but it comes out garbled. I also tried the variant below, but the result is the same.

$HttpContent = Invoke-WebRequest -URI $SiteAdress -ContentType "text/html; charset=windows-1251"
Peter

2 Answers

The -ContentType parameter to Invoke-WebRequest sets the content type for the request, not the response. Since you don't send any content with your request, it's irrelevant here.

I didn't find an easy way of enforcing a particular encoding for the response. Since the encoding is only specified within the HTML, and not the response header, there's little you can do here, I fear, as Invoke-WebRequest isn't smart enough to figure that out on its own.

You can, however, convert the text you read:

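filter Convert-Encoding {
  # Target encoding: windows-1251 (code page 1251)
  $1251 = [System.Text.Encoding]::GetEncoding(1251)
  # Turn the mis-decoded string back into bytes, then decode them as windows-1251.
  # This only round-trips if the system default encoding matches the one used to decode the response.
  $1251.GetString([System.Text.Encoding]::Default.GetBytes($_))
}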

$HttpContent.Content | Convert-Encoding

will then yield the proper Cyrillic text.

<!DOCTYPE html>
<html><!-- #BeginTemplate "/Templates/panorama.dwt" --><!-- DW6 -->
<head>
<!-- #BeginEditable "doctitle" -->
<title>ГИС ПАНОРАМА - Скачать программы</title>
<meta name="keywords" content="ГИС, карта, геодезия, картография, фотограмметрия, топография, электронная карта, классификатор, трехмерное моделирование, модель местности, карта Москвы, Ногинск, кадастр, межевое дело, Гаусс, эллипсоид Красовского, 1942, оротофотоснимок, WGS, растр, план, схема, бланковка, фотодокумент, земля, право, документация, map, sit, mtw, mtr, rsw, rsc, s57, s52, gis, 2003, 2004, Tool, Kit">
<meta name="description" content="Новые версии ГИС Карта 2000, GIS ToolKit , СУРЗ Земля и Право, документации, библиотек и примеров электронных карт">
<!-- #EndEditable -->
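
To see why the round trip works, here is a minimal sketch that needs no network access. It assumes the response was decoded as ISO-8859-1 (code page 28591); that encoding is my assumption for the demo, and the variable names are made up for illustration.

```powershell
# The bytes 0xC3 0xC8 0xD1 spell the first word of the page title in windows-1251.
[byte[]]$rawBytes = 0xC3, 0xC8, 0xD1

$cp1251 = [System.Text.Encoding]::GetEncoding(1251)
# Assumption: ISO-8859-1 stands in for whatever encoding mis-decoded the response.
$wrong = [System.Text.Encoding]::GetEncoding(28591)

# Decoding with the wrong encoding gives the garbled text seen in the question.
$garbled = $wrong.GetString($rawBytes)

# Re-encoding with that same wrong encoding recovers the original bytes,
# which can then be decoded as windows-1251.
$repaired = $cp1251.GetString($wrong.GetBytes($garbled))

$garbled
$repaired
```

The repair is only lossless when the wrong encoding maps every byte to a distinct character and back, which is why the encoding used for re-encoding has to be the one that did the original decoding.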

In any case, you need to know the exact encoding beforehand, regardless of how you solve it. You could try finding it in the HTML source, though:

[Regex]::Matches($HttpContent.Content, 'text/html;\s*charset=(?<encoding>[0-9a-z-]+)')

[System.Text.Encoding]::GetEncoding can cope with a string like windows-1251, at least.
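
Putting the two ideas together, a rough sketch of detecting the charset and decoding with it could look like this. It works on raw bytes rather than on the already-decoded string; the sample bytes, the variable names, and the UTF-8 fallback are all assumptions for illustration. With a real response the bytes might come from $HttpContent.RawContentStream.ToArray().

```powershell
# Stand-in for downloaded bytes: a small HTML fragment encoded as windows-1251.
$cp1251 = [System.Text.Encoding]::GetEncoding(1251)
$sample = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251"><title>' +
          (-join [char[]](0x0413, 0x0418, 0x0421)) + '</title>'
[byte[]]$rawBytes = $cp1251.GetBytes($sample)

# The charset declaration is plain ASCII, so an ASCII view is enough to find it.
$asciiView = [System.Text.Encoding]::ASCII.GetString($rawBytes)
$match = [Regex]::Match($asciiView, 'charset=["'']?(?<encoding>[0-9a-z_-]+)', 'IgnoreCase')

if ($match.Success) {
    # GetEncoding accepts names such as windows-1251.
    $encoding = [System.Text.Encoding]::GetEncoding($match.Groups['encoding'].Value)
} else {
    # Hypothetical fallback when the page declares nothing.
    $encoding = [System.Text.Encoding]::UTF8
}

# Decode the untouched bytes with the detected encoding.
$html = $encoding.GetString($rawBytes)
$html
```

GetEncoding throws for a name it does not recognize, so real code would probably want a try/catch around that call.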

Joey

My working variant:

$client = New-Object System.Net.WebClient
$url = "http://www.gisinfo.ru/download/download.htm"
$results = [System.Text.Encoding]::GetEncoding('windows-1251').GetString([Byte[]]$client.DownloadData($url))

Thanks to Joey for the help.