0

I am screen-scraping a website that is in Danish. I am unable to scrape certain characters, such as the å in "må". Any idea how to solve this? Thanks.

Marcelo Cantos
SAK
  • Will you show us the relevant code you're using to scrape the content? – Jan K. May 28 '10 at 11:33
  • What library/code are you using to scrape? – Darin Dimitrov May 28 '10 at 11:34
  • I am not using any libraries. I am just using regex. – SAK May 28 '10 at 11:41
  • Oh my... Take a look at this to see why you shouldn't be using regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Darin Dimitrov May 28 '10 at 11:43
  • @sam I don't know the structure of your website, but if you're scraping only one website and it's half-way decently built, I'd consider you chunking it and parsing the relevant information with a find-in-string function or similar. But of course, I have no clue what you're doing so I'm afraid we're of limited help until you fill us in :-) – Jan K. May 28 '10 at 11:47
  • An example: the website displays text like this: "Dine behov - vores mål!" When I right-click and view the source of the page, the same content displays like this: Dine behov - vores mål! I feel like it has to do with the charset. – SAK May 28 '10 at 12:05

5 Answers

1

Try decoding the page with the UTF-8 or Windows-1252 charset.

Zenzer
0

It's better to use the same encoding that the HttpWebResponse object reports. The code below should work for any language and character set, provided the server declares its charset correctly.

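        // Requires: using System.IO; using System.Net; using System.Text;
        // `request` is an HttpWebRequest created earlier, e.g. via WebRequest.Create(url).
        string html = null;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // CharacterSet may be null or empty if the server declares no charset;
            // falling back to UTF-8 here is an assumption, not part of the original answer.
            string charset = response.CharacterSet;
            Encoding encoding = string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);

            if (response.StatusCode == HttpStatusCode.OK)
            {
                // Decode the body with the declared encoding so characters like "å" survive.
                using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
                {
                    html = reader.ReadToEnd();
                }
            }
        }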
Ishti
0

If you are using a WebBrowser control, you can set the page encoding to one that can represent that character, then extract the page source.

seagulf
0

I just used System.Web.HttpContext.Current.Server.HtmlDecode(), and it works.
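
For code that runs outside an ASP.NET request (where HttpContext.Current is not available), a rough equivalent is System.Net.WebUtility.HtmlDecode. The sketch below assumes the page source contained the character as an HTML entity such as `&aring;` or `&#229;`. That is a guess based on the fix described here, and the `EntityDemo` class and sample strings are hypothetical.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Converts HTML character references (named or numeric) back into characters.
    public static string Decode(string rawHtml)
    {
        return WebUtility.HtmlDecode(rawHtml);
    }

    public static void Main()
    {
        // Hypothetical scraped text: the exact entity form on the real page is an assumption.
        Console.WriteLine(Decode("Dine behov - vores m&aring;l!"));
        Console.WriteLine(Decode("Dine behov - vores m&#229;l!"));
    }
}
```

Decoding after the regex has extracted the text keeps the match patterns simple, since they only ever see the raw entity form.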

SAK
0

I use iso-8859-1 for decoding. HTH
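
As a small illustration of why the choice of charset matters, the sketch below decodes the bytes for "mål" under two encodings. The `CharsetDemo` class and the byte arrays are hypothetical examples, not taken from the site in question.

```csharp
using System;
using System.Text;

public static class CharsetDemo
{
    // Decodes raw response bytes using the named charset.
    public static string DecodeAs(byte[] body, string charsetName)
    {
        return Encoding.GetEncoding(charsetName).GetString(body);
    }

    public static void Main()
    {
        byte[] latin1Body = { 0x6D, 0xE5, 0x6C };     // "mål" encoded as ISO-8859-1
        byte[] utf8Body = { 0x6D, 0xC3, 0xA5, 0x6C }; // "mål" encoded as UTF-8

        Console.WriteLine(DecodeAs(latin1Body, "iso-8859-1")); // matching charset
        Console.WriteLine(DecodeAs(utf8Body, "utf-8"));        // matching charset
        Console.WriteLine(DecodeAs(utf8Body, "iso-8859-1"));   // mismatch: å comes out as two characters
    }
}
```

If the scraped text shows two odd characters where one å is expected, the page is probably UTF-8 being read as a single-byte charset, and the reverse fix applies.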

Minh Le