screen scraping

Question

I am screen scraping a website that is in Danish. I am unable to scrape certain characters correctly, as in "må". Any idea how to solve this? Thanks.

Will you show us the relevant code you're using to scrape the content? — Jan K., May 28 '10 at 11:33
Oh my... Take a look at this why you shouldn't be using regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Darin Dimitrov, May 28 '10 at 11:43
@sam I don't know the structure of your website, but if you're scraping only one website and it's halfway decently built, I'd consider chunking it and parsing the relevant information with a find-in-string function or similar. But of course, I have no clue what you're doing, so I'm afraid we're of limited help until you fill us in :-) – Jan K., May 28 '10 at 11:47
An example: the website displays text like this: "Dine behov - vores mål!" When I right-click and view the source of the page, the same content displays like this: Dine behov - vores mål! I feel like it has to do with the charset. – SAK, May 28 '10 at 12:05

score 1 · Answer 1 · answered May 28 '10 at 11:56

Try decoding the page with the UTF-8 or Windows-1252 charset.
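To illustrate why the choice of charset matters, here is a minimal sketch that decodes the same bytes two ways. The class name `CharsetDemo`, the helper `Decode`, and the sample bytes are assumptions made for illustration; they are not taken from the asker's code. It also assumes .NET Core or .NET 5+, where legacy code pages such as windows-1252 have to be registered before use.

```csharp
using System;
using System.Text;

public static class CharsetDemo
{
    static CharsetDemo()
    {
        // On .NET Core / .NET 5+, legacy code pages such as windows-1252
        // must be registered before Encoding.GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: decodes raw bytes using the named charset.
    public static string Decode(byte[] bytes, string charsetName)
    {
        return Encoding.GetEncoding(charsetName).GetString(bytes);
    }

    public static void Main()
    {
        // "mål" as a windows-1252 page would send it: 'å' is the single byte 0xE5.
        byte[] raw = { 0x6D, 0xE5, 0x6C };

        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Decode(raw, "windows-1252")); // mål
        // 0xE5 on its own is not valid UTF-8, so the 'å' is lost here.
        Console.WriteLine(Decode(raw, "utf-8"));
    }
}
```

If the wrong charset is used in the other direction (UTF-8 bytes read as windows-1252), the "å" typically shows up as two characters, "Ã¥", which is a quick way to recognise the mismatch.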

Zenzer

Thanks, I tried that. The website here uses "windows-1252" encoding. – SAK May 28 '10 at 12:07

score 0 · Answer 2 · answered Oct 13 '12 at 13:55

It's better to use the same encoding that the HttpWebResponse object reports. The code below should work with most languages and characters, provided the server declares its charset correctly.

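        // Requires: using System.IO; using System.Net; using System.Text;
        // 'request' is assumed to be an HttpWebRequest created earlier.
        string html = null;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // CharacterSet may be null or empty if the server declares no
            // charset; falling back to UTF-8 here is an assumption.
            string charset = response.CharacterSet;
            Encoding encoding = string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);

            if (response.StatusCode == HttpStatusCode.OK)
            {
                // Decode the body with the charset the server reported.
                using (var reader = new StreamReader(response.GetResponseStream(), encoding))
                {
                    html = reader.ReadToEnd();
                }
            }
        }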

score 0 · Answer 3 · answered May 29 '10 at 01:35

If you are using a WebBrowser control, you can set the page encoding to one that can represent that character, then extract the page source.

seagulf

score 0 · Accepted Answer · answered Jun 01 '10 at 13:26

I just used System.Web.HttpContext.Current.Server.HtmlDecode() and it works.
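For readers outside an ASP.NET request, where HttpContext.Current is not available, a similar decoding step can be sketched with System.Net.WebUtility.HtmlDecode. The sample string below is an assumption: the thread does not preserve the actual page source, so an HTML entity for "å" is used purely for illustration, and the class name `EntityDemo` is hypothetical.

```csharp
using System;
using System.Net;
using System.Text;

public static class EntityDemo
{
    public static void Main()
    {
        // Hypothetical page source: the real markup is not shown in the thread.
        string source = "Dine behov - vores m&aring;l!";

        // Converts HTML entities (named or numeric) back to characters.
        string text = WebUtility.HtmlDecode(source);

        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(text); // Dine behov - vores mål!
    }
}
```

This only helps when the characters arrive as HTML entities. If the bytes themselves were decoded with the wrong charset, entity decoding will not repair them.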

SAK

score 0 · Answer 5 · answered Aug 03 '11 at 20:50

I use iso-8859-1 for decoding. HTH

Minh Le
