I am screen scraping a website that is in Danish. I am unable to scrape certain characters correctly, such as the "å" in "må". Any idea how to solve this? Thanks.
-
Will you show us the relevant code you're using to scrape the content? – Jan K. May 28 '10 at 11:33
-
What library/code are you using to scrape? – Darin Dimitrov May 28 '10 at 11:34
-
I am not using any library; I am just using regex. – SAK May 28 '10 at 11:41
-
Oh my... Take a look at this to see why you shouldn't be using regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Darin Dimitrov May 28 '10 at 11:43
-
@sam I don't know the structure of your website, but if you're scraping only one website and it's halfway decently built, I'd consider chunking it and parsing the relevant information with a find-in-string function or similar. But of course, I have no clue what you're doing, so I'm afraid we're of limited help until you fill us in :-) – Jan K. May 28 '10 at 11:47
-
An example: the website displays text like this: "Dine behov - vores mål!" When I right-click and view the source of the page, the same content displays like this: Dine behov - vores mål! I feel like it has to do with the charset. – SAK May 28 '10 at 12:05
5 Answers
It's better to use the same encoding that the HttpWebResponse object reports. The code below should handle most languages and characters, provided the server declares its charset correctly:
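// Requires: using System.IO; using System.Net; using System.Text;
// 'request' is an HttpWebRequest created elsewhere, e.g. with WebRequest.Create(...).
string html = null;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        // CharacterSet is taken from the Content-Type header; fall back to UTF-8 if it is empty.
        string charset = response.CharacterSet;
        Encoding encoding = string.IsNullOrEmpty(charset) ? Encoding.UTF8 : Encoding.GetEncoding(charset);
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), encoding))
        {
            html = reader.ReadToEnd();
        }
    }
}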
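
To see why the choice of encoding matters, here is a minimal sketch (not the asker's code). It assumes the server sends UTF-8 bytes and the scraper decodes them as ISO-8859-1; both charsets are assumptions for illustration, and `CharsetDemo` and `Decode` are hypothetical names.

```csharp
using System;
using System.Text;

static class CharsetDemo
{
    // Hypothetical helper: decode raw response bytes with the named charset.
    public static string Decode(byte[] bytes, string charsetName)
    {
        return Encoding.GetEncoding(charsetName).GetString(bytes);
    }

    static void Main()
    {
        // "mål" from the question, written with an escape so the source file's encoding doesn't matter.
        string original = "m\u00E5l";

        // Assumption: the server encodes the page as UTF-8 ("å" becomes two bytes).
        byte[] bytes = Encoding.UTF8.GetBytes(original);

        // Decoding with a different charset turns "å" into two unrelated characters.
        string wrong = Decode(bytes, "ISO-8859-1");

        // Decoding with the charset the server actually used restores the text.
        string right = Decode(bytes, "utf-8");

        Console.WriteLine(wrong == original); // False
        Console.WriteLine(right == original); // True
    }
}
```

If the header charset is missing or wrong, the page's own meta charset declaration is another place to look, but that goes beyond this sketch.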

Ishti
If you are using a WebBrowser control, you can set the page encoding to one that can represent that character, and then extract the page source.

seagulf