I have a problem cleaning up a string with Regex. I wrote this function:
    private String parseAnswer(String res)
    {
        String[] pattern = new String[16] {
            "<head[^>]*?>.*?</head>", "<style[^>]*?>.*?</style>", "<script[^>]*?.*?</script>",
            "<object[^>]*?.*?</object>", "<embed[^>]*?.*?</embed>", "<applet[^>]*?.*?</applet>",
            "<noframes[^>]*?.*?</noframes>", "<noscript[^>]*?.*?</noscript>", "<noembed[^>]*?.*?</noembed>",
            "</?((address)|(blockquote)|(center)|(del))", "</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))",
            "</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))", "</?((table)|(th)|(td)|(caption))",
            "</?((form)|(button)|(fieldset)|(legend)|(input))", "</?((label)|(select)|(optgroup)|(option)|(textarea))",
            "</?((frameset)|(frame)|(iframe))"
        };
        String[] replacement = new String[16] { " ", " ", " ", " ", " ", " ", " ", " ", " ", "\n$0", "\n$0", "\n$0", "\n$0", "\n$0", "\n$0", "\n$0" };
        for (int i = 0; i < pattern.Length; i++)
        {
            res = Regex.Replace(res, pattern[i], replacement[i]);
        }
        return res;
    }
This function takes HTML source as input. I want to strip some of the HTML tags, so I prepared an array of patterns, but the function does not actually clean the HTML. The patterns list the HTML tags I want to remove. For some tags I don't remove anything and only insert \n in front of them.
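One thing that may explain part of the problem: in .NET, `.` does not match a newline unless RegexOptions.Singleline is set, and matching is case-sensitive by default, so a block such as <script> that spans several lines or is written as <SCRIPT> is never matched by the patterns above. Below is a minimal sketch of a single replacement with both options set. The class and method names (HtmlStripSketch, RemoveBlock) are hypothetical and not part of my function, and this is not a complete HTML cleaner.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripSketch
{
    // Hypothetical helper: replaces one whole <tag ...>...</tag> block with a space.
    public static string RemoveBlock(string html, string tagName)
    {
        // Escape the tag name so it is treated as literal text inside the pattern.
        string name = Regex.Escape(tagName);

        // \b stops "<p" from also matching "<pre"; .*? is lazy so it ends at the first closing tag.
        string pattern = "<" + name + @"\b[^>]*>.*?</" + name + @"\s*>";

        // Singleline lets '.' match '\n'; IgnoreCase also matches upper-case tags.
        return Regex.Replace(html, pattern, " ",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

For example, HtmlStripSketch.RemoveBlock(res, "script") could be called once per block-level tag instead of the first nine entries of the pattern array.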
Can you help me with this regex, or suggest a library for this task? My goal is to remove the HTML tags so that only the text of the website is left to parse.
EDIT: OK, I can use HtmlAgilityPack, but I have a few questions. First, after htmlDoc.LoadHtml(URL); I need to convert the result to UTF-8. Does HtmlAgilityPack have a function for that conversion? Second, I want to put the result of InnerText into JSON and send it to JavaScript. How can I remove the characters that are forbidden in JavaScript?
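For the second question, here is a minimal sketch of what I have in mind, assuming System.Text.Json is available (.NET Core 3.0 or later). A JSON serializer escapes quotes, backslashes and control characters on its own, so removing characters by hand may not be necessary. The names JsonTextSketch, ToJson and the "text" property are hypothetical.

```csharp
using System;
using System.Text.Json;

public static class JsonTextSketch
{
    // Hypothetical helper: wraps the extracted page text in a JSON object.
    public static string ToJson(string innerText)
    {
        // The serializer escapes quotes, backslashes and control characters,
        // so the result is a valid JSON document that JavaScript can parse.
        return JsonSerializer.Serialize(new { text = innerText });
    }
}
```

On the JavaScript side the string could then be read back with JSON.parse, which restores the original text.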