I'm having trouble cleaning up some HTML with a JavaScript regex replace. The task is to get a TV listing for my XBMC from a local source. The URL is http://tv.dir.bg/tv_search.php?step=1&all=1 (in Bulgarian). I'm trying to use a scraper to get the data: http://code.google.com/p/epgss/ (credits to Ivan Markov, http://code.google.com/u/113542276020703315321/). Unfortunately, the TV listings page has changed since the tool was last updated, so I'm trying to get it working again.

The problem is that parsing the HTML as XML fails. I'm now trying to clean the HTML up a bit by regex-replacing the head and script tags with an empty string, but the replacement has no effect. Here's my replacer:
function regexReplace(pattern, value, replacer)
{
    var regEx = new RegExp(pattern, "g");
    var result = value.replaceAll(regEx, replacer);
    if(result == null)
        return null;
    return result;
}
And here's my call:
var htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251");
log("Content grabbed (schedule for next 7 days)");
log(url);
var htmlString = regexReplace("<head>([\\s\\S]*?)<\/head>|<script([\\s\\S]*?)<\/script>", htmlStringCluttered, "");
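To separate the pattern from the environment, here is a minimal standalone sketch that applies the same pattern to a small made-up sample through JavaScript's own `String.prototype.replace`. The helper name `stripHeadAndScripts` and the sample markup are hypothetical, not part of the scraper, and the `String(...)` conversion assumes the input might be a host (for example Java) string rather than a native JavaScript one.

```javascript
// Hypothetical helper: same pattern as in the call above, but applied with
// JavaScript's native String.prototype.replace instead of value.replaceAll.
function stripHeadAndScripts(html) {
    var pattern = new RegExp("<head>([\\s\\S]*?)</head>|<script([\\s\\S]*?)</script>", "g");
    // String(html) forces a native JavaScript string (assumption: html may be a host string)
    return String(html).replace(pattern, "");
}

// Made-up sample input, not taken from the real listings page
var sample = "<html><head><title>TV</title></head><body>" +
             "<script type=\"text/javascript\">var a = 1;</script>" +
             "<p>News</p></body></html>";

var cleaned = stripHeadAndScripts(sample);
// cleaned is "<html><body><p>News</p></body></html>"
```

On this sample the pattern removes both blocks, so the pattern text itself seems to behave as intended, at least when the page uses a bare `<head>` tag without attributes. That does not show what `value.replaceAll` does with the `RegExp` object in the scraper's environment.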
The getHTML function comes from the original source, with my minor modification of setting the User-Agent. Here is the method it is based on:
public static java.io.Reader open(URL url, String charset) throws UnsupportedEncodingException, IOException
{
    URLConnection con = url.openConnection();
    con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.38 Safari/532.0");
    con.setAllowUserInteraction(false);
    con.setReadTimeout(60*1000/*ms*/);
    con.connect();
    if(charset == null && con instanceof HttpURLConnection) {
        HttpURLConnection httpCon = (HttpURLConnection)con;
        charset = httpCon.getContentEncoding();
    }
    if(charset == null)
        charset = "UTF-8";
    return new InputStreamReader(con.getInputStream(), charset);
}
The result of regexReplace is exactly the same as the original string. Since the HTML then cannot be parsed as XML, the script cannot read the elements. Any ideas?