
I'm having trouble cleaning up some HTML with a JavaScript regex replace. The task is to get a TV listing for my XBMC from a local source. The URL is http://tv.dir.bg/tv_search.php?step=1&all=1 (in Bulgarian). I'm trying to use a scraper to get the data: http://code.google.com/p/epgss/ (credits to Ivan Markov, http://code.google.com/u/113542276020703315321/). Unfortunately, the TV listings page has changed since the tool was last updated, so I'm trying to get it working again. The problem is that parsing the HTML as XML fails. I'm now trying to clean the HTML a bit by regex-replacing the head and script tags, but that does not work either. Here's my replacer:

function regexReplace(pattern, value, replacer) 
{  
var regEx = new RegExp(pattern, "g");  
var result = value.replaceAll(regEx, replacer);  
if(result == null)  
return null;  
return result;  
}  

And here's my call:

var htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251");  
log("Content grabbed (schedule for next 7 days)");  
log(url);  
var htmlString = regexReplace("<head>([\\s\\S]*?)<\/head>|<script([\\s\\S]*?)<\/script>", htmlStringCluttered, "");  

The getHTML function comes from the original source, with my minor modification of setting the User-Agent. Here is its base:

    public static java.io.Reader open(URL url, String charset) throws UnsupportedEncodingException, IOException  
    {
    URLConnection con = url.openConnection();
    con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.38 Safari/532.0");
    con.setAllowUserInteraction(false);
    con.setReadTimeout(60*1000/*ms*/);

    con.connect();

    if(charset == null && con instanceof HttpURLConnection) {
        HttpURLConnection httpCon = (HttpURLConnection)con;
        charset = httpCon.getContentEncoding();
    }

    if(charset == null)
        charset = "UTF-8";

    return new InputStreamReader(con.getInputStream(), charset);
    }

The result of regexReplace is exactly the same as the original string. And since the XML cannot be parsed, the script cannot read the elements. Any ideas?

nnikolov06
  • Well, we all know that [parsing HTML with regex isn't a good idea](http://stackoverflow.com/a/1732454/422184), so let's try and figure out what the real problem is. Why can't the XML be parsed? What is your limitation on that? – LoveAndCoding Oct 27 '12 at 18:22
  • Actually, no idea. The call in the js is var html = new XML(Utils.trim(htmlString.substring(39))); // Bug 336551 The return is a plain string. – nnikolov06 Oct 27 '12 at 18:27
  • Here's the console output (including log(html)) - http://pastebin.com/Uf08iNyx – nnikolov06 Oct 27 '12 at 18:35
  • Wait, what is it you are trying to accomplish? And what language are you trying to do it in? Some of this code is Java, some is JS; which part are you trying to do in which? – LoveAndCoding Oct 27 '12 at 18:39
  • I know. I'm trying to make that code run. It's part Java and part JavaScript. The JavaScript parses the HTML, and the Java calls the JS and should write the output to an XMLTV-formatted XML file. The problem is (IMHO) that the JS does not parse the HTML correctly. – nnikolov06 Oct 27 '12 at 18:54

1 Answer


UPDATE:

To convert this to an XMLDocument, you can do the following:

var parseXml,
    xml,
    htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
    htmlString = '';

if (typeof window.DOMParser != "undefined") {
    parseXml = function (xmlStr) {
        return (new window.DOMParser()).parseFromString(xmlStr, "text/xml");
    };
} else if (typeof window.ActiveXObject != "undefined" && new window.ActiveXObject("Microsoft.XMLDOM")) {
    parseXml = function (xmlStr) {
        var xmlDoc = new window.ActiveXObject("Microsoft.XMLDOM");
        xmlDoc.async = false; // load synchronously
        xmlDoc.loadXML(xmlStr);
        return xmlDoc;
    };
} else {
    throw new Error("No XML parser found");
}

console.log("Content grabbed (schedule for next 7 days)");
console.log(url);

//eliminate the '<head>' section
htmlString = htmlStringCluttered.replace(/(<head[\s\S]*<\/head>)/ig, '');

//eliminate any remaining '<script>' elements
htmlString = htmlString.replace(/(<script[\s\S]+?<\/script>)/ig, '');

//self-close '<img>' elements
htmlString = htmlString.replace(/<img([^>]*)>/g, '<img$1 />');

//self-close '<br>' elements
htmlString = htmlString.replace(/<br([^>]*)>/g, '<br$1 />');

//self-close '<input>' elements
htmlString = htmlString.replace(/<input([^>]*)>/g, '<input$1 />');

//replace '&nbsp;' entities with an actual non-breaking space
htmlString = htmlString.replace(/&nbsp;/g, String.fromCharCode(160));

//convert to XMLDocument
xml = parseXml(htmlString);

//log new XMLDocument as output
console.log(xml);

//log htmlString as output
console.log(htmlString);

Credit where credit is due: the parseXml function was found at "XML parsing of a variable string in JavaScript".

You can test this in the browser (I did :) ) simply by defining htmlStringCluttered as:

htmlStringCluttered = document.documentElement.innerHTML;

instead of:

htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),

and running it in the console at http://tv.dir.bg/tv_search.php?step=1&all=1

You will also have to either comment out the line:

console.log(url);

or declare url and give it a value.
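
The cleanup steps can also be tried outside the browser, on a plain string. What follows is only a sketch: `cleanHtmlForXml` and the sample markup are hypothetical names and data made up for illustration, not part of the scraper. It is also a slight variation on the replacements above, in that the self-closing step leaves alone any tag that already ends in `/>`.

```javascript
// Hypothetical helper (not part of the original scraper): applies the
// cleanup steps to a plain string and returns the cleaned string.
function cleanHtmlForXml(html) {
    return String(html)
        // drop the <head> section (lazy match: stops at the first </head>)
        .replace(/<head\b[\s\S]*?<\/head>/ig, '')
        // drop any remaining <script> elements
        .replace(/<script\b[\s\S]*?<\/script>/ig, '')
        // self-close <img>, <br> and <input>; a trailing "/" that is already
        // there is consumed, so "<br/>" does not turn into "<br/ />"
        .replace(/<(img|br|input)\b([^>]*?)\s*\/?>/ig, '<$1$2 />')
        // replace '&nbsp;' entities with an actual non-breaking space
        .replace(/&nbsp;/g, String.fromCharCode(160));
}

// Made-up sample markup, only to exercise the function.
var sample = '<html><head><title>t</title></head><body>' +
    '<script type="text/javascript">var a = 1;</script>' +
    '<img src="a.png"><br><br/><input type="text">&nbsp;</body></html>';

console.log(cleanHtmlForXml(sample));
```

The result still has to go through parseXml; this only covers the string cleanup, and real pages may contain other constructs (unquoted attributes, other unclosed tags) that it does not handle.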

Original:

Your RegExp needed some work, and it's much simpler (and easier to read) when broken into two replace statements:

var htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
    htmlString = '';
console.log("Content grabbed (schedule for next 7 days)");
console.log(url);
//eliminate the '<head>' section
htmlString = htmlStringCluttered.replace(/(<head[\s\S]*<\/head>)/ig, '');
//eliminate any remaining '<script>' elements
htmlString = htmlString.replace(/(<script[\s\S]+?<\/script>)/ig, '');
//log remaining as output
console.log(htmlString);

This was tested in the console by visiting http://tv.dir.bg/tv_search.php?step=1&all=1 and running the following in the console:

console.log(document.documentElement.innerHTML.replace(/(<head[\s\S]*<\/head>)/ig, '').replace(/(<script[\s\S]+?<\/script>)/ig, ''));

If this is run on the outerHTML property (as I expect the HTML.getHTML(new URL(url), "WINDOWS-1251") method to return), then the <body> element will be wrapped in:

<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        ...
    </body>
</html>
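
A note on the quantifiers in the two patterns: the `<head>` pattern uses a greedy `[\s\S]*`, which is fine as long as the page has a single `</head>`, while the `<script>` pattern uses the lazy `[\s\S]+?` because there are usually several `<script>` elements. The sketch below uses a made-up sample string (an assumption, not taken from the listings page) to show what a greedy match would do to the content between two scripts.

```javascript
// Made-up sample: two <script> elements around one <p>.
var sample = '<script>a()</script><p>keep</p><script>b()</script>';

// Greedy: [\s\S]* runs to the LAST </script>, so the <p> is removed too.
var greedy = sample.replace(/<script[\s\S]*<\/script>/ig, '');

// Lazy: [\s\S]*? stops at the FIRST </script> after each opening tag.
var lazy = sample.replace(/<script[\s\S]*?<\/script>/ig, '');

console.log(JSON.stringify(greedy)); // ""
console.log(JSON.stringify(lazy));   // "<p>keep</p>"
```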
pete
  • Thank you. The regexed output is absolutely correct. Thing is it still won't be parsed to xml, but I guess that's a completely different matter. – nnikolov06 Oct 28 '12 at 12:11
  • @nnikolov06: Updated answer to convert to XMLDocument. – pete Oct 28 '12 at 13:42
  • Thanks! I was having issues with unclosed tags. It works great. Now I just added some div tags around text elements that were on their own in the td's so I could iterate easier. Thanks a lot for the help! – nnikolov06 Oct 28 '12 at 15:01