
I am writing a crawler/parser that should be able to process different types of content, namely RSS, Atom, and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content type, and returns the correct parser.

Unfortunately, checking the content type using the method provided by URLConnection doesn't always work. For example,

String contentType = url.openConnection().getContentType();

doesn't always provide the correct content type (e.g. "text/html" where it should be RSS), or doesn't make it possible to distinguish between RSS and Atom (e.g. "application/xml" could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream. The problem is that I am having trouble coming up with an elegant class design in which I need to download the InputStream only once. In my current design I have written a separate class that first determines the correct content type; next, the ParseFactory uses this information to create an instance of the corresponding parser, which in turn, when the method 'parse()' is called, downloads the entire InputStream a second time.

public Parser createParser(){

    InputStream inputStream = null;
    String contentType = null;
    String contentEncoding = null;

    ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
    Parser parser = null;

    try {

        inputStream = new BufferedInputStream(this.url.openStream());
        contentTypeParser.parse(inputStream);
        contentType = contentTypeParser.getContentType();
        contentEncoding = contentTypeParser.getContentEncoding();

        assert (contentType != null);

        inputStream = new BufferedInputStream(this.url.openStream());

        if (contentType.equals(ContentTypes.rss))
        {
            logger.info("RSS feed detected");
            parser = new RssParser(this.url);
        }
        else if (contentType.equals(ContentTypes.atom))
        {
            logger.info("Atom feed detected");
            parser = new AtomParser(this.url);
        }
        else if (contentType.equals(ContentTypes.html))
        {
            logger.info("html detected");
            parser = new HtmlParser(this.url);
            parser.setContentEncoding(contentEncoding);
        }
        else if (contentType.equals(ContentTypes.UNKNOWN))
            logger.debug("Unable to recognize content type");

        if (parser != null)
            parser.parse(inputStream);

    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (inputStream != null)
                inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    return parser;

}

Basically, I am looking for a solution that allows me to eliminate the second "inputStream = new BufferedInputStream(this.url.openStream())".

Any help would be greatly appreciated!

Side note 1: Just for the sake of being complete, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but this returns null way too often.

Side note 2: The XML-parsers (Atom and Rss) are based on SAXParser, the Html-parser on Jsoup.

jerraes

2 Answers


Can you just call mark and reset?

inputStream = new BufferedInputStream(this.url.openStream());
inputStream.mark(2048); // Or some other sensible number

contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();

inputStream.reset(); // Let the parser have a crack at it now
Jon Skeet
  • No, I tried this method as well and unfortunately this method doesn't work. I keep on getting an IOException ("stream closed") – jerraes Jul 21 '11 at 10:48
  • @jerraes: Where do you get the exception? Is your ContentTypeParser closing the stream? Please include things you've tried with the *detail* of what went wrong within the question. – Jon Skeet Jul 21 '11 at 10:49
  • No, ContentTypeParser does not close the stream -- the stream only gets closed at the end of the "createParser()" method in the ParserFactory. I was a bit too hasty with my previous comment. When I place a mark I get the following error: java.io.IOException: Resetting to invalid mark at java.io.BufferedInputStream.reset(BufferedInputStream.java:416) at dataAcquisition.Parser.ParseFactory.createParser(ParseFactory.java:65) at run.Main.main(Main.java:45) Exception in thread "main" java.lang.NullPointerException at run.Main.main(Main.java:46) – jerraes Jul 21 '11 at 11:41
  • @jerraes: That suggests you didn't set the mark limit high enough - the ContentTypeParser is probably reading more than 2K of the stream. – Jon Skeet Jul 21 '11 at 11:53
  • @jerraes: Not sure why you've put the same comment again - basically I suspect you haven't given a high enough argument to the `mark` call. – Jon Skeet Jul 21 '11 at 11:54
  • I tried raising the mark limit all the way up to the maximum value of int, but I still get the same problem. The site I am using is 74 KB on disk. – jerraes Jul 21 '11 at 12:03
  • @jerraes: How much of it is contentTypeParser reading? Have you tried debugging into this code to work out what's going on? You may also want to use the BufferedInputStream constructor which sets the buffer size. – Jon Skeet Jul 21 '11 at 12:25
  • The contentTypeParser is a SAXParser that basically looks for clues in the InputStream to determine the contentType. It does this by inspecting the first node (and only the first node): ContentTypeParser extends the class DefaultHandler and overrides the method startDocument, where the boolean firstNode is set to true. Because of this, the method startElement (also overridden) knows it needs to inspect this node (and does nothing else). Could it be that because the SAXParser runs the InputStream all the way to the end (?) that I am unable to reset it later? Again thanks for your help! – jerraes Jul 21 '11 at 13:44
  • @jerraes: Yes, it sounds like it is indeed reading the whole input stream, which would cause a problem. You *either* need to buffer the whole stream in memory, or abort the SAX parsing as soon as you've got the result. You may need to throw an exception to do this - I'm not sure. – Jon Skeet Jul 21 '11 at 13:48
  • Throwing an exception works. I am not sure how elegant this solution is -- throwing exceptions around to alter the control flow seems like a hack to me :p See also http://stackoverflow.com/questions/1345293/how-to-stop-parsing-xml-document-with-sax-at-any-time and http://www.ibm.com/developerworks/xml/library/x-tipsaxstop/ – jerraes Jul 21 '11 at 17:57
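
For reference, the combination the comments above converge on (mark the stream, stop the SAX parse at the first element by throwing, then reset) might look roughly like the sketch below. This is not the asker's ContentTypeParser: the class RootElementSniffer, the method sniffRoot and the readLimit parameter are hypothetical names made up for illustration. It also assumes the parser may try to close its input, so the stream is wrapped in a guard whose close() does nothing. The read limit should be generous, since a SAX parser typically reads ahead in chunks rather than byte by byte.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not the asker's ContentTypeParser.
final class RootElementSniffer {

    /** Remembers the name of the first element, then aborts the parse. */
    private static final class FirstElementHandler extends DefaultHandler {
        String rootName; // stays null if no element was ever reached

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            rootName = localName.isEmpty() ? qName : localName;
            throw new SAXException("root element found, stopping early");
        }
    }

    /**
     * Returns the root element name (e.g. "rss" or "feed"), or null if the
     * content is not well-formed XML up to its first element. On return the
     * stream is rewound to the position it had on entry.
     */
    static String sniffRoot(BufferedInputStream in, int readLimit) throws IOException {
        FirstElementHandler handler = new FirstElementHandler();
        // Assumption: the parser might close its input, so shield the real stream.
        InputStream guard = new FilterInputStream(in) {
            @Override
            public void close() { /* deliberately a no-op */ }
        };
        in.mark(readLimit);
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(guard, handler);
        } catch (SAXException e) {
            // Either the deliberate stop above or a genuine parse error;
            // handler.rootName is null in the latter case.
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        in.reset(); // may throw IOException if the parser read past readLimit
        return handler.rootName;
    }

    public static void main(String[] args) throws IOException {
        byte[] feed = "<rss version=\"2.0\"><channel/></rss>".getBytes(StandardCharsets.UTF_8);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(feed));
        System.out.println(sniffRoot(in, 64 * 1024)); // detected root element name
        System.out.println((char) in.read());         // first character after the rewind
    }
}
```

In createParser(), the same BufferedInputStream could then be handed to the chosen parser after the call to sniffRoot, which would remove the second url.openStream().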

Perhaps your ContentTypeParser should cache the content internally and feed it to the appropriate ContentParser instead of reacquiring data from InputStream.

Vlad
  • You mean cache it in a String? Thought of that as well, but it would mean that I have to convert this String back to an InputStream in order to make it work with the SAXParser. I guess this is still the best option if I can't make it work any other way. – jerraes Jul 21 '11 at 11:31
  • Converting the `String` to `InputStream` is very easy: `InputStream is = new ByteArrayInputStream(str.getBytes("UTF-8"));` However, see Jon's comments about `mark` limit being too low: there is no other reason to get invalid mark except going past the invalidation limit: http://download.oracle.com/javase/6/docs/api/java/io/InputStream.html#mark%28int%29 – Vlad Jul 21 '11 at 12:05
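
A minimal sketch of the caching idea, assuming the whole document fits comfortably in memory: download the body once into a byte array and hand out a fresh ByteArrayInputStream for each pass. BufferedContent, readFrom and newStream are hypothetical names, not part of any existing API. Keeping raw bytes rather than a String avoids having to guess the character encoding before the XML parser has seen the encoding declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical holder for a fully downloaded response body.
final class BufferedContent {
    private final byte[] bytes;

    private BufferedContent(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Drains the stream once; the caller remains responsible for closing it. */
    static BufferedContent readFrom(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new BufferedContent(out.toByteArray());
    }

    /** Each call returns an independent stream positioned at the first byte. */
    InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    /** Number of bytes that were downloaded. */
    int size() {
        return bytes.length;
    }
}
```

With something like this, the factory would call url.openStream() once, pass content.newStream() to the content-type detection, and pass a second content.newStream() to whichever parser is chosen.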