1

I have a requirement where a huge HTML file must be read and displayed in the front-end of my application. The HTML file size is around 25MB. I tried several options:

Option 1:
    try (Scanner scnr = new Scanner(file)) {
        while (scnr.hasNextLine()) {
            String line = scnr.nextLine();
        }
    }
Option 2:
    FileUtils.readFileToString(file, "UTF-8");
Option 3:
    IOUtils.toString(new FileInputStream(new File(file)), "UTF-8");

All three options fail to read the file. I see no error: the processing just stops and the webpage shows an "error" popup with no information.

The problem seems to be that the entire content of the HTML file is on a single line, so it is read as one very large string.

Is there a way in which I can read this file?

I went through several other questions here to see if there is a possible solution, but nothing seems to be working for this case.

user811433
  • Java's SAX-parser package is pretty nice. I have used it and it is extremely fast and simple. It parses any XML, so should work fine for HTML. – RaminS Oct 25 '16 at 17:57
  • @Gendarme That's horrible advice. It would also require XHTML, SAX won't parse HTML. – Kayaman Oct 25 '16 at 18:00
  • Why would it not parse HTML? – RaminS Oct 25 '16 at 18:02
  • @user811433 Do you really need to read all 25MB at once? That's expensive, to say the least... – Lucas Oliveira Oct 25 '16 at 18:02
  • @Gendarme Because HTML is not XML. – Kayaman Oct 25 '16 at 18:02
  • As for the question, it's always a bad idea to read a lot of things into memory. A better way would be to stream it out, but there's not enough code shown to provide an answer. – Kayaman Oct 25 '16 at 18:03
  • @Kayaman What in HTML does not follow the XML syntax? – RaminS Oct 25 '16 at 18:03
  • @Gendarme Is this your question? Is Google blocked for you? – Kayaman Oct 25 '16 at 18:08
  • Since you claim that the suggested XML parser doesn't work, it is appropriate to say why. Surely it cannot be hard to say how XML syntax does not apply to HTML. – RaminS Oct 25 '16 at 18:13
  • From the information provided in the question we cannot know whether or not the HTML in the file follows XML syntax. Most HTML I encounter elsewhere does not, but some does. – Ole V.V. Oct 25 '16 at 18:49
  • @Kayaman are you suggesting the usage of BufferedReader when you say "stream it out"? – user811433 Oct 25 '16 at 19:24
  • @LucasOliveira yes, all of the html file must be displayed in the application. However if there is a way in which I can read parts of the HTML, process it and then read the next part that would work too. – user811433 Oct 25 '16 at 19:25
  • @user811433 A reader and a writer (or inputstream/outputstream), yes. You're giving almost no information in your question. Also, if you're not seeing any errors, fix your logging. – Kayaman Oct 25 '16 at 19:27
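
The "reader and a writer" idea from the comments can be sketched as a fixed-size buffer copy. This is only an illustration, not code from the thread: the class name `ChunkedCopy`, the helper `streamFile` and the 8192-character buffer are assumptions, and in a web application the `Writer` would presumably be the response writer rather than a `StringWriter`. Because it reads a fixed number of characters at a time instead of whole lines, it should not matter that the HTML sits on a single line. The demo in `main` uses `Files.writeString` and `String.repeat`, so it assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ChunkedCopy {

    /** Copies all characters from in to out through a fixed-size buffer; returns the count. */
    static long copy(Reader in, Writer out) throws IOException {
        char[] buffer = new char[8192]; // buffer size is an arbitrary choice
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // only the first n chars of the buffer are valid
            total += n;
        }
        out.flush();
        return total;
    }

    /** Hypothetical helper: streams a UTF-8 file to the given writer in chunks. */
    static long streamFile(Path file, Writer out) throws IOException {
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return copy(in, out);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo input: a temporary file whose content has no line breaks at all.
        Path tmp = Files.createTempFile("single-line", ".html");
        try {
            Files.writeString(tmp, "<p>x</p>".repeat(10_000));
            StringWriter sink = new StringWriter(); // stand-in for the real output
            long copied = streamFile(tmp, sink);
            System.out.println("copied " + copied + " chars");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Only one buffer of characters is held on the reading side at any time; whether the receiving side can cope with 25MB of HTML is a separate question that this sketch does not address.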

2 Answers

1

@user811433, I did some testing with Apache Commons IO, reading a log file of around 800MB, and no error occurred during execution.

FileUtils.lineIterator opens an InputStream for the file. When you have finished with the iterator, you should close the stream to free internal resources. This can be done by calling the LineIterator.close() or LineIterator.closeQuietly(LineIterator) method.

If you process the file line by line, like a stream, the recommended usage pattern is something like this:

    // Imports: java.io.File, org.apache.commons.io.FileUtils, org.apache.commons.io.LineIterator
    File file = new File("C:\\Users\\lucas\\Desktop\\file-with-800MB.log");

    LineIterator it = FileUtils.lineIterator(file, "UTF-8");
    try {
        while (it.hasNext()) {
            String line = it.nextLine();
            // do something with the line; here it is just printed
            System.out.println(line);
        }
    } finally {
        LineIterator.closeQuietly(it);
    }

Some extra references, here and here
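
For comparison, the same open, iterate, close pattern can be sketched with the JDK alone. This is not part of the Commons IO approach above: it swaps in `Files.lines` with try-with-resources, and the class name `LineByLine` and the helper `forEachLine` are made up for illustration. Like any line-based reader, it only keeps memory low when the file actually contains line breaks; a file that is one enormous line still arrives as one enormous string.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.function.Consumer;
import java.util.stream.Stream;

class LineByLine {

    /** Hypothetical helper: hands each line to the handler and returns the number of lines. */
    static long forEachLine(Path file, Consumer<String> handler) throws IOException {
        // try-with-resources closes the underlying file when the stream is done
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            long[] count = {0}; // single-element array so the lambda can update it
            lines.forEach(line -> {
                handler.accept(line);
                count[0]++;
            });
            return count[0];
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo input: a small temporary file with three lines.
        Path tmp = Files.createTempFile("demo", ".log");
        try {
            Files.write(tmp, Arrays.asList("first", "second", "third"));
            long n = forEachLine(tmp, System.out::println);
            System.out.println("lines read: " + n);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```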

Lucas Oliveira
-1
    // Imports: java.io.*, java.nio.charset.StandardCharsets
    File f = new File("test.html");
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
        String content;
        while ((content = reader.readLine()) != null) {
            System.out.println(content);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
Keval Pithva