4

I am collecting full HTML from a service that provides access to a very large collection of blogs and news websites. I am checking the HTML as it comes (in real-time) to see if it contains some keywords. If it contains one of the keywords, I am writing the HTML to a text file to store it.

I want to do this for a week. Therefore I am collecting a large amount of data. Testing the program for 3 minutes yielded a text file of 100MB. I have 4 TB of space, and I can't use more than this.

Also, I don't want the text files to become too large, because I assume they'll become un-openable.

What I am proposing is to open a text file, and write HTML to it, frequently checking its size. If it becomes bigger than, let's say 200MB, I close the text file and open another. I also need to keep a running log of how much space I've used in total, so that I can make sure that I don't get close to 4 TB.

The question I have at this point is how to check the size of the text file before the file has been closed (using FileWriter.close()). Is there a function for this or should I count the number of characters written to the file and use that to estimate the file size?

A separate question: are there ways of minimising the amount of space my text files take up? I am working in Java.

Qwerky
Andrew

7 Answers

5

Create a writer which counts the number of characters written and use that to wrap your OutputStreamWriter.

[EDIT] Note: The correct way to save text to a file is:

new BufferedWriter( new OutputStreamWriter( new FileOutputStream( file ), encoding ) );

The encoding is important; it's usually "UTF-8".

This chain gives you two places where you can inject your wrapper: You can wrap the writer to get the number of characters or the inner OutputStream to get bytes written.
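As a rough illustration of the first option (wrapping the writer), here is a minimal sketch. `CountingWriter`, `CharCountDemo` and `writeAndMeasure` are hypothetical names made up for this example, not JDK classes; the sketch assumes UTF-8 and writes to a temporary file.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FilterWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class CharCountDemo {

    /** Hypothetical wrapper: counts the characters passed on to the wrapped writer. */
    static class CountingWriter extends FilterWriter {
        private long chars;

        CountingWriter(Writer out) {
            super(out);
        }

        @Override
        public void write(int c) throws IOException {
            out.write(c);
            chars++;
        }

        @Override
        public void write(char[] cbuf, int off, int len) throws IOException {
            out.write(cbuf, off, len);
            chars += len;
        }

        @Override
        public void write(String str, int off, int len) throws IOException {
            out.write(str, off, len);
            chars += len;
        }

        long getCharCount() {
            return chars;
        }
    }

    /** Writes text to a temporary file; returns {characters counted, bytes on disk}. */
    static long[] writeAndMeasure(String text) {
        try {
            File file = File.createTempFile("char-count", ".txt");
            file.deleteOnExit();
            CountingWriter writer = new CountingWriter(new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)));
            try {
                writer.write(text);
            } finally {
                writer.close(); // flushes the buffer, so file.length() is final
            }
            return new long[] { writer.getCharCount(), file.length() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = writeAndMeasure("<html>caf\u00e9</html>");
        System.out.println("chars=" + r[0] + " bytes=" + r[1]);
    }
}
```

For the sample string the writer sees 17 characters but the file holds 18 bytes, because `é` takes two bytes in UTF-8. The character count is therefore only an estimate of the file size; wrapping the inner `OutputStream` gives the exact byte count.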

Aaron Digulla
  • OK, thanks. I will try this. How can I know how many bytes a character requires? – Andrew Nov 21 '11 at 16:06
  • If you process English web pages, almost every character takes one byte: UTF-8 encodes ASCII characters in a single byte, and only non-ASCII characters need two to four. The UTF-8 encoding is pretty compact. But you can also wrap your `FileOutputStream`, which gives you the bytes instead. – Aaron Digulla Nov 21 '11 at 16:07
  • ok. i will try experimenting with this. the way i am going to count characters (maybe this is not the right way) is to keep a running total by using a Java string length method on every string that I write to the file – Andrew Nov 21 '11 at 16:09
4

In continuation of Aaron's answer: you can use CountingOutputStream. It is not part of the JDK, but libraries such as Apache Commons IO provide one. Just wrap your FileOutputStream in a CountingOutputStream and you will be able to tell how many bytes you have already written.
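A minimal sketch of how byte counting could drive the file rotation described in the question. To keep it runnable with the JDK alone, it uses a small hand-written `ByteCounter` as a stand-in for a library `CountingOutputStream`; `RotatingHtmlStore`, the `pages-N.html` file names and the `demo` helper are all assumptions made for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class RotatingHtmlStore implements Closeable {

    /** Stand-in for a library CountingOutputStream: counts bytes passed through. */
    static class ByteCounter extends FilterOutputStream {
        long count;

        ByteCounter(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }
    }

    private final File dir;
    private final long maxFileBytes;
    private ByteCounter counter;
    private Writer writer;
    private int fileIndex;
    private long closedBytes; // bytes in files that are already closed

    RotatingHtmlStore(File dir, long maxFileBytes) {
        this.dir = dir;
        this.maxFileBytes = maxFileBytes;
    }

    /** Appends one page; starts a new file once the current one reaches the limit. */
    void write(String html) throws IOException {
        if (writer == null) {
            File file = new File(dir, "pages-" + (fileIndex++) + ".html");
            counter = new ByteCounter(new BufferedOutputStream(new FileOutputStream(file)));
            writer = new OutputStreamWriter(counter, StandardCharsets.UTF_8);
        }
        writer.write(html);
        writer.flush(); // push encoded bytes through the counter
        if (counter.count >= maxFileBytes) {
            closeCurrent();
        }
    }

    private void closeCurrent() throws IOException {
        writer.close();
        closedBytes += counter.count;
        writer = null;
    }

    /** Running total across all files, including the one still open. */
    long totalBytes() {
        return closedBytes + (writer == null ? 0 : counter.count);
    }

    int filesOpened() {
        return fileIndex;
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            closeCurrent();
        }
    }

    /** Demo: writes 10-byte pages into a temp directory; returns {files opened, total bytes}. */
    static long[] demo(int pages, long maxFileBytes) {
        try {
            File dir = Files.createTempDirectory("html-store").toFile();
            try (RotatingHtmlStore store = new RotatingHtmlStore(dir, maxFileBytes)) {
                for (int i = 0; i < pages; i++) {
                    store.write("0123456789");
                }
                return new long[] { store.filesOpened(), store.totalBytes() };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a real limit you would pass something like `200L * 1024 * 1024` as `maxFileBytes` and compare `totalBytes()` against your overall disk budget before each write. A file can overshoot the limit by up to one page, since the check happens after the write.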

AlexR
3

To minimize space, you could zip your text files with Java. Why not add each file to a zip after closing it? After zipping, you can check the size of the zip to see your cumulative storage consumption.
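One way this could look with the JDK's `java.util.zip` classes, as a sketch only. It writes each finished file into its own new archive, since `ZipOutputStream` creates an archive rather than appending to an existing one; `ZipAfterCloseDemo`, `zip` and `demo` are hypothetical names, and the HTML in the demo is made up.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipAfterCloseDemo {

    /** Compresses one finished file into a new zip archive; returns the archive size in bytes. */
    static long zip(File source, File archive) throws IOException {
        try (InputStream in = new FileInputStream(source);
             ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(archive))) {
            zip.putNextEntry(new ZipEntry(source.getName()));
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                zip.write(buffer, 0, n);
            }
            zip.closeEntry();
        }
        // The archive is closed here, so length() reflects its final size.
        return archive.length();
    }

    /** Demo with made-up, repetitive HTML; returns {original bytes, zipped bytes}. */
    static long[] demo() {
        try {
            File source = File.createTempFile("pages", ".html");
            File archive = File.createTempFile("pages", ".zip");
            source.deleteOnExit();
            archive.deleteOnExit();
            StringBuilder html = new StringBuilder();
            for (int i = 0; i < 1000; i++) {
                html.append("<div class=\"post\"><p>keyword</p></div>\n");
            }
            Files.write(source.toPath(), html.toString().getBytes(StandardCharsets.UTF_8));
            return new long[] { source.length(), zip(source, archive) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After a successful `zip(...)` call the original text file can be deleted, and the returned archive size added to the running total of disk usage.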

ewan.chalmers
3

HTML compresses easily, with a high compression ratio. Consider using a GZIPOutputStream to "minimise the amount of space" your text files take up.
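A minimal sketch of placing a `GZIPOutputStream` in the write chain and reading the data back, assuming UTF-8 text. `GzipHtmlDemo` and its methods are hypothetical names, and the repetitive HTML in the demo is made up, so the compression it achieves says nothing about real pages.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipHtmlDemo {

    /** Writes the text through a gzip stream, so the file on disk is compressed. */
    static void writeGzipped(File file, String html) throws IOException {
        try (Writer writer = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(file)), StandardCharsets.UTF_8))) {
            writer.write(html);
        } // closing the writer also finishes the gzip stream
    }

    /** Reads a gzip-compressed text file back into a string. */
    static String readGzipped(File file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new FileInputStream(file)), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    /** Demo; returns {uncompressed bytes, bytes on disk, 1 if the round trip matched else 0}. */
    static long[] demo() {
        try {
            File file = File.createTempFile("pages", ".html.gz");
            file.deleteOnExit();
            StringBuilder html = new StringBuilder();
            for (int i = 0; i < 1000; i++) {
                html.append("<div class=\"post\"><p>keyword</p></div>\n");
            }
            String page = html.toString();
            writeGzipped(file, page);
            long raw = page.getBytes(StandardCharsets.UTF_8).length;
            boolean same = page.equals(readGzipped(file));
            return new long[] { raw, file.length(), same ? 1 : 0 };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the gzip stream buffers data internally, the size on disk lags behind what has been written until the stream is closed. If you need the compressed size while the file is still open, counting bytes between the `GZIPOutputStream` and the `FileOutputStream` is one option.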

ziesemer
2

Did it occur to you to count how many bytes you write to the file?

Thom
  • I guess this is essentially what I want to do, and I guess I do this by counting the number of characters written to the file, as suggested by Aaron. – Andrew Nov 21 '11 at 16:07
  • Yes, I voted for Aaron's answer too. I think that's the way to do it. – Thom Nov 21 '11 at 16:14
1

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;


public class TestFileWriter {

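    /**
     * Writes 1000 lines to test.txt and prints the file size every 100 lines.
     *
     * @param args unused
     * @throws IOException if the file cannot be written
     */
    public static void main(String[] args) throws IOException {
        File file = new File("test.txt");
        FileWriter fileWriter = new FileWriter(file);
        for (int i = 0; i < 1000; i++) {
            fileWriter.write("a very long string, a very long string, a very long string, a very long string, a very long string\n");
            if ((i % 100) == 0) {
                // File.length() only sees bytes already passed to the OS,
                // so flush the writer's buffer before measuring.
                fileWriter.flush();
                System.out.println("file size=" + file.length());
            }
        }
        fileWriter.close();
        // close() flushes any remaining data, so this is the final size.
        System.out.println("file size=" + file.length());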

    }

}

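This example demonstrates that, with a FileWriter, you can obtain the file's size in real time while the writer is still open. Note that File.length() reports only the bytes that have already reached the disk, so characters still sitting in the writer's buffer are not counted until you call flush(). If you want to save space you can zip the stream.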

Giovanni
0

Apologies for being slightly off-topic:

Does it have to be in Java? Depending on how you get your feed data, this sounds like a job for a fairly simple shell script to me (grep or fgrep for checking for keywords, gzip for compressing...)

beny23
  • I think best to stick to Java, as I know Java fairly well, and everything else is written in Java – Andrew Nov 21 '11 at 16:12