I have a large file containing nearly 250 million characters. I want to split it into parts of 30 million characters each (so the first 8 parts will contain 30 million characters and the last part will contain 10 million). Another point is that I want to include the last 1000 characters of each part at the beginning of the next part (i.e. part 1's last 1000 characters are prepended to part 2, so part 2 contains 30,001,000 characters, and so on). Can anybody help me do this programmatically (using Java) or with Linux commands (in a fast way)?
-
Why do you need the overlap? If you don't need it you can just use split and cat commands. – Roger Lindsjö Jun 24 '12 at 18:33
-
I am very curious to know what is the use-case for overlapping the pieces. – Miserable Variable Jun 24 '12 at 18:33
-
@RogerLindsjö, No. I need overlap. – Arpssss Jun 24 '12 at 18:33
-
"I want to split it into parts of each contains 30 million characters " That is a surprising thing to want, are you sure you are not doing this for some reason, or is that reason enough for you? – Peter Lawrey Jun 24 '12 at 18:38
-
@RogerLindsjö Actually, with the overlapping characters I want to make sure I don't miss any information from where the previous file left off. – Arpssss Jun 24 '12 at 18:46
-
@PeterLawrey, I want the splitting because I can't process 250 million characters in memory. – Arpssss Jun 24 '12 at 18:47
-
You should be able to process a huge file, much bigger than available RAM. What kind of processing are you doing? What kind of files are these? (I'm guessing: log files, SQL dumps, video or audio files??). – Basile Starynkevitch Jun 24 '12 at 18:50
-
@Arpssss if you load it as a byte array that's 250 MB; if you use a memory-mapped file it doesn't use any heap (< 1 KB). – Peter Lawrey Jun 24 '12 at 18:53
-
@PeterLawrey, I used the numbers above as an example to state my issue. Actually, my file is much bigger, somewhere around 3.2 GB, and it is a .txt file. – Arpssss Jun 24 '12 at 18:56
-
@BasileStarynkevitch, I used the numbers above as an example to state my issue. Actually, my file is much bigger, somewhere around 3.2 GB, and it is a .txt file. And I am using 4 GB of RAM. – Arpssss Jun 24 '12 at 18:56
-
You can still memory map a 3.2 GB or 32 GB file, but you will need to break it into portions of less than 2 GB; fortunately you can have overlapping regions. I suggest using a multiple of 4 KB as this is the natural page size. – Peter Lawrey Jun 24 '12 at 19:16
4 Answers
Just use the split or csplit commands with appropriate options.
You may want to drive these programs from a more complex shell script, or from some other scripting language, to give them appropriate arguments (in particular to deal with your overlapping requirement). Perhaps you might combine them with other utilities (like grep, head, tail, sed, awk, etc.).

-
Do either of these have overlapping pieces like the OP wants? – Miserable Variable Jun 24 '12 at 18:32
-
Thanks. But there is nothing there about splitting on a number of characters or appending the last 1000 characters. – Arpssss Jun 24 '12 at 18:32
One way is to use regular unix commands to split the file and then prepend the last 1000 bytes from the previous file.
First split the file:
split -b 30000000 inputfile part.
Then, for each part (ignoring the first), make a new file starting with the last 1000 bytes from the previous one:
unset prev
for i in part.*
do
    if [ -n "${prev}" ]
    then
        tail -c 1000 ${prev} > part.temp
        cat ${i} >> part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done
Before reassembling, we again iterate over the files, ignoring the first, and throw away the first 1000 bytes of each:
unset prev
for i in part.*
do
    if [ -n "${prev}" ]
    then
        tail -c +1001 ${i} > part.temp
        mv part.temp ${i}
    fi
    prev=${i}
done
The last step is to reassemble the files:
cat part.* >> newfile
Since there was no explanation of why the overlap was needed I just created it and then threw it away.
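Since the question also asks for a Java option, here is a minimal sketch of the same split-with-overlap idea in plain Java (the input name and the part.%03d output naming are placeholders I made up; the part size and overlap are the numbers from the question). It buffers one part at a time, roughly 60 MB of chars, and carries the last 1000 characters into the next part:
import java.io.*;

public class SplitWithOverlap {
    public static void main(String[] args) throws IOException {
        final int partSize = 30000000;     // characters per part
        final int overlap = 1000;          // characters repeated at the start of the next part
        char[] part = new char[partSize];  // one part buffered in memory (~60 MB of heap)
        char[] carry = new char[0];        // last characters of the previous part
        int partNo = 1;
        try (Reader in = new BufferedReader(new FileReader("inputfile.txt"))) {
            while (true) {
                // fill the buffer with up to partSize characters
                int len = 0, n;
                while (len < partSize && (n = in.read(part, len, partSize - len)) != -1) {
                    len += n;
                }
                if (len == 0) break;                       // nothing left to write
                try (Writer out = new BufferedWriter(
                        new FileWriter(String.format("part.%03d", partNo++)))) {
                    out.write(carry);                      // overlap taken from the previous part
                    out.write(part, 0, len);               // this part's own characters
                }
                // remember the last 'overlap' characters for the next part
                int keep = Math.min(overlap, len);
                carry = new char[keep];
                System.arraycopy(part, len - keep, carry, 0, keep);
                if (len < partSize) break;                 // reached end of file
            }
        }
    }
}
Because the overlap is written into the parts exactly as the question describes, putting the file back together would need the same "drop the first 1000 characters of every part but the first" step as the shell version above.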

You can try this. I had to use read/write mode the first time, as the file didn't exist yet. You can use read-only mode as this code suggests.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapFileDemo {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        long fileSize = 3200 * 1024 * 1024L;
        FileChannel raf = new RandomAccessFile("deleteme.txt", "r").getChannel();
        long midPoint = fileSize / 2 / 4096 * 4096; // align the split point to the 4 KB page size
        // two overlapping read-only regions, each well under the 2 GB per-mapping limit
        MappedByteBuffer buffer1 = raf.map(FileChannel.MapMode.READ_ONLY, 0, midPoint + 4096);
        MappedByteBuffer buffer2 = raf.map(FileChannel.MapMode.READ_ONLY, midPoint, fileSize - midPoint);
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f ms to map a file of %,d bytes long%n", time / 1e6, raf.size());
    }
}
This is running on a Windows 7 x64 box with 4 GB of memory.
Took 3.302 ms to map a file of 3,355,443,200 bytes long
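If two regions are not enough, the same approach generalizes: map the file in a loop of page-aligned regions, each below the 2 GB per-mapping limit, with a small overlap between neighbours. A rough sketch, with an arbitrary 1 GB region size and one 4 KB page of overlap (these figures are my own choices, not from the answer):
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class MapInOverlappingRegions {
    public static void main(String[] args) throws Exception {
        final long regionSize = 1L << 30;   // 1 GB per region, well under the 2 GB limit
        final long overlap = 4096;          // one 4 KB page shared with the next region
        try (FileChannel ch = new RandomAccessFile("deleteme.txt", "r").getChannel()) {
            long size = ch.size();
            List<MappedByteBuffer> regions = new ArrayList<MappedByteBuffer>();
            for (long pos = 0; pos < size; pos += regionSize) {
                // each mapping starts on a 4 KB-aligned offset and extends one extra page,
                // so neighbouring mappings overlap by 'overlap' bytes
                long len = Math.min(regionSize + overlap, size - pos);
                regions.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
            System.out.printf("Mapped %,d bytes in %d overlapping regions%n", size, regions.size());
        }
    }
}
Whether the extra page of overlap is useful depends on what you do at the region boundaries; the mapping itself is cheap either way, since nothing is read from disk until the pages are actually touched.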

You can do it using the BreakIterator class and its static method getCharacterInstance(), which returns a new BreakIterator instance for character breaks in the default locale.
You can also use getWordInstance(), getLineInstance(), etc. to break on words, lines, and so on.
e.g.:
BreakIterator boundary = BreakIterator.getCharacterInstance();
boundary.setText("Your_Sentence");
int start = boundary.first();
int end = boundary.next();
Iterate over it to get the characters.
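For example, a complete version of that loop might look like the following (the text is just a placeholder; each [start, end) pair marks one character boundary unit):
import java.text.BreakIterator;

public class CharacterBreaks {
    public static void main(String[] args) {
        String text = "Your_Sentence";
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(text);
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            // print the text between the current pair of boundaries
            System.out.println(text.substring(start, end));
        }
    }
}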
For more detail look at this link:
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html
