
First of all: I don't have deep knowledge of Java 8 streams, so what I'm about to ask may be trivial, impossible, or already implemented.

I'm working with records stored in large binary files. Each binary file is associated with a separate binary index that allows access to parts of the file using RandomAccessFile.

[Image omitted: diagram of the binary file showing the colored parts referred to below]

The interface would be:

public interface BinaryFile<T> extends Iterable<T> {
    CloseableIterator<T> queryUsingIndex(long beginIndex, long endIndex);
    CloseableIterator<T> iterator(); /* get all */
}

Say I want to count the number of records using Java 8 streams. As far as I understand, I could use a stream to count the records in the binary file, and a parallel stream would be faster because it could count the records in each colored part concurrently.

new BinaryFileImp(myFile).parallel().count();

Is it possible to implement this kind of Iterator using a random-access file? Where should I start? Which JDK classes should I consider?

Thank you for your suggestions.

EDIT: additional information. I'm working with SAM files (https://samtools.github.io/hts-specs/SAMv1.pdf), a common bioinformatics file format that stores millions of records along the genome. A common practice is to work on different chromosomes in parallel to speed things up. So, to count the records, I would sum the counts over { chromosome 1, chromosome 2, ... chromosome Y }.
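
To make the per-chromosome idea concrete, here is a minimal sketch of what I have in mind. It is only an illustration: `Region`, `countAll`, and `countByRange` are hypothetical names, and the stand-in counter simply pretends that every index position holds one record instead of reading a real SAM file.

```java
import java.util.List;
import java.util.function.ToLongFunction;

/** Sketch: count records region by region in parallel, then sum the partial counts. */
final class RegionCountSketch {

    /** Hypothetical index entry: a named region (e.g. a chromosome) and its index range. */
    static final class Region {
        final String name;
        final long beginIndex; // inclusive
        final long endIndex;   // exclusive

        Region(String name, long beginIndex, long endIndex) {
            this.name = name;
            this.beginIndex = beginIndex;
            this.endIndex = endIndex;
        }
    }

    /**
     * Sums the per-region counts. The counter is expected to open its own reader
     * (for example its own RandomAccessFile) so that threads do not share a file pointer.
     */
    static long countAll(List<Region> regions, ToLongFunction<Region> countInRegion) {
        return regions.parallelStream()
                      .mapToLong(countInRegion)
                      .sum();
    }

    /** Stand-in counter for the sketch: assumes one record per index position. */
    static long countByRange(Region r) {
        return r.endIndex - r.beginIndex;
    }
}
```

A call would look like `RegionCountSketch.countAll(regions, RegionCountSketch::countByRange)`, with the method reference replaced by something that actually iterates over `queryUsingIndex(beginIndex, endIndex)` and counts.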

Holger
Pierre
  • You need a Spliterator, not an iterator. If you iterate over the entries just to count them, the counting is trivial but the overhead of splitting that work is huge by comparison. The Spliterator will allow you to examine different portions of the data concurrently. – Peter Lawrey Mar 03 '16 at 09:42
  • Since you only need to examine the index, you would need millions of values before you see a benefit in using multiple threads. If you really want to speed this up, I suggest storing a size in the index. – Peter Lawrey Mar 03 '16 at 09:44
  • @PeterLawrey I do have millions of values and the format is already specified. https://samtools.github.io/hts-specs/SAMv1.pdf – Pierre Mar 03 '16 at 09:46
  • In that case you need a Spliterator. I would look at the ones in the JDK esp the one for ArrayList. – Peter Lawrey Mar 03 '16 at 09:47
  • @Tunaki how would doing what you suggest in the duplicate improve performance? I did already suggest this would be much, much slower and explained why. – Peter Lawrey Mar 03 '16 at 09:49
  • @PeterLawrey OP has an Iterable and wants a Stream. The duplicate is exactly about that: it says to obtain a spliterator, so they need to implement one. It also mentions having an Iterable that is a Collection. – Tunaki Mar 03 '16 at 09:54
  • I will post my answer soon. I've implemented a solution counting 61,045,456 records: using stream(): 374 secs, parallelStream(): 79 secs – Pierre Mar 04 '16 at 10:16
  • Here it is: http://plindenbaum.blogspot.fr/2016/03/reading-vcf-file-faster-with-java-8.html – Pierre Mar 04 '16 at 16:27
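
As a rough illustration of the Spliterator suggested in the comments, the sketch below splits an index range [begin, end) in half until the pieces are small, then reads each piece sequentially. It is not the code from the linked blog post: `IndexRangeSpliterator`, the `recordAt` reader function, and the split threshold of 1,024 are assumptions made for the example. A real version would read records through the file's index instead of a `LongFunction`.

```java
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.function.LongFunction;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Sketch of a splittable source over the index range [begin, end). */
final class IndexRangeSpliterator<T> implements Spliterator<T> {
    private static final long MIN_SPLIT_SIZE = 1_024; // arbitrary threshold for the sketch

    private long current;                   // next index to read
    private final long end;                 // exclusive upper bound
    private final LongFunction<T> recordAt; // hypothetical random-access reader

    IndexRangeSpliterator(long begin, long end, LongFunction<T> recordAt) {
        this.current = begin;
        this.end = end;
        this.recordAt = recordAt;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        if (current >= end) {
            return false; // this part of the range is exhausted
        }
        action.accept(recordAt.apply(current++));
        return true;
    }

    @Override
    public Spliterator<T> trySplit() {
        long remaining = end - current;
        if (remaining < 2 * MIN_SPLIT_SIZE) {
            return null; // too small to be worth splitting
        }
        long mid = current + remaining / 2;
        // Hand the first half to a new spliterator; this one keeps the second half.
        Spliterator<T> prefix = new IndexRangeSpliterator<>(current, mid, recordAt);
        current = mid;
        return prefix;
    }

    @Override
    public long estimateSize() {
        return end - current; // exact, because the range is known up front
    }

    @Override
    public int characteristics() {
        // IMMUTABLE assumes the underlying file does not change while streaming.
        return ORDERED | SIZED | SUBSIZED | IMMUTABLE;
    }

    /** Builds a parallel stream over [begin, end) using the given reader. */
    static <T> Stream<T> parallelStream(long begin, long end, LongFunction<T> recordAt) {
        return StreamSupport.stream(new IndexRangeSpliterator<>(begin, end, recordAt), true);
    }
}
```

With a file-backed reader, each split would probably need its own RandomAccessFile, since a single instance has one file pointer and concurrent seeks from several threads would interfere with each other. Also note that because the spliterator reports SIZED, newer JDKs may answer `count()` from `estimateSize()` without reading any record.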
