2

For a project I am working on, I am trying to count the vowels in text file as fast as possible. In order to do so, I am trying a concurrent approach. I was wondering if it is possible to concurrently read a text file as a way to speed up the counting? I believe the bottleneck is the I/O, and since right now I am reading the file in via a buffered reader and processing line by line, I was wondering if it was possible to read multiple sections of the file at once.

My original thought was to use Split File - Java/Linux

but apparently MappedByteBuffers are not great performance wise, and I still need to read line by line from each MappedByteBuffer once I split.

Another option is to split after reading a certain number of lines, but that defeats the purpose.

Would appreciate any help.

Community
  • 1
  • 1
jstnchng
  • 2,161
  • 4
  • 18
  • 31
  • Is this part the most time consuming part of your program? Maybe you can just cache it. – huseyin tugrul buyukisik Jun 01 '15 at 21:28
  • 3
    Unless you have a very large RAID 0 array or one or more SSD drives, reading a file concurrently will just cause disk thrashing and _slow down_ your application. Given the processing task you have is very simple (read fast) you will unlikely be able to gain from threading as you are IO constrained and not CPU constrained. – Boris the Spider Jun 01 '15 at 21:31
  • `as fast as possible` Can you explain why? What's the problem with just reading the file? – copeg Jun 01 '15 at 21:32
  • @huseyintugrulbuyukisik yep! the rest is super fast vowel counting... – jstnchng Jun 01 '15 at 21:42
  • @BoristheSpider I have about a 1G text file. This is kind of for training purposes at work, just more of an exercise. I was wondering more generally, is it possible to get to a midpoint of a file WITHOUT reading the lines before? I know there's some kind of Random Access File... – jstnchng Jun 01 '15 at 21:42
  • @copeg more of an exercise I suppose... Just learning multithreading in Java and seeing how I can speed up performance – jstnchng Jun 01 '15 at 21:43
  • Midpoint in `byte`s, yes - easily. Middle `char`, significantly more difficult as some `char`s can span multiple bytes. Middle sentence, or semantic construct, extremely difficult. – Boris the Spider Jun 01 '15 at 21:47
  • @BoristheSpider won't chars only span 2 bytes in Java? – jstnchng Jun 02 '15 at 14:12
  • 1
    In Java a logical character doesn't fit into one `char`, so you have the idea of a "Code Point" which is an `int`. The issue, you see, is that some `char`s are "special" and serve only to modify their neighbours. So in order to find the code point you need to work out whether you are looking at a [surrogate `char`](https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html) or a normal one. This all gets rather messy rather quickly... [Further reading](https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java). – Boris the Spider Jun 02 '15 at 14:16
  • @BoristheSpider I see... right now I have `MappedByteBuffer buffer = fc.map(FileChannel.MapMode.READ_ONLY, i*inputSplitFileSize, inputSplitFileSize);` with fc being a FileChannel for a RandomAccessFile. Are you saying this won't necessarily split the chars evenly? (Thanks for the further reading by the way) – jstnchng Jun 02 '15 at 14:21
  • It all depends on the encoding of the file - I think you should be okay for UTF-8 or UTF-32, but in UTF-16 (which Java uses internally) you have these weird pair characters. So the answer is, as ever, it depends. Basically, one should not treat text as bytes unless one really understands the encoding being used. YMMV... – Boris the Spider Jun 02 '15 at 14:27
  • @BoristheSpider Fair enough... thanks for your insight! Even if I'm using UTF-16 I only have these issues for special chars between 0x10000 to 0x10FFFF right? – jstnchng Jun 02 '15 at 14:28
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/79445/discussion-between-jstnchng-and-boris-the-spider). – jstnchng Jun 02 '15 at 16:28

1 Answers1

0

The following will NOT split the file - but can help in concurrently processing it!

Using Streams in Java 8 you can do things like:

Stream<String> lines = Files.lines(Paths.get(filename));
lines.filter(StringUtils::isNotEmpty) // ignore empty lines

and if you want to run in parallel you can do:

lines.parallel().filter(StringUtils::isNotEmpty) 

In the example above I was filtering empty lines - but of course you can modify it to your use (counting vowels) by implementing your own method and calling it.

Nir Alfasi
  • 53,191
  • 11
  • 86
  • 129
  • 2
    This will "buffer" up a number of lines from the file and pass off each chunk to the rest of the pipeline, parallelising the processing - an example of a producer/consumer pattern (ish). Marko Topolnik has a [very good article](https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams) on how this all works. The short story is though, that unless the processing is fairly slow-running the default behaviour will not be ideal. – Boris the Spider Jun 01 '15 at 21:45
  • @BoristheSpider thanks for the link - I'll check it out tonight! – Nir Alfasi Jun 01 '15 at 22:04