4

I have a file with about 10.000.000 lines of text (yes I have enough memory). now I want a List of MyClass (Constructor is MyClass(String s) with every line of the file. Right now I am doing it like this:

List<MyClass> help = Files.lines(Paths.get(s))
                          .parallel()
                          .map(MyClass::new)
                          .collect(Collectors.toList());

but it takes Years to progress. Any Idea on how to speed up this problem?

Tagir Valeev
  • 97,161
  • 19
  • 222
  • 334
Exagon
  • 4,798
  • 6
  • 25
  • 53
  • You could add informations about what you are willing to do and what you want to avoid. – Pshemo Nov 12 '15 at 23:37
  • And what is that process of yours? From what it looks you just create a new instance of some class from each line. More worryingly, you should `.close()` that stream of yours: `Files.lines()` is I/O bound. – fge Nov 12 '15 at 23:37
  • 1
    Another version of http://stackoverflow.com/questions/33682281/whats-the-fastest-way-to-read-line-per-line ? – PM 77-1 Nov 12 '15 at 23:41
  • @PM77-1 I would say more like *initial* than *another*. – Pshemo Nov 12 '15 at 23:43
  • how can i `.close` this stream? sorry i am new to streams. i just need this list to set to an ListView in a gui – Exagon Nov 12 '15 at 23:43
  • _Yes I have enough memory?_ A ten million line text file is not very big by today's standards. – Solomon Slow Nov 13 '15 at 00:05
  • 1
    As a very simple optimization you may try `Arrays.asList(Files.lines(...)...toArray(MyClass[]::new))` instead of `.collect(Collectors.toList())`. However having [MCVE](http://stackoverflow.com/help/mcve) would be nice. – Tagir Valeev Nov 13 '15 at 09:50

1 Answers1

2

First things first, a relevant extract from the documentation of Collectors.toList():

[...]There are no guarantees on the type, mutability, serializability, or thread-safety of the List returned; if more control over the returned List is required, use toCollection(Supplier)

Now, let us look a little more deeply into a collector's characteristics; we find this:

public static final Collector.Characteristics CONCURRENT

Indicates that this collector is concurrent, meaning that the result container can support the accumulator function being called concurrently with the same result container from multiple threads.

If a CONCURRENT collector is not also UNORDERED, then it should only be evaluated concurrently if applied to an unordered data source.

Now, nothing guarantees that the collector returned by Collectors.toList() is Concurrent at all.

Notwithstanding the time which it may take to initiate a new class of yours, the safe bet here would be to assume that this collector is not concurrent. But fortunately we have a means to use a concurrent collection instead, as mentioned in the javadoc. So, let's try:

.collect(
        Collector.of(CopyOnWriteArrayList::new,
            List::add,
            (o, o2) -> { o.addAll(o2); return o; },
            Function.<List<String>>identity(),
            Collector.Characteristics.CONCURRENT,
            Collector.Characteristics.IDENTITY_FINISH
        )
    )

This may speed things up.

Now, you have another problem. You do not close you stream.

This is little known but a Stream (whether of any type or an {Int,Double,Long}Stream for that matter) implements AutoCloseable. You want to close streams which are I/O bound and Files.lines() is such a stream.

So, try this:

final List<MyClass> list;

try (
    final Stream<String> lines = Files.lines(...);
) {
    list = lines.parallel().map(MyClass::new)
        .collect(seeAbove);
}
Community
  • 1
  • 1
fge
  • 119,121
  • 33
  • 254
  • 329
  • with this Collector Eclipse gives me an error `Type mismatch: cannot convert from Object to List` – Exagon Nov 12 '15 at 23:58
  • 2
    Well, um, does it at least speed things up for you? :p – fge Nov 13 '15 at 00:16
  • yes it speed things up but also takes about 10 minutes ... sooo... yeah...how is it possible this task to only use 0.7 of my cpu btw? – Exagon Nov 13 '15 at 00:20
  • Then what about you think a little higher? For instance, wouldn't you be able to replace your .collect() call with a .forEach() instead? – fge Nov 13 '15 at 00:25
  • .forEach(list::add) ? like this? – Exagon Nov 13 '15 at 00:30
  • 1
    No; what I meant is, is it really necessary to collect all of your newly created objects into a collection to start with? Can't you use them instead to do whatever it is that you need to do? Ie, stream.parallel().map(MyClass:new).forEach( /* do something with instance of MyClass */ ) – fge Nov 13 '15 at 00:32
  • The answer is completely misleading. If collector is not CONCURRENT, it [does not mean](http://stackoverflow.com/q/22350288/4856258) that it cannot be used for parallel stream. Special CONCURRENT processing speeds up the unordered collection eliminating the needs to execute the `combiner`. In this case the proposed solution will not use concurrent collection as both source and collector are ordered, so this would actually slow down the process even more. – Tagir Valeev Nov 13 '15 at 09:47