0

My scenario

Filter a Java Collection such a frequent scenario and I am seeking the best methodology to do so. After some searching in StackOverflow, I can only see recommendation of specific collection filtering method without statistical comparsions for speed. Somebody recommended Predicate Some recommended lambdaj

I would like to see if somebody studied that already. So, the question is: "What is the best way to filter a Java Collection, in terms of readability and speed?"

Scope

I would like to know the study result of some remarkable methodology in the web, they are: (In certain, if you have another (better) choice, feel free to edit my question)

  1. While looping
  2. CollectionUtils Predicate
  3. lambdaj
  4. CQEngine

Possible performance study details

We may assume there is a WS call which need 0.05seconds to execute. And we may call it for 100 times.

getSomething() // 0.05 seconds to execute

Then we may compare their performance in scenarios like

  1. Single Result performance (assume there is only one result returned by certain criteria. When the object is found, quit the searching) 1.1. Best case scenario 1.2 Worst case scenario 1.3 Average case scenario

  2. Multiple result performance (assume there is a collection returned, may more than one result)

Thanks for answering!

Community
  • 1
  • 1
Michael W.
  • 101
  • 8
  • 2
    Whatever the method is, it always boils down to looping over the collection, testing each element, and add the matching to a list. It's O(n). I doubt the difference between methods is significant in terms of performance. Th new Java 8 Stream API will become the standard way of doing it, and allows parallelism if needed (which is extremely rare): `list.stream().filter(i -> isAccepted(i)).collect(Collectors.toList());`. Choose the way you find the most readable, and optimize only if needed. – JB Nizet Aug 01 '15 at 14:06
  • @JBNizet I have doubt on your prediction before there is a objective research. At least CQEngine claimed itself better than a iternation naive [ref.](https://code.google.com/p/cqengine/) – Michael W. Aug 01 '15 at 14:19
  • Downvoter, please reconsider this as valid question. In fact it should really be a common scenario and problem that programmers would face – Michael W. Aug 01 '15 at 14:20
  • 1
    You clearly understand that some sort of statistical performance test needs to occur here to answer your question. You should conduct this test yourself by filtering a large collection a large number of times for each methodology and averaging the results. Stack Overflow is not somewhere to get other people to do work for you. (Of course I agree with JB Nizet that you should assume that the way you choose is fast enough (Oracle know what they are doing) and only optimise as needed.) – user1886323 Aug 01 '15 at 14:43
  • 1
    @MichaelW. CQEngine doesn't filter a collection in its benchmark. It executes a query on an indexed collection. That's a completely different matter. If you add the time to index the collection to the time needed to filter it, I don't see how it could be faster. – JB Nizet Aug 01 '15 at 15:04
  • Thanks for everyone's comment. @user1886323 In fact I am looking for existing information which somebody have already studied. By posting it, everyone may get benefit and no need to study answered research, which I think it is a good thing (What StackOverflow is doing). I would do so if no one answer/No one studied this yet – Michael W. Aug 01 '15 at 15:10
  • @user1886323 Last but not least, I agree with your comment, but still have a doubt whether there is different of performance between different methodologies because internal algorithm may be different. – Michael W. Aug 01 '15 at 15:12
  • @JB Nizet Thanks for clarifying. That's a good point. I think you've given me another field for study (indexing before searching for filtering collection) – Michael W. Aug 01 '15 at 15:14
  • 1
    Micheal, the existing information is being given to you - the accepted point of view in the industry is that it makes no different regarding performance, so a deeper study is unlikely to be found on StackOverflow. This is why you should do it yourself! (And you will likely find negligible differences because as someone else has already mentioned it's an O(n) operation and you would have to try pretty hard to implement a worse algorithm than O(n)) – user1886323 Aug 01 '15 at 15:57
  • @user1886323 Thanks again. Okay I got it. Should someone reply this as answer? (and if I have further findings I may post it here) Or it is just a not useful question (deserve to got downvoted)? Sorry for being troublesome, I am new here. – Michael W. Aug 01 '15 at 16:38

1 Answers1

1

For most uses, it is enough to assume that all the different ways of doing this have roughly the same performance (O(n)) therefore this is not something that is studied very often.

If you want to find the "best" way of doing this, you should conduct your own performance test on some large collections of varying objects. If you find any significant differences in performance you can post them here.

user1886323
  • 1,179
  • 7
  • 15
  • In my Android benchmarks, parallelStream () destroyed competition. Here is a spreadsheet with my results: https://docs.google.com/spreadsheets/d/10IqZ42ICAnugmCcMJkknqxG8haOcwP0Xx3tPz_cuAHs/edit?usp=drivesdk – Sam Jan 31 '22 at 03:51