I would like to split up a large amount of data. For example, is there a way to use it like this (in Java)?

idList.parallelStream()
    .splitCount(20)
    .eachMap((splitIdList) -> dao.list(splitIdList))
    .aggregate()
    .collect(Collectors.toList());

Is there a way to solve it using streams or other libraries?
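
For reference, a rough approximation of that pipeline with plain `java.util.stream` and `subList` (no extra library); it reads `splitCount(20)` as a chunk size, and `MyRecord` stands in for whatever `dao.list` returns, both of which are assumptions:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Chunk idList into sublists of 20 by index, query each chunk
// (possibly in parallel), then flatten the per-chunk results.
// MyRecord is a placeholder for whatever dao.list(...) returns.
int chunk = 20;
List<MyRecord> records = IntStream.range(0, (idList.size() + chunk - 1) / chunk)
        .parallel()
        .mapToObj(i -> idList.subList(i * chunk, Math.min((i + 1) * chunk, idList.size())))
        .flatMap(ids -> dao.list(ids).stream())
        .collect(Collectors.toList());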

Yeongjun Kim
  • Sure, but why would you do that? Presumably your `dao` accesses the database, so doing that in parallel isn't going to increase performance. It will harm it. – Kayaman Oct 09 '17 at 13:36
  • DUPE https://stackoverflow.com/questions/27583623/is-there-an-elegant-way-to-process-a-stream-in-chunks – tbsalling Oct 09 '17 at 13:53
  • @Kayaman Thanks, in my case the `idList` is large (max 10,000) and `dao.list()`'s SQL is `SELECT * FROM table WHERE id IN (...)`. Wouldn't it improve performance in this situation? – Yeongjun Kim Oct 09 '17 at 14:05
  • How would it? You're still fetching the same number of rows, but instead of letting the database be as efficient as possible, you're dividing the work into separate queries. The only reason you would need to chunk is that there are database-specific limits on the number of values you can have in an `IN ()` clause (such as Postgres' 32768). Parallelizing just makes you use more resources (threads and connections), and you're not even going to get any performance benefit, because you're doing a job that is best left to the database (planning the query and aggregating the results). – Kayaman Oct 09 '17 at 14:10
  • @Kayaman I believe `SELECT * FROM table WHERE id IN (...)` in batches is much faster than executing `select * from table where id = ?` with a single id, one by one, hundreds of times. We ran this kind of test in our project before. – 123-xyz Oct 09 '17 at 17:56
  • @123-xyz Of course it is. But that's not at all what was being discussed in this question. If he has only 10,000 values, there's absolutely no need or use to split them up. Nowhere was it suggested that he select the rows one by one. – Kayaman Oct 09 '17 at 18:08
  • @Kayaman Are you suggesting putting all the ids in one query? First, some databases have a fairly small limit (e.g. Oracle's is 1000). Second, putting a huge number of ids into an `IN` clause may choke the database, or make the response take much longer even if the total time is less than fetching in batches (of hundreds). Generally we would suggest splitting the ids into batches (of hundreds) if the list is very big (10K in the OP). – 123-xyz Oct 09 '17 at 18:26
  • @123-xyz If 10,000 is a lot to you, then you haven't worked with large amounts of data. As I said in my earlier comment, the only reason to split would be if the db limits the amount of values. However the OP's comments all indicate that he's looking for some kind of performance increase. When you say "we suggest", who exactly is "we", and on what grounds should your suggestions be followed? In your first comment you didn't even understand what was being discussed, and thought that I suggested he should select the rows one by one. – Kayaman Oct 09 '17 at 18:33
  • @opid What database are you on? – Kayaman Oct 09 '17 at 18:33
  • @Kayaman, I meant that we were told to split a big list of ids into batches, sorry for the confusion, and yes, I misunderstood your suggestion at the beginning. But I don't think there is any relationship between writing a query with 10,000 elements in an `IN` clause and the size of the target table, or the amount of data people have worked with. To me, it's a rather rare case to write a query with 10,000 elements in an `IN` clause. And coming back to the OP, I think splitting a big amount of data into batches is a general use case for query/update/delete. – 123-xyz Oct 09 '17 at 19:12
  • @123-xyz arguments of "we are told", "it's a rare case for me" and "I think it's a general use case" don't really hold water. Batch **inserts** are a common case, but unless you understand the database you're working with, don't start guessing things. – Kayaman Oct 09 '17 at 19:26
  • @Kayaman I use [cubrid](https://www.cubrid.org/) – Yeongjun Kim Oct 11 '17 at 04:53
  • Can't find any info about limits of the `IN` predicate in Cubrid, but if it's large enough, don't forget to test the performance between splitting and doing a direct query. This is *not* standard practice, and if the database is any good (although it seems to be quite young), you would get the best performance without any artificial splitting. – Kayaman Oct 11 '17 at 05:50

2 Answers

Although the remarks of @Kayaman are very relevant, I think what you're looking for is `java.util.Spliterator<T>`:

An object for traversing and partitioning elements of a source. The source of elements covered by a Spliterator could be, for example, an array, a Collection, an IO channel, or a generator function.

A Spliterator may traverse elements individually (tryAdvance()) or sequentially in bulk (forEachRemaining()).

A Spliterator may also partition off some of its elements (using trySplit) as another Spliterator, to be used in possibly-parallel operations. Operations using a Spliterator that cannot split, or does so in a highly imbalanced or inefficient manner, are unlikely to benefit from parallelism. Traversal and splitting exhaust elements; each Spliterator is useful for only a single bulk computation.

The class's JavaDoc is quite elaborate, including an example of how to implement and use a Spliterator.
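
For illustration, here is a minimal sketch of `trySplit()` on a list spliterator (this is not the JavaDoc's own example, and the names are made up):

import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    public static void main(String[] args) {
        List<Integer> idList = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        Spliterator<Integer> rest = idList.spliterator();
        // trySplit() carves off a prefix of the elements into a new
        // Spliterator, or returns null if the source cannot be split.
        Spliterator<Integer> prefix = rest.trySplit();

        if (prefix != null) {
            prefix.forEachRemaining(id -> System.out.println("first part:  " + id));
        }
        rest.forEachRemaining(id -> System.out.println("second part: " + id));
    }
}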

geert3

Here is a sample solution using my fork of StreamEx:

StreamEx.of(idsList).parallel()
        .splitToList(100)
        .flatCollection(ids -> dao.list(ids)) // assume dao.list(...) returns a list of records; switch to flatArray/flatMap depending on the return type of dao.list
        .toList();
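
For reference, stock StreamEx (without the fork) can do the same chunking with `StreamEx.ofSubLists`; `MyRecord` is a placeholder for whatever type `dao.list` returns:

import one.util.streamex.StreamEx;

// ofSubLists splits idsList into consecutive sublists of at most
// 100 elements; each sublist becomes one dao.list(...) call.
List<MyRecord> records = StreamEx.ofSubLists(idsList, 100)
        .parallel()
        .flatCollection(dao::list)
        .toList();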

Update: I ran some performance tests in response to the comments under the OP (times in milliseconds):


ids in the list:    100     1,000    10,000
All in one query:   3.5     6.9      282.2
By batch (200):     3.5     4.8      186.0
One by one:         7.8     35.5     572.9

Based on the performance test results on MySQL 5.7 and my personal experience, I would say:

1. If there are only a couple of ids, it may be OK to query them one by one.

2. If there are hundreds, thousands, or more, batching or a single all-in-one query is necessary.

3. If you want to execute the query with all the ids in one statement, be aware:

   a) There is a limit on the number of `IN` values: 1000 for Oracle, 8623 for SQL Server.

   b) Being within the limit doesn't mean the database is tuned for that many values.

   c) A big query may take a long time to execute, and long-running queries can choke the database.

   d) You may hit a timeout or exceed the maximum packet size.
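
To make the batching in point 2 concrete, here is a minimal plain-JDBC sketch; the table name `my_table`, the `id` column, and the batch size of 200 are all assumptions for illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class BatchFetch {
    // Hypothetical helper: fetch rows for a large id list in chunks
    // of 200, issuing one IN (...) query per chunk.
    public static List<Long> fetchInBatches(Connection conn, List<Long> ids) throws Exception {
        final int batchSize = 200;
        List<Long> found = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<Long> batch = ids.subList(from, Math.min(from + batchSize, ids.size()));
            StringJoiner in = new StringJoiner(",", "(", ")");
            batch.forEach(id -> in.add("?"));                  // one placeholder per id
            String sql = "SELECT id FROM my_table WHERE id IN " + in;
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < batch.size(); i++) {
                    ps.setLong(i + 1, batch.get(i));           // JDBC parameters are 1-based
                }
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        found.add(rs.getLong("id"));
                    }
                }
            }
        }
        return found;
    }
}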

123-xyz
  • You can use `dao::list` instead of `ids -> dao.list(ids)`. – shmosel Oct 10 '17 at 01:54
  • Downvoted for "personal experience". It's also never okay to query one by one, if you know you need to fetch multiple rows. As to your case `b)`, you're going to have to give some proof if you imply that it's less effective. Same for c and d. See [here](https://stackoverflow.com/questions/22797896/what-are-the-performance-implications-of-oracle-in-clause-with-no-joins) for an example case with Oracle. If you do a performance test, you need to show code and settings, query plans, otherwise we can't trust you for testing correctly. – Kayaman Oct 10 '17 at 05:16
  • I'm not out to get you, but this answer has very little value and provides wrong solutions. Max packet size is configurable in MySQL for example, executing big queries doesn't take longer than multiple small queries, unless the planner makes an error (which means you've made an error in your configuration). There have been hundreds of man years put into making databases smart, your personal experience doesn't hold a candle to it. – Kayaman Oct 10 '17 at 05:30