4

I have 270,000 records in a CSV file with the columns user_id, book_ISBN and book_rating, and I need to insert them into a many-to-many table. I parsed the data with the OpenCSV library, and the result is a list.

public List<UserRatingDto> uploadRatings(MultipartFile file) throws IOException {
    BufferedReader fileReader = new BufferedReader(
            new InputStreamReader(file.getInputStream(), "UTF-8"));

    List<UserRatingDto> ratings = new CsvToBeanBuilder<UserRatingDto>(fileReader)
            .withType(UserRatingDto.class)
            .withSeparator(';')
            .withIgnoreEmptyLine(true)
            .withSkipLines(1)
            .build()
            .parse();
    return ratings;
}

There are no performance issues with this; parsing takes approximately 1 minute. However, in order to insert these records into a table, I need to fetch the books and users from the DB to form the relationship. I tried making the method async with the @Async annotation, I tried a parallel stream, and I tried putting the objects into a stack and using saveAll() to bulk insert, but it still takes way too much time.

public void saveRatings(final MultipartFile file) throws IOException {
    List<UserRatingDto> userRatingDtos = uploadRatings(file);

    userRatingDtos.parallelStream().forEach(bookRating -> {
        UserEntity user = userRepository.findByUserId(bookRating.getUserId());
        bookRepository.findByISBN(bookRating.getBookISBN()).ifPresent(book -> {
            BookRating bookRating1 = new BookRating();
            bookRating1.setRating(bookRating.getBookRating());
            bookRating1.setUser(user);
            bookRating1.setBook(book);
            book.getRatings().add(bookRating1);
            user.getRatings().add(bookRating1);
            bookRatingRepository.save(bookRating1);
        });
    });
}

This is what I have now. Is there anything I can change to make it faster?

Elen Mouradian
  • If the bookIsbn and userId are foreign key references then you can avoid fetching book and user objects. https://stackoverflow.com/a/6311954 – JavaLearner Jun 08 '22 at 14:22

2 Answers

6

The problem is that the data is being fetched and persisted one record at a time. The most performant way to access data is usually in well-defined batches, following the pattern:

  • fetch data required for processing the batch
  • process the batch in memory
  • persist processing results before fetching the next batch

For your specific use case, you can do something like:

    public void saveRatings(final MultipartFile file) throws IOException {
        List<UserRatingDto> userRatingDtos = uploadRatings(file);

        // Split the list into batches
        getBatches(userRatingDtos, 100).forEach(this::processBatch);
    }

    // (uses a static import of java.util.stream.Collectors.toMap)
    private void processBatch(List<UserRatingDto> userRatingBatch) {

        // Retrieve all data required to process a batch
        Map<String, UserEntity> users = userRepository
                .findAllById(userRatingBatch.stream().map(UserRatingDto::getUserId).toList())
                .stream()
                .collect(toMap(UserEntity::getId, user -> user));
        Map<String, Book> books = bookRepository.findAllByIsbn(userRatingBatch.stream().map(UserRatingDto::getBookISBN).toList())
                .stream()
                .collect(toMap(Book::getIsbn, book -> book));

        // Process each rating in memory
        List<BookRating> ratingsToSave = userRatingBatch.stream().map(bookRatingDto -> {
            Book book = books.get(bookRatingDto.getBookISBN());
            UserEntity user = users.get(bookRatingDto.getUserId());
            // Skip ratings whose book or user was not found in this batch's lookups
            if (book == null || user == null) {
                return null;
            }
            BookRating bookRating = new BookRating();
            bookRating.setRating(bookRatingDto.getBookRating());
            bookRating.setUser(user);
            bookRating.setBook(book);
            book.getRatings().add(bookRating);
            user.getRatings().add(bookRating);
            return bookRating;
        }).filter(Objects::nonNull).toList();

        // Save data in batches
        bookRatingRepository.saveAll(ratingsToSave);
        bookRepository.saveAll(books.values());
        userRepository.saveAll(users.values());

    }

    public <T> List<List<T>> getBatches(List<T> collection, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < collection.size(); i += batchSize) {
            batches.add(collection.subList(i, Math.min(i + batchSize, collection.size())));
        }
        return batches;
    }

Note that all I/O should always be done in batches. A single DB lookup or save inside the inner processing loop defeats the whole approach.

You can try different batch sizes to see what performs best: the bigger the batch, the longer its transaction stays open, and bigger batches don't always result in better overall performance.
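One more thing that often matters for saveAll(): Hibernate doesn't batch the generated INSERTs unless JDBC batching is enabled. Here's a minimal sketch, assuming Spring Boot with Spring Data JPA/Hibernate and a @Configuration class; the value 100 is just an example, and the same settings can equivalently go into application.properties via spring.jpa.properties.*:

@Bean
public HibernatePropertiesCustomizer jdbcBatchingCustomizer() {
    // Group up to 100 pending inserts/updates for the same entity into one JDBC batch
    return properties -> {
        properties.put("hibernate.jdbc.batch_size", 100);
        properties.put("hibernate.order_inserts", true);
        properties.put("hibernate.order_updates", true);
    };
}

Be aware that Hibernate silently disables insert batching for entities that use IDENTITY id generation, so a sequence-based generator may be needed to actually see the benefit.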

Also, make sure you handle errors gracefully - for example:

  • if a batch throws an error, you can split that batch in two, and so on, until only the single failing rating remains (a sketch of this follows the list).
  • you can also retry a failing batch with backoff if, for example, there's a DB access problem.
  • you can discard a rating if, for example, a required field is null.
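A minimal sketch of the batch-splitting idea, assuming processBatch from above runs in its own transaction that rolls back cleanly on failure ('log' is assumed to be an SLF4J logger):

private void processBatchSafely(List<UserRatingDto> batch) {
    try {
        processBatch(batch);
    } catch (RuntimeException e) {
        if (batch.size() == 1) {
            // Down to a single rating - log and discard it (or park it for manual review)
            log.warn("Discarding rating that could not be saved: {}", batch.get(0), e);
            return;
        }
        // Split the failing batch in two and retry each half
        int middle = batch.size() / 2;
        processBatchSafely(batch.subList(0, middle));
        processBatchSafely(batch.subList(middle, batch.size()));
    }
}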

EDIT: As per the OP's comment, this increased performance more than 10x. Also, if ordering is not important, performance can still be greatly improved by processing the batches in parallel.

EDIT 2: As a general pattern, ideally we wouldn't have all the records in memory to begin with, but would retrieve the data to be processed in batches as well. This would further improve performance and avoid OOM errors.
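For example, OpenCSV's CsvToBean is Iterable, so the file from the question can be streamed and handed to processBatch chunk by chunk instead of being fully parsed into a list first. A rough sketch along those lines (the method name and the batch size of 100 are just examples):

public void saveRatingsStreaming(MultipartFile file) throws IOException {
    try (BufferedReader fileReader = new BufferedReader(
            new InputStreamReader(file.getInputStream(), "UTF-8"))) {

        CsvToBean<UserRatingDto> csvToBean = new CsvToBeanBuilder<UserRatingDto>(fileReader)
                .withType(UserRatingDto.class)
                .withSeparator(';')
                .withIgnoreEmptyLine(true)
                .withSkipLines(1)
                .build();

        List<UserRatingDto> batch = new ArrayList<>(100);
        for (UserRatingDto dto : csvToBean) {   // CsvToBean is Iterable - rows are read lazily
            batch.add(dto);
            if (batch.size() == 100) {
                processBatch(batch);
                batch = new ArrayList<>(100);
            }
        }
        if (!batch.isEmpty()) {
            processBatch(batch);                // last, partially filled batch
        }
    }
}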

Also, this can be combined with many concurrency patterns, for example having dedicated threads to fetch data, worker threads to process it, and another set of threads to persist the results.

The easiest pattern is to make each unit of work independent: it is given what it should process (e.g. a set of ids to fetch from the DB), retrieves the data it needs, processes it in memory, and persists the results.
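As a minimal sketch of that last variant, the batches from above could be submitted to a fixed thread pool; the method name and the pool size of 4 are arbitrary examples, and each processBatch call should get its own transaction:

public void saveRatingsInParallel(MultipartFile file) throws IOException, InterruptedException {
    List<UserRatingDto> userRatingDtos = uploadRatings(file);

    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
        List<Callable<Void>> tasks = getBatches(userRatingDtos, 100).stream()
                .map(batch -> (Callable<Void>) () -> {
                    processBatch(batch);
                    return null;
                })
                .toList();
        // Blocks until every batch has been processed; inspect the returned
        // Futures if you need to propagate individual batch failures
        executor.invokeAll(tasks);
    } finally {
        executor.shutdown();
    }
}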

Tomaz Fernandes
  • I was able to implement this with some corrections. Thanks, it helped a lot! I will try to optimize it a bit more with executor service. – Elen Mouradian Jun 01 '22 at 17:02
  • I'm glad to hear that! Would you share by how much the performance was improved, if you have the metrics? Also, if ordering is not an issue, you sure can process the batches in parallel for even greater performance gains. – Tomaz Fernandes Jun 01 '22 at 18:03
  • Before it would take minutes just to store 1000-3000 entities, I can't tell you precise numbers, but the number of entities stored was multiplied by 10 if not more. I am currently working toward parallel processing! – Elen Mouradian Jun 02 '22 at 11:27
  • Nice, thanks for sharing! I've added some suggestions on concurrency handling - depending on your performance requirements they might be useful. – Tomaz Fernandes Jun 02 '22 at 15:11
  • I'm new to Java data access libraries. Usually I just write SQL. Also, embarrassing as it is to admit, new to transactions. I've never used them. In my defense, maybe this is because I tend to work with big data NoSQL databases and data pipelines. I'm just now getting into application development. How do transactions work with `saveAll` like this? Does the ORM library do a big SQL query under the hood that is actually comprised of many small transactions, that the database will apply individually? And the database is able to complete some, and tell you which failed in that batch? – Matt Welke Jun 03 '22 at 03:45
  • Not really - if a “save” fails, the entire batch should be rolled back. In this answer I didn't add transaction control, but probably should: each batch should have its own transaction. I'm not sure we can get a detailed enough exception to know which insertion failed, but there are a few mechanisms to handle the exception, some of which I described in the answer. – Tomaz Fernandes Jun 03 '22 at 03:53
4

Why not just use a temporary staging table like this (possibly using NOLOGGING and other optimisations, if available):

CREATE TEMPORARY TABLE load_book_rating (
  user_id BIGINT,
  book_isbn TEXT,
  rating TEXT
);

Then batch load the CSV data into that staging table, and then bulk insert all the data into the real table, like this:

INSERT INTO book_rating (user_id, book_id, book_rating)
SELECT l.user_id, b.id, l.book_rating
FROM load_book_rating AS l
JOIN book AS b ON l.book_isbn = b.isbn;
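For the "batch load" step, here's a rough plain-JDBC sketch, assuming an injected javax.sql.DataSource and the DTO getters from the question:

private static final String INSERT_STAGING =
        "INSERT INTO load_book_rating (user_id, book_isbn, rating) VALUES (?, ?, ?)";

public void loadStagingTable(List<UserRatingDto> ratings) throws SQLException {
    try (Connection connection = dataSource.getConnection();
         PreparedStatement statement = connection.prepareStatement(INSERT_STAGING)) {
        int count = 0;
        for (UserRatingDto dto : ratings) {
            // setObject keeps the sketch agnostic of the exact DTO field types
            statement.setObject(1, dto.getUserId());
            statement.setObject(2, dto.getBookISBN());
            statement.setObject(3, dto.getBookRating());
            statement.addBatch();
            if (++count % 1000 == 0) {
                statement.executeBatch();   // flush every 1000 rows
            }
        }
        statement.executeBatch();           // flush the remaining rows
    }
}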

I may have overlooked some details of your schema, but my main point here is that you're probably jumping through all these hoops only because of the ISBN natural key, which you're not using as the primary key of your BOOK table, so you have to perform a lookup?

Alternatively, use your RDBMS's native CSV import capabilities. Most of them can do this; see e.g. PostgreSQL's COPY command.
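Since you already receive the file in Java as a MultipartFile, COPY can also be driven from Java through the PostgreSQL JDBC driver's CopyManager, so no per-row INSERT statements are sent at all. A sketch under that assumption, reusing the staging table and ';' delimiter from above (HEADER true skips the CSV header line):

// Requires the PostgreSQL JDBC driver (org.postgresql.PGConnection, org.postgresql.copy.CopyManager)
public long copyCsvIntoStaging(Connection connection, Reader csvReader) throws SQLException, IOException {
    CopyManager copyManager = connection.unwrap(PGConnection.class).getCopyAPI();
    return copyManager.copyIn(
            "COPY load_book_rating (user_id, book_isbn, rating) " +
            "FROM STDIN WITH (FORMAT csv, DELIMITER ';', HEADER true)",
            csvReader);
}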

I'm pretty sure that a purely SQL-based approach will outperform any other approach that you may implement in Java.

Lukas Eder