
I have a request to export roughly a million records for a one-time migration. The records need to be exported in JSON format, adhering to the same API contract object structure we normally expose to this other team.

The object I need to expose is composed of data from two tables, and I have a conditional in place that limits the result set from several million records down to roughly 1 million.

The batching logic I have below does the initial fetch, then fetches the rest in batches of 1000 and adds them to a list that I eventually want to expose.

Service method below:

public List<NewEntity> getNewEntities() {
    int i=0;
    int BATCH_SIZE = 1000; 

    List<Entity> totalEntityList = new ArrayList<>();
    
    // initial fetch
    List<Entity> entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
        .setFirstResult(i)
        .setMaxResults(BATCH_SIZE)
        .getResultList();
    totalEntityList.addAll(entityList);

    while(entityList.size() == BATCH_SIZE) {
        i+=BATCH_SIZE;
        entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
            .setFirstResult(i)
            .setMaxResults(BATCH_SIZE)
            .getResultList();
        totalEntityList.addAll(entityList);
    }

    return totalEntityList.stream()
        .map(entity -> entity.toNewEntity())
        .collect(toList());
}

This process was taking 2.5 hours (with a batch size of 1000) and 3.5 hours (with a batch size of 2000). The first time I ran it, I had a bug where I didn't save all of them to the list, so I ran it again overnight. When I checked in the morning, it had taken longer and my IDE had frozen, so I don't know why the request eventually failed with a 500 error. I force-quit my IDE and am trying again now, but I won't know if it works for another 2 to 3 hours. This is very slow.

Is there a way to improve how I'm doing this? Any other approaches here?

I'm using Spring Boot, Java, JPA.

Thanks!

EDIT: UPDATE 2 after adjusting per recommendations from the comments:

public void getNewEntities() {
    int i=0;
    int BATCH_SIZE = 1000; 
    
    List<Entity> entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
        .setFirstResult(i)
        .setMaxResults(BATCH_SIZE).getResultList();

    try{
        FileWriter file = new FileWriter("new_entities.json");
        while(entityList.size() == BATCH_SIZE) {
            i+=BATCH_SIZE;
            entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                .setFirstResult(i)
                .setMaxResults(BATCH_SIZE)
                .getResultList();
            file.write(new Gson().toJson(entityList));
        }
        file.close();
    }
    catch(IOException ex) {
        ex.printStackTrace(); 
    };
}

Is this a better (correct?) approach? I'm letting it run right now and I'm not sure how long it will take to write 1,000,000 records to a JSON file, but let me know if this is on the right track. Currently, it's been two hours and it's only about 450,000 records through!

UPDATE 3: adding the Entity for visibility. I'm now assuming the issue could be the mapping of locations, where I convert them to a list at the end. Should I just keep it as a stream instead? I obviously have a lot more fields, but I removed the ones that are just simple column fields with no complexity to them.

@Builder
@Getter
@Entity
@AllArgsConstructor
@NoArgsConstructor
@Table(name = "TABLE_NAME", schema = "SCHEMA")
@NamedQuery(name = SELECT_NEW_ENTITIES, query = FIND_NEW_ENTITIES)
public class Entity { 

    public static final String SELECT_NEW_ENTITIES = "SELECT_NEW_ENTITIES";

    protected static final String FIND_NEW_ENTITIES = "" + 
        "SELECT a FROM Entity a " + 
        "WHERE a.code IS NOT NULL " + 
        "AND a.someId IS NOT NULL " +
        "ORDER BY a.id ";
    
    @Id 
    @Column(name = "NEW_ENTITY_ID") 
    private Long id;

    @Column(name = "SHARED_ID") 
    private Long sharedId;

    @OneToMany(cascade = CascadeType.ALL) 
    @JoinColumn(name = "NEW_ENTITY_ID")
    @Builder.Default 
    private List<Location> locations = new ArrayList<>();
    
    @Formula(value = "(SELECT eng.SOME_ID from SCHEMA.ANOTHER_TABLE eng WHERE eng.SHARED_ID = SHARED_ID)") 
    private String someId;
    
    @Formula(value = "(SELECT eng.CODE from SCHEMA.ANOTHER_TABLE eng WHERE eng.SHARED_ID = SHARED_ID)") 
    private String code;

    public NewEntity toNewEntity() {
        return NewEntity.builder()
            .newEntityId(this.getId())
            .code(code)
            .locations(mapLocations())
            .build();
    }

    private List<LocationModel> mapLocations() {
        return this.getLocations()
            .stream()
            .map(location -> LocationModel.builder()
                .id(location.getId())
                .newEntityId(location.getNewEntityId())
                .city(location.getCity())
                .state(location.getState())
                .country(location.getCountry())
                .build())
            .collect(toList());
    }
}
Euridice01
  • Nice try with the batching but you are doing it wrong, as you are still loading everything in memory. Instead do this and directly write to the json file instead of storing them in memory. Process chunks, not all of them at once. – M. Deinum Jun 19 '23 at 12:40
  • For queries this large, you should use a streaming API. I'm not sure if JPA has that feature, but something like jOOQ certainly does. But at any rate, you don't want to have all those objects in memory at the same time. Open the output stream and write each batch to it as it comes in. – Jorn Jun 19 '23 at 12:40
  • *"Is there a way to improve how I'm doing this?"* - don't build a list of a million `Entity` objects. – Stephen C Jun 19 '23 at 12:41
  • Is there anything against using Oracle's own exp tool? – BigMike Jun 19 '23 at 12:48
  • I am removing the Oracle tag because, apart from using it in the background, there is nothing obviously related to Oracle in the question. If Oracle is actually relevant to the question then please [edit] the question an include a [MRE] with: the `CREATE TABLE` and `CREATE INDEX` statements for your table(s)/index(es); the `SELECT` statement you are generating; an example of the data you are passing to Oracle; the expected output from the query; and the `EXPLAIN PLAN` for the query. – MT0 Jun 19 '23 at 12:54
  • Ok got it, I'll batch it and then write directly to file for every batch then instead of collecting it in a giant list in memory. For oracle's own export tool, I need the "response object" formatted differently than how the records are presented in the db and also I need to join two tables. I can't use the export tool. – Euridice01 Jun 19 '23 at 12:54
  • I would've gone with a simple PL/SQL block, a cursor fetch loop writing JSON or whatever format you need on a file. A million records may seem daunting but can be managed even with some simple sql*plus script (yes, even with the json output). But I guess an Oracle take on this isn't required anymore. – BigMike Jun 19 '23 at 12:58
  • You can follow this post blog ["Streaming Large JSON Response In Spring"](https://dev.to/ratulsharker/streaming-large-json-response-in-spring-2pho) – Ratul Sharker Jun 19 '23 at 13:15
  • Hey everyone, is this the right approach? I updated my post to reflect me writing to JSON file instead. I'm running it now and I don't know how long it will take but please double check my code and see if it makes sense or I'm still missing something core. Thanks! – Euridice01 Jun 19 '23 at 15:00
  • I would do at least 2 things: measure what takes more time, query+unmarshalling or writing to file. I guess query+unmarshalling takes more time. Depends on the "size" and structure of `Entity`. A bunch of fields won't take very long, even 1M records. Also consider that the query has no (obvious) order, it may return the same record again and again. – PeterMmm Jun 19 '23 at 18:31
  • Why would it return the same record again? @PeterMmm should I order it? – Euridice01 Jun 19 '23 at 20:39
  • I said "may", I won't bet that the "natural order" is always the same in separate `selects` https://dev.to/peledzohar/the-natural-order-misconception-2ekg – PeterMmm Jun 20 '23 at 11:40
  • Do you really need to write all of your data(I mean all columns, along with those expensive formulae and all)? – Asgar Jun 22 '23 at 16:06
  • Maybe not all the columns for the select * in FIND_NEW_ENTITIES but I need the selects from the other fields – Euridice01 Jun 22 '23 at 17:16
  • Your entity does a lot of additional selects, which is going to be slow. You basically have a 1+N select problem (or rather 1 + (2*N) here) due to your formulas. Your conversion is also a bit slow, creating a stream adds overhead. Instead of using JPA and conversion, I would suggest to write a dedicated SQL query and use a `JdbcTemplate` and extract/write results. This will eliminate much of the overhead you currently have. (A sketch of this idea appears right after this comment thread.) – M. Deinum Jun 23 '23 at 07:00
  • Can you share an example of what you mean by JDBCTemplate and use dedicated sql query? Thanks for the suggestion. – Euridice01 Jun 23 '23 at 14:14
  • If you really want fast processing: use raw JDBC and generate JSON manually. You can't beat that. – Olivier Jun 26 '23 at 07:18
  • See [here](https://www.tutorialspoint.com/jdbc/jdbc-select-records.htm) for a basic tutorial on JDBC. – Olivier Jun 27 '23 at 07:32
  • Write a single SQL query that will fetch all the data you need. For better performance, increase the fetch size to 500 or 1000 (as explained [here](https://stackoverflow.com/a/24815708/12763954)). – Olivier Jun 27 '23 at 07:34
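
A minimal sketch of the JdbcTemplate idea from the comments above (not taken from the thread; the file name and the column list are assumptions based on the Entity shown in UPDATE 3). It runs one dedicated SQL query with a large fetch size and streams each row straight into the JSON file with Gson's JsonWriter, so nothing is accumulated in memory. The locations would need a second query or a join and are left out here:

// Sketch only. Uses org.springframework.jdbc.core.JdbcTemplate / RowCallbackHandler
// and com.google.gson.stream.JsonWriter; "jdbcTemplate" is assumed to be injected.
public void exportToJson() throws IOException {
    jdbcTemplate.setFetchSize(1000); // let the driver stream rows in chunks

    String sql = "SELECT t.NEW_ENTITY_ID, a.SOME_ID, a.CODE "
               + "FROM SCHEMA.TABLE_NAME t "
               + "JOIN SCHEMA.ANOTHER_TABLE a ON a.SHARED_ID = t.SHARED_ID "
               + "WHERE a.CODE IS NOT NULL AND a.SOME_ID IS NOT NULL "
               + "ORDER BY t.NEW_ENTITY_ID";

    try (JsonWriter json = new JsonWriter(Files.newBufferedWriter(Path.of("new_entities.json")))) {
        json.beginArray();
        RowCallbackHandler handler = rs -> {
            try {
                // one row at a time, written directly to the file
                json.beginObject();
                json.name("newEntityId").value(rs.getLong("NEW_ENTITY_ID"));
                json.name("someId").value(rs.getString("SOME_ID"));
                json.name("code").value(rs.getString("CODE"));
                json.endObject();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        jdbcTemplate.query(sql, handler);
        json.endArray();
    }
}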

4 Answers


The way I see this, you are fetching a million records as entities. After the query is executed, the result set is mapped onto the Entity class. In your case, two @Formula subselects are executed for each record, and the list of locations is populated as well. I suggest fetching the data directly via native queries.

I am writing pseudo code:

@Transactional(readOnly=true)
public void writeMyEntitiesToJsonFile() {
        Gson gson = new Gson();
        try {

            FileWriter file = new FileWriter("new_entities.json");

            //this query will fetch all the records from your table name
            String rawQuery = "SELECT \n"
                    + "t.NEW_ENTITY_ID AS id,\n"
                    + "t.SHARED_ID AS sharedId,\n"
                    + "a.SOME_ID AS someId,\n"
                    + "a.CODE AS code\n"
                    + "FROM TABLE_NAME t\n"
                    + "JOIN ANOTHER_TABLE a \n"
                    + "ON t.SHARED_ID=a.SHARED_ID";

            List<Object[]> newEntityRawList = em.createNativeQuery(rawQuery)
                    .getResultList();

            //mapping all those Object[] into a DTO> EntityDTO
            List<EntityDTO> newEntityDTOList = newEntityRawList
                    .stream()
                    .map(obj -> new EntityDTO((Number) obj[0], (Number) obj[1], (String) obj[2], (String) obj[3]))
                    .collect(Collectors.toList());

            //this query will fetch all the location from your table
            String locationQuery = "SELECT \n"
                    + "l.newEntityId,\n"
                    + "l.id,\n"
                    + "l.city,\n"
                    + "l.state,\n"
                    + "l.country \n"
                    + "FROM location l";
            List<Object[]> locationRawList = em.createNativeQuery(locationQuery)
                    .getResultList();

            //mapping all those Object[] into a DTO> LocationDTO
            List<LocationDTO> locationDTOList = locationRawList
                    .stream()
                    .map(ob -> new LocationDTO((Number) ob[0], (Number) ob[1], (String) ob[2], (String) ob[3], (String) ob[4]))
                    .collect(Collectors.toList());

            //Using lambda to get a map with key> new entity id. and values> List of Location 
            Map<Long, List<LocationDTO>> mapOfNewEntityIdAndLocationList = locationDTOList
                    .stream()
                    .collect(Collectors.groupingBy(LocationDTO::getNewEntityId));

            //setting The list of locations to EntityDTO
            for (EntityDTO ne : newEntityDTOList) {
                ne.setLocations(mapOfNewEntityIdAndLocationList.get(ne.getId()));
            }
            
            //writing data to file
            file.write(gson.toJson(newEntityDTOList));
        } catch (IOException io) {
            io.printStackTrace();
        }
    }

    @Getter
    public class EntityDTO {

        private final long id;
        private final long sharedId;
        private final String someId;
        private final String code;

        @Setter
        private List<LocationDTO> locations;

        public EntityDTO(Number id,
                Number sharedId,
                String someId,
                String code) {
            this.id = id.longValue();
            this.sharedId = sharedId.longValue();
            this.someId = someId;
            this.code = code;
        }
    }

    @Getter
    public class LocationDTO {

        private final long id;
        private final long newEntityId;
        private final String city;
        private final String state;
        private final String country;

        public LocationDTO(Number id,
                Number newEntityId,
                String city,
                String state,
                String country) {
            this.id = id.longValue();
            this.newEntityId = newEntityId.longValue();
            this.city = city;
            this.state = state;
            this.country = country;
        }

    }
    

Instead of using the constructors, you can also use SqlResultSetMapping.
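
For reference, a rough sketch of what that could look like, placed on any mapped entity class; the mapping name is made up, and the column names must match the aliases used in the native query above:

// Sketch only; @SqlResultSetMapping, @ConstructorResult and @ColumnResult are standard JPA.
@SqlResultSetMapping(
    name = "EntityDTOMapping",                       // made-up name
    classes = @ConstructorResult(
        targetClass = EntityDTO.class,
        columns = {
            @ColumnResult(name = "id", type = Long.class),
            @ColumnResult(name = "sharedId", type = Long.class),
            @ColumnResult(name = "someId", type = String.class),
            @ColumnResult(name = "code", type = String.class)
        }))
@Entity
public class Entity { /* ... */ }

The DTO then needs a constructor matching those column types (e.g. (Long, Long, String, String)), and the manual Object[] mapping goes away:

List<EntityDTO> newEntityDTOList =
        em.createNativeQuery(rawQuery, "EntityDTOMapping").getResultList();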

Of course, this will still be an expensive operation, and you can take the batching from the other answers in this thread.

This approach should eliminate the additional expensive queries.

I did not run this code, as I don't have any data to work with.

P.S. Please do try the batching too. The maxResults for the first native query is easy; fetching the corresponding data with the second query will be a little trickier, as you will need a WHERE NEW_ENTITY_ID BETWEEN the min and max ids obtained from the first result (this should not be very hard, since you know how the data sit in your database tables).
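
A hedged sketch of that batching idea, reusing the queries and DTOs from above and ordering the first query by id so the min/max of each page are simply its first and last rows (named parameters in native queries are a Hibernate extension):

// Sketch only; "em", "rawQuery" and "locationQuery" are the ones defined above.
int offset = 0;
int batchSize = 1000;
List<Object[]> page;
do {
    page = em.createNativeQuery(rawQuery + " ORDER BY t.NEW_ENTITY_ID")
            .setFirstResult(offset)
            .setMaxResults(batchSize)
            .getResultList();
    if (page.isEmpty()) {
        break;
    }

    // the page is ordered by id, so min and max are the first and last rows
    long minId = ((Number) page.get(0)[0]).longValue();
    long maxId = ((Number) page.get(page.size() - 1)[0]).longValue();

    List<Object[]> locationPage = em
            .createNativeQuery(locationQuery + " WHERE l.newEntityId BETWEEN :minId AND :maxId")
            .setParameter("minId", minId)
            .setParameter("maxId", maxId)
            .getResultList();

    // map page + locationPage to DTOs as above and append them to the JSON file

    offset += page.size();
} while (page.size() == batchSize);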

Good luck!

Asgar
  • Wouldn't this be bringing too many objects into memory, though, because of the list implementation? Also, I need to double check something, but I think when I tried the raw query above a few days ago, it mentioned something about needing a TypedQuery (that was because I was referring to different entities). Is that something I can run here? Thanks! – Euridice01 Jun 23 '23 at 23:15
  • All you need is to export your data. As per your question about bringing all data into memory, I have mentioned that you should also try batching. One way or another, you will be loading all data into memory. My solution will help you out by removing the extra burden of the query. You are facing a (2N+1) or even worse problem. – Asgar Jun 24 '23 at 02:31

You could parallelize your batching. Since you need to export 1,000,000 records, you could create e.g. 4 threads, each starting its data extraction from its own offset and writing the data into its own temporary file.

After batching is over, you merge the four temp files into the resulting one.

If one of the threads fails for some reason, you don't need to restart the whole process from the very beginning; you can reuse the results already stored in the temp files.

Also, in your code you create a new, expensive Gson object for each chunk of 1000 records, i.e. about 1000 objects for the whole export. Instead, a single instance cached in a field or injected by the IoC container can be used.

I suggest modifying your code as follows:

public class BatchExtractor {
  private static final int BATCH_SIZE = 1000;

  private EntityManager entityManager;
  private Gson gson = new Gson();

  @SneakyThrows
  public void getNewEntities() {
    int limit = 250_000;
    int parallelism = 4;
    try (ExecutorService executorService = Executors.newFixedThreadPool(parallelism)) {
      List<Future<File>> futures = new ArrayList<>();
      for (int i = 0; i < parallelism; i++) {
        int offset = i * limit;
        Future<File> future = executorService.submit(() -> extract(offset, limit));
        futures.add(future);
      }
      File resultingFile = new File("new_entities.json");
      resultingFile.createNewFile();
      try (FileOutputStream fos = new FileOutputStream(resultingFile)) {
        futures.stream().map(this::getFile).forEach(tempFile -> transfer(tempFile, fos));
        fos.flush();
      }
    }
  }

  @SneakyThrows
  private File extract(int offset, int limit) {
    String fileName = "new_entities_temp_" + offset + ".json";
    File file = new File(fileName);
    file.createNewFile();
    int extractedCount = 0;
    try (FileWriter tempFile = new FileWriter(file)) {
      while (extractedCount < limit) {
        List<Entity> entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                .setFirstResult(offset)
                .setMaxResults(BATCH_SIZE)
                .getResultList();
        if (entityList.isEmpty()) {
          break; // nothing left in this worker's range
        }
        tempFile.write(gson.toJson(entityList));
        extractedCount += entityList.size();
        offset += entityList.size(); // advance the offset only after the fetch
      }
    }
    return file;
  }

  @SneakyThrows
  private long transfer(File tempFile, FileOutputStream fos) {
    return Files.copy(tempFile.toPath(), fos);
  }

  @SneakyThrows
  private File getFile(Future<File> fileFuture) {
    return fileFuture.get();
  }
}

UPD You could also combine this approach with the streaming suggested in another answer.

Sergey Tsypanov

Use a stream instead of batching, and detach the objects so as not to pollute the memory with objects you don't need.

Create the Gson instance once and re-use it, as it is an expensive object.

In the code below I make a couple of assumptions:

  • Spring Boot 3.x
  • Java 17 or higher
@Transactional(readOnly=true)
public void getNewEntities() {
    var gson = new Gson();

    try (var writer = Files.newBufferedWriter(Path.of("new_entities.json"))) {
      var entities = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
      .setHint(QueryHints.HINT_FETCH_SIZE, "1000")
      .setHint(QueryHints.READ_ONLY, "true")
      .getResultStream();

       entities.forEach(entity -> writeToFile(gson, writer, entity));
    }
    catch(IOException ex) {
        throw new IllegalStateException(ex);
    }
}

private void writeToFile(Gson gson, Writer writer, Entity entity) {
  try {
    writer.write(gson.toJson(entity.toNewEntity()));
  } catch (Exception ex) {
    throw new IllegalStateException(ex);
  }
  entityManager.detach(entity);
}

M. Deinum
  • Hey, I'm trying your approach and while it's probably less likely (if at all) to eat up memory locally, it is still taking a long time. It's been nearly an hour so far and it's only written ~80 MB to the file. I expect this file be closer to 1GB. Is there anyway to improve performance here? Should I share my entity to see if it's an issue with how I'm setting up some fields there or something else? Thanks! – Euridice01 Jun 22 '23 at 13:52
  • What would be interesting is your `Entity` and your `toNewEntity`. Ideally you would directly return `NewEntity` from the query and write a specific query that returns the data you need. Depending on your `Entity` it might even lead to excessive selects, and make sure that you have an `@Transactional(readOnly=true` on your `getNewEntities` method as well. – M. Deinum Jun 22 '23 at 14:00
  • In additional to that, make sure that you don't run into network issues/latency and that you are actually writing to a local file (and aren't recreating `Gson` each time you need it!). – M. Deinum Jun 22 '23 at 14:01
  • Hey thanks for writing back. My `toNewEntity` returns the domain object I map the entity to. so returns `NewEnity.builder().(fields mapped).build()`. I'll share it in a bit. One more is I have the `@Transactional(readOnly=true)` on the controller method instead of the service call. Is it better to move it directly to the service call instead of the controller method where I make the call to test it? Thanks! – Euridice01 Jun 22 '23 at 14:42
  • I updated my post to show the `Entity` and `NewEntity`. I think it's because I map locations to list right? I should keep it as a stream or? – Euridice01 Jun 22 '23 at 15:31
  • Regarding the transaction, do you really want to have the transaction be part of your web handling? Your service layer should be the transactional/business boundary not the web, – M. Deinum Jun 23 '23 at 07:01
  • do you really want to have the transaction be part of your web handling? >> What do you mean by this? I'm directly calling the service layer endpoint locally via Postman to test this feature implementation. How else would I trigger/invoke this action? Unless you mean it's better to create a separate console program and invoke this call on main when I spin it up? – Euridice01 Jun 23 '23 at 14:15
  • If something during marshalling of your response fails it will impact the transaction. Is that **really** what you want? Do you want your transaction to be impacted by something that happens on the web? Because that is basically what you have done now. Regarding the export, if you only need it to export, just write a main with an `ApplicationRunner` and kick off the process (a minimal sketch follows these comments). Why the clutter of a controller in the first place? – M. Deinum Jun 23 '23 at 14:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254219/discussion-between-euridice01-and-m-deinum). – Euridice01 Jun 23 '23 at 14:25
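
A minimal sketch of the ApplicationRunner idea from the comment above; ExportService and writeMyEntitiesToJsonFile() are placeholders for whichever export method is used:

// Sketch only: runs the export once on startup, no controller or Postman call needed.
@SpringBootApplication
public class ExportApplication {

    public static void main(String[] args) {
        SpringApplication.run(ExportApplication.class, args);
    }

    @Bean
    ApplicationRunner exportRunner(ExportService exportService) {
        // executed after the application context is ready
        return args -> exportService.writeMyEntitiesToJsonFile();
    }
}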

In case you suffer from the Entity cache blowing up and processing speed decreasing over time, you might consider using the EntityManager.clear() method. (Creating lots of Entities is one of the best ways to bring down your system. Use a loop and call the clear method after every couple of hundred entities to get rid of the Entities you do not need anymore.) See also here
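
A rough sketch of what that could look like inside a paging loop (names mirror the question's code; the batch size is illustrative):

// Sketch only: page through the named query and clear the persistence context
// after each page so the managed Entity instances can be garbage collected.
int offset = 0;
final int BATCH_SIZE = 500;
List<Entity> page;
do {
    page = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
            .setFirstResult(offset)
            .setMaxResults(BATCH_SIZE)
            .getResultList();

    // ... convert this page and write it to the JSON file ...

    entityManager.clear(); // drop the now-unneeded entities from the first-level cache
    offset += page.size();
} while (page.size() == BATCH_SIZE);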

jausen brett