I have a request to export roughly a million records for a one-time migration. The records need to be exported as JSON, adhering to the same API contract object structure we normally expose to another team.
The object I need to expose is composed of data from two tables, and I have a condition in place that limits the result set from several million rows down to about 1 million.
The batching logic below does an initial fetch, then pages through the results in groups of 1,000 and adds them to a list that I eventually want to expose.
Service method below:
public List<NewEntity> getNewEntities() {
    int i = 0;
    int BATCH_SIZE = 1000;
    List<Entity> totalEntityList = new ArrayList<>();
    List<Entity> entityList;
    do {
        // offset-based pagination: fetch the next page of at most BATCH_SIZE rows
        entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                .setFirstResult(i)
                .setMaxResults(BATCH_SIZE)
                .getResultList();
        totalEntityList.addAll(entityList); // the first page is included as well
        i += BATCH_SIZE;
    } while (entityList.size() == BATCH_SIZE); // a short page means we've reached the end
    return totalEntityList.stream()
            .map(Entity::toNewEntity)
            .collect(toList());
}
This process was taking 2.5 hours (with a batch size of 1,000) and 3.5 hours (with a batch size of 2,000). The first time I ran it, I had a bug where I didn't save all of the records to the list, so I ran it again overnight. When I checked in the morning it had taken even longer, my IDE had frozen, and the request had eventually failed with a 500 error, so I don't know why. I force-quit my IDE and am trying again now, but I won't know whether it works for another 2 to 3 hours. This is very slow.
Is there a way to improve how I'm doing this? Any other approaches here?
I'm using Spring Boot, Java, JPA.
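For context, one thing I was wondering about (not sure if it's the right fix) is whether the persistence context holding on to every fetched entity is part of the slowdown, i.e. whether I should be clearing the EntityManager between batches. A rough sketch of what I mean, with a made-up method name:

public List<NewEntity> getNewEntitiesWithClear() {
    int i = 0;
    int BATCH_SIZE = 1000;
    List<NewEntity> result = new ArrayList<>();
    List<Entity> entityList;
    do {
        entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                .setFirstResult(i)
                .setMaxResults(BATCH_SIZE)
                .getResultList();
        for (Entity entity : entityList) {
            result.add(entity.toNewEntity()); // map while the entity is still managed so locations can load
        }
        entityManager.clear(); // detach the batch so the first-level cache doesn't keep growing
        i += BATCH_SIZE;
    } while (entityList.size() == BATCH_SIZE);
    return result;
}

I haven't measured whether this actually helps, so any guidance there is welcome too.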
Thanks!
EDIT: UPDATE 2, after adjusting per the recommendations in the comments:
public void getNewEntities() {
    int i = 0;
    int BATCH_SIZE = 1000;
    Gson gson = new Gson();
    // try-with-resources so the writer is closed even if a write fails
    try (FileWriter file = new FileWriter("new_entities.json")) {
        List<Entity> entityList;
        do {
            entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                    .setFirstResult(i)
                    .setMaxResults(BATCH_SIZE)
                    .getResultList();
            // NOTE: this appends each batch as its own JSON array, so the file ends up
            // being a sequence of arrays rather than one single array
            file.write(gson.toJson(entityList));
            i += BATCH_SIZE;
        } while (entityList.size() == BATCH_SIZE);
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
Is this a better (correct?) approach? I'm letting it run right now and I'm not sure how long it will take to write 1,000,000 records to a JSON file, but let me know if this is on the right track. It has currently been running for two hours and is only about 450,000 records in!
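One thing I'm also not sure about is that the file ends up containing one JSON array per batch rather than a single array. If that matters for the contract, I was thinking something like Gson's streaming JsonWriter could keep it as one array; this is just a sketch (method name made up, not verified):

public void exportNewEntitiesAsSingleArray() {
    int i = 0;
    int BATCH_SIZE = 1000;
    Gson gson = new Gson();
    // com.google.gson.stream.JsonWriter streams one top-level array across all batches
    try (JsonWriter writer = new JsonWriter(new FileWriter("new_entities.json"))) {
        writer.beginArray();
        List<Entity> entityList;
        do {
            entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                    .setFirstResult(i)
                    .setMaxResults(BATCH_SIZE)
                    .getResultList();
            for (Entity entity : entityList) {
                // write each record's NewEntity view so the structure matches the API contract
                gson.toJson(entity.toNewEntity(), NewEntity.class, writer);
            }
            i += BATCH_SIZE;
        } while (entityList.size() == BATCH_SIZE);
        writer.endArray();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}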
UPDATE 3: adding the Entity for visibility. I'm now assuming the issue could be the mapping of locations, where I convert the stream to a list at the end. Should I just keep it as a stream instead (see the sketch after the class below)? I obviously have a lot more fields, but I removed the ones that are just simple columns with no complexity to them.
@Builder
@Getter
@Entity
@AllArgsConstructor
@NoArgsConstructor
@Table(name = "TABLE_NAME", schema = "SCHEMA")
@NamedQuery(name = SELECT_NEW_ENTITIES, query = FIND_NEW_ENTITIES)
public class Entity {
public static final String SELECT_NEW_ENTITIES = "SELECT_NEW_ENTITIES";
protected static final String FIND_NEW_ENTITIES = "" +
"SELECT a FROM Entity a " +
"WHERE a.code IS NOT NULL " +
"AND a.someId IS NOT NULL " +
"ORDER BY a.id ";
@Id
@Column(name = "NEW_ENTITY_ID")
private Long id;
@Column(name = "SHARED_ID")
private Long sharedId;
@OneToMany(cascade = CascadeType.ALL)
@JoinColumn(name = "NEW_ENTITY_ID")
@Builder.Default
private List<Location> locations = new ArrayList<>();
@Formula(value = "(SELECT eng.SOME_ID from SCHEMA.ANOTHER_TABLE eng WHERE eng.SHARED_ID = SHARED_ID)")
private String someId;
@Formula(value = "(SELECT eng.CODE from SCHEMA.ANOTHER_TABLE eng WHERE eng.SHARED_ID = SHARED_ID)")
private String code;
public NewEntity toNewEntity() {
return NewEntity
.builder()
.newEntityId(this.getId())
.code(code)
.locations(mapLocations())
.build();
}
private List<LocationModel> mapLocations() {
return this.getLocations()
.stream()
.map(location -> LocationModel.builder()
.id(location.getId())
.newEntityId(location.getNewEntityId())
.city(location.getCity())
.state(location.getState())
.country(location.getCountry())
.build())
.collect(toList());
}
}
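For reference, this is roughly what I meant above by keeping the locations as a stream instead of collecting them to a list; just a sketch with a made-up method name, and I haven't measured whether it makes any difference:

private Stream<LocationModel> mapLocationsAsStream() {
    // same mapping as mapLocations(), but the caller decides when and how to consume the stream
    return this.getLocations()
            .stream()
            .map(location -> LocationModel.builder()
                    .id(location.getId())
                    .newEntityId(location.getNewEntityId())
                    .city(location.getCity())
                    .state(location.getState())
                    .country(location.getCountry())
                    .build());
}

Though I guess the NewEntity builder would still need a List for locations, so the stream would have to be collected somewhere eventually anyway.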