I have a CSV file which has altogether 100,000 records. I am iterating over the records and trying to update 5 tables for each record. Here is some sample data:

EAN Code,Site,Genric Material,Material,Sap Ean Code,Style,Color,Size,MRP,Gender,EAN Code,Season,Collection,BRAND,Color revision,Category (L5),Category (L6)
123456789,6001,000000000061000102,000000061000102001,61000102001,03/BE100,SC/TG,L/112 cm,850.00,MENS,123456789,AW12,Colors,XXXXXX,RD/TG,Tee Shirt,Graphic

The five tables that get updated in each iteration are as follows:

  1. Master
  2. MasterDescription
  3. Attributes
  4. AttributeValues
  5. AssociationTable

The relationships between the above-mentioned tables are as follows (a simplified JPA sketch of these mappings is included after the list):

Master M-M AttributeValues

Master M-1 MasterDescription

Master M-M Attributes

Attributes 1-M AttributeValues
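
For reference, here is a simplified sketch of how these relationships can be mapped with JPA annotations. The entity and field names follow the code below; the real mappings have more fields and may differ in detail:

import javax.persistence.*;
import java.util.HashSet;
import java.util.Set;

@Entity
class EanMasterDiscription {                                        // MasterDescription table
    @Id @GeneratedValue Long id;
}

@Entity
class EanAttributes {                                               // Attributes table
    @Id @GeneratedValue Long id;
    @ManyToMany(mappedBy = "eanAttributes")                         // inverse side of Master M-M Attributes
    Set<Ean> ean = new HashSet<>();
    @OneToMany(mappedBy = "eanAttributes")                          // Attributes 1-M AttributeValues
    Set<EanAttributeValues> values = new HashSet<>();
}

@Entity
class EanAttributeValues {                                          // AttributeValues table
    @Id @GeneratedValue Long id;
    @ManyToOne EanAttributes eanAttributes;                         // many values belong to one attribute
    @ManyToMany(mappedBy = "eanAttributeValues")                    // inverse side of Master M-M AttributeValues
    Set<Ean> ean = new HashSet<>();
}

@Entity
class Ean {                                                         // Master table
    @Id @GeneratedValue Long id;
    @ManyToOne EanMasterDiscription eanMasterDesciption;            // Master M-1 MasterDescription
    @ManyToMany Set<EanAttributes> eanAttributes = new HashSet<>(); // Master M-M Attributes
    @ManyToMany                                                     // Master M-M AttributeValues
    @JoinTable(name = "AssociationTable")                           // AssociationTable as the join table
    Set<EanAttributeValues> eanAttributeValues = new HashSet<>();
}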

Here is the code that I have to save the CSV data into the 5 tables in a single session, using a batching technique:

Service class

@Service
public class EanService{

@Autowired
public EanRepository eanrepository;

@Autowired
public UserRepository userRepository; // needed below to look up the uploading user

// Method that saves data from CSV to DataBase
@Transactional
public void saveEANMasterData1(BufferedReader br, String userName,
        List<EanAttributes> attributes, String eanMasterName,String description) {
    int i =1;

    EanMasterDiscription eanDes = new EanMasterDiscription();
    User user = userRepository.findUserByUsername(userName);
    EanMasterDiscription deciption = null;
    eanDes.setDescription(description);
    eanDes.setMasterName(eanMasterName);
    eanDes.setDate(new Timestamp(Calendar.getInstance()
            .getTimeInMillis()));
    String line;
    try {
        List<Ean> eans = new ArrayList<Ean>();
        // iterating over each record in the CSV and saving the data into DB            
        while (((line = br.readLine()) != null)) {
             String[] cols = line.split(",");
             // Style Keeping Unit
             Ean ean = new Ean();
             for(EanAttributes attr : attributes){
                 EanAttributeValues eanAttributeValues = new EanAttributeValues();
                 if(attr.getAttrInferredType().equalsIgnoreCase("EAN")){
                         ean.setEAN(cols[attr.getAttributeOrder()]);
                 }else if(attr.getAttrInferredType().equalsIgnoreCase("Season")){
                     ean.setSeason(cols[attr.getAttributeOrder()]);
                 }else {
                     if(attr.getAttrInferredType().equalsIgnoreCase("Attribute")){
                         EanAttributes eanAttr = eanrepository.loadAttrsListByAttName(attr.getAttributeName());
                         if(eanAttr == null){
                             eanAttributeValues.setAttributeValue(cols[attr.getAttributeOrder()]);
                             eanAttributeValues.setEanAttributes(attr);
                             ean.getEanAttributeValues().add(eanAttributeValues);
                             ean.getEanAttributes().add(attr);
                             attr.getEan().add(ean);
                         }else{
                             ean.getEanAttributes().add(eanAttr);
                             eanAttr.getEan().add(ean);
                             if(eanrepository.isAttributeValueAvailable(cols[attr.getAttributeOrder()])){
                                 eanAttributeValues.setAttributeValue(cols[attr.getAttributeOrder()]);
                                 eanAttributeValues.setEanAttributes(eanAttr);
                                 ean.getEanAttributeValues().add(eanAttributeValues);
                             }else{
                                 EanAttributeValues values = eanrepository.loadDataByAttrValue(cols[attr.getAttributeOrder()]);
                                 ean.getEanAttributeValues().add(values);
                                 values.getEan().add(ean);
                             }
                         }
                         eanAttributeValues.getEan().add(ean);
                     }
                 }
             }
             if(!eanrepository.isEanMasterNameAvailable(eanMasterName)){
                EanMasterDiscription eanMasterDes = eanrepository.loadDataByMasterName(eanMasterName);
                 ean.setEanMasterDesciption(eanMasterDes);
             }else{
                 ean.setEanMasterDesciption(eanDes);
             }
             ean.setUser(user);
             if(eanrepository.isEanWithSeasonAvailable(ean.getEAN(),ean.getSeason())){
                     // Persisting Ean; I think there is some problem with this method
                     eanrepository.saveEanData(ean,i);
             }else{
                 System.out.println("************ EAN ALREADY EXIST ******************** ");
             }

             i++;
        }
    } catch (NumberFormatException | IOException e) {
        e.printStackTrace();
    }       
    }
}

Repository class

@Repository
public class EanRepository{

@PersistenceContext
EntityManager em;

public void saveEanData(Ean ean , int recordNum){
    em.merge(ean);
    if(recordNum % 50 == 0){
        em.flush();
        em.clear();
        // em.getEntityManagerFactory().getCache().evictAll();
    }
}

}

But this is taking too much time (nearly 10 hours) to finish saving all 100,000 records. How can I reduce the time, and what am I missing?

  • Possible duplicate of [How to persist a lot of entities (JPA)](http://stackoverflow.com/questions/5649345/how-to-persist-a-lot-of-entities-jpa) – eis Feb 19 '17 at 09:53
  • or [Batch inserts using JPA EntityManager](http://stackoverflow.com/a/31020939/365237) – eis Feb 19 '17 at 09:54

3 Answers


I was having the same problems in my batch application, and we incorporated two techniques which vastly sped up the process of importing the data:

1) Multithreading - You have to take advantage of multiple threads processing your file data and doing the saving.

The way we did it was to first read all the data from the file and pack it into a Set of POJO objects.

Then, based on the number of threads that we could create, we would split the Set evenly and feed each thread a certain range of the data.

Then each set would be processed in parallel.

I am not going to get into the details, as this is outside the boundaries of this question. One tip I can give is that you should take advantage of java.util.concurrent and the features it offers; a rough sketch follows below.
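
As a minimal sketch of that idea: the EanRow and EanChunkSaver types here are made up for illustration, and each chunk is expected to be saved through its own transactional service call:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelImport {

    // Hypothetical POJO holding one parsed CSV row, and a saver that persists one chunk
    // in its own transaction (e.g. a @Transactional service method).
    public static class EanRow { /* parsed CSV columns */ }

    public interface EanChunkSaver {
        void saveChunk(List<EanRow> chunk);
    }

    // Splits the parsed rows into evenly sized chunks and saves each chunk on its own thread.
    public static void importInParallel(List<EanRow> rows, EanChunkSaver saver, int threadCount)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        int chunkSize = (rows.size() + threadCount - 1) / threadCount;

        for (int start = 0; start < rows.size(); start += chunkSize) {
            List<EanRow> chunk = new ArrayList<>(
                    rows.subList(start, Math.min(start + chunkSize, rows.size())));
            pool.submit(() -> saver.saveChunk(chunk));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}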

2) Batch Saving - The second improvement we made was to take advantage of the batch save feature of Hibernate (you have added the Hibernate tag, so I assume it is your underlying persistence provider):

You can try and take advantage of the bulk insert feature.

There is a Hibernate property which you can define to enable this feature:

<property name="jdbc.batch_size">250</property>

With this batch setting you should have output like:

insert into Table (id, name) values (1, 'na1'), (2, 'na2'), (3, 'na3'), ...

instead of

insert into Table (id, name) values (1, 'na1');
insert into Table (id, name) values (2, 'na2');
insert into Table (id, name) values (3, 'na3');

3) Flush count - You have your count set to 50 before you flush to the DB. Now, with the batch inserts enabled, maybe you could raise it up a bit to a few hundred. Try to experiment with this number to find the sweet spot.

– Maciej Kowalski

There may be many reasons why loading the data takes time (it is hard to tell from the code alone), and you should optimize in smaller data chunks.

So I'd just shoot blind from my experience:

  • if possible, use persist() instead of merge(); merge() produces one more select and does some value copying (see the sketch after this list)

  • when loading a vast amount of data, do not use transactions. I see you flush only on every 50th record, but the transaction overhead may still be very expensive anyway

  • as written in the previous post, set the bulk insert property (depending on the JPA provider used)
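
A rough sketch of the chunked approach, using persist() and short per-chunk commits. An application-managed EntityManager with resource-local transactions is assumed, and the toEan(...) mapper is hypothetical:

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import java.util.List;

public class ChunkedLoader {

    private static final int CHUNK_SIZE = 500; // tune together with hibernate.jdbc.batch_size

    // persist() new entities instead of merge(), and commit + clear in small chunks so
    // neither the persistence context nor a single transaction grows unbounded.
    public static void load(EntityManagerFactory emf, List<String[]> rows) {
        EntityManager em = emf.createEntityManager();
        try {
            em.getTransaction().begin();
            int count = 0;
            for (String[] cols : rows) {
                em.persist(toEan(cols));              // no extra select as with merge()
                if (++count % CHUNK_SIZE == 0) {
                    em.flush();
                    em.clear();
                    em.getTransaction().commit();     // end this chunk's transaction
                    em.getTransaction().begin();      // and start the next one
                }
            }
            em.getTransaction().commit();
        } finally {
            em.close();
        }
    }

    // Hypothetical mapper from the CSV columns to the question's Ean entity.
    private static Ean toEan(String[] cols) {
        Ean ean = new Ean();
        // ... set the fields from cols here ...
        return ean;
    }
}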

– gusto2

Since you have a CSV file with a lot of entries, I'd consider another alternative: the database-specific, separate tools.

For example, for MySQL you'd have these tools as well as LOAD DATA INFILE; for PostgreSQL, the COPY syntax, which can be used from the command line; for Oracle, sqlldr. Those are more optimized for this kind of thing.
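
For example, a rough sketch of driving MySQL's LOAD DATA from Java once the uploaded file has been written to disk. Connection details, file path and table name are made up, and Connector/J needs allowLoadLocalInfile=true to send a local file:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvBulkLoad {

    public static void main(String[] args) throws Exception {
        // allowLoadLocalInfile=true lets Connector/J stream the local file to the server
        String url = "jdbc:mysql://localhost:3306/eandb?allowLoadLocalInfile=true";

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {

            // Loads the uploaded CSV straight into a staging table, skipping the header row.
            // Table and column layout are illustrative; adapt them to the real schema.
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/tmp/ean_upload.csv' " +
                "INTO TABLE ean_staging " +
                "FIELDS TERMINATED BY ',' " +
                "LINES TERMINATED BY '\\n' " +
                "IGNORE 1 LINES");
        }
    }
}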

– eis
  • Actually, the problem here is that this is a web application. The user uploads the data, we infer the schema from it, and the user selects the columns he needs; only those columns need to be saved to the database. Thanks – Ram kumar D M Feb 19 '17 at 11:49
  • @RamkumarDM nothing prevents the web application from saving the file to disk, then calling one of these tools with the file path as a parameter (and removing the file afterwards). That would also be more user-friendly, as the user doesn't have to wait for the persisting. – eis Feb 19 '17 at 12:20