
As I mentioned, I have huge ArrayLists in this format:

List<List<String>> alist;

I get the lists out of a .CSV file which I receive from an external database system. (I wrote a Visual Objects script to export the data we need for some calculations.)

After I have the .CSV, I load the content into my List like this:

String line;
       alist = new ArrayList<List<String>>();
       int i = 0;
    // Load the file, then store the lines of the CSV in an ArrayList

       try {

            br = new BufferedReader(new FileReader(path));
            while((line = br.readLine()) != null)
            {
                line = line.replace(",", ".");
                if(line.endsWith(delimitter))
                    line = line + " ";
                alist.add(Arrays.asList(line.split(delimitter)));
                i++;
                if(i==10000000)
                    break;
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

It takes ~15900 ms to store the data in my ArrayList (~1.1 million rows and 11 columns). Pretty okay, I guess. Now that I have the list in the format I need, I would like to create an INSERT out of it, so we can import the data into our database. I'm creating the insert like this:

public String getInsertString()
    {
        // The table name is derived from the file name, without the .csv
        String insert="REPLACE INTO "+tablename + " (";

        // Read the column names from the list, always the first row of the CSV
        for(int i=0; i< alist.get(0).size();i++)
        {
            if(i==0)
                insert = insert + alist.get(0).get(i).trim();
            else
                insert = insert + " ,"+ alist.get(0).get(i).trim();
        }
        insert= insert + ") \rVALUES";

        // Write the column values into the insert and make sure the statement's syntax is correct, etc.
        for(int i=1;i < alist.size();i++) // size going "down" the 2D list
        {

            insert= insert +"(";
            for(int j=0; j < alist.get(0).size();j++) // size going "right" in the 2D list
            {
                // start the first one without "," so the syntax is right
                // If the current value is a number or "null", add no "'". Otherwise wrap it in "'" for the DB insert
                if(j==0)
                {           
                    if((StringUtils.isStrictlyNumeric(alist.get(i).get(j).trim())) || alist.get(i).get(j).trim().contains("null"))
                        insert = insert + alist.get(i).get(j).trim();
                    else 
                        insert = insert + "'" + alist.get(i).get(j).trim() + "'";
                }                   
                else
                {
                    if(((StringUtils.isStrictlyNumeric(alist.get(i).get(j).trim()))) || (alist.get(i).get(j).trim().contains("null")))
                        insert = insert +","+ alist.get(i).get(j).trim();
                    else 
                        insert = insert + ",'" + alist.get(i).get(j).trim() + "'";      
                }

            }
            if(i < alist.size()-1)
                insert= insert +"),";
            else 
                insert= insert +")";
            insert = insert +"\r";

        }

        //System.out.println(insert);
        return insert;

    }

Here I go over the whole list and append the values to a string so that I can use the string for an insert. I use the file names as table names and the first row of each file for the columns in the insert. All other rows are the values.

After this step is done, I get a string like this "REPLACE INTO tablename (column1,column2 ... columnx) VALUES(value1, value2 ... valuex), (value1, value2 ... valuex), ...."

Then I would execute my InsertInDb class with that string, and that's it.

But the second step takes way too much time (I've been waiting for about an hour now). Is there a smarter way to do what I want to do (insert all CSVs into our database automatically)?

Would a BigList maybe improve the speed? https://dzone.com/articles/biglist-scalable-high (can't test it right now)

Edit: What I did to solve the problem:

I created a class for my SQL statement -->

public class BuildInsert {


    private String insertString;

    public String getINSERTSTRING()
    {
        return insertString;
    }

    BuildInsert(String tablename, List<String> alist )
    {   
        int size = alist.size();

        this.insertString = "REPLACE INTO " + tablename + "(";

        // Insert "Header"
        for(int j=1; j <= size ;j++) 
        {
            if(j < size)
                this.insertString = this.insertString + alist.get(j-1)+",";
            else
                this.insertString = this.insertString + alist.get(j-1)+")\n";

        }

        this.insertString = this.insertString +"VALUES(";

        // Insert values
        for(int j=1; j <= size ;j++) 
        {
            if(j < size)
                this.insertString = this.insertString + "?,";
            else
                this.insertString = this.insertString + "?)";

        }
    }

}
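For reference, the same logic can be written more compactly with a StringBuilder (as marstran's comment suggests); the table and column names below are invented, and the output shown is what my BuildInsert class produces:

```java
import java.util.Arrays;
import java.util.List;

public class BuildInsertDemo {

    // Same logic as BuildInsert above, reduced to a static helper
    static String buildInsert(String tablename, List<String> columns) {
        StringBuilder sb = new StringBuilder("REPLACE INTO " + tablename + "(");
        for (int j = 0; j < columns.size(); j++) {
            sb.append(columns.get(j));
            sb.append(j < columns.size() - 1 ? "," : ")\n");
        }
        sb.append("VALUES(");
        for (int j = 0; j < columns.size(); j++) {
            sb.append(j < columns.size() - 1 ? "?," : "?)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints:
        // REPLACE INTO mytable(col1,col2,col3)
        // VALUES(?,?,?)
        System.out.println(buildInsert("mytable", Arrays.asList("col1", "col2", "col3")));
    }
}
```

The StringBuilder matters because `s = s + x` in a loop copies the whole string on every iteration, which is exactly what made my original getInsertString so slow.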

and took this string for a batch insert / prepared statement as Mike mentioned:

Connection con;
        Statement stmt;

        final int batchSize = 1000;
        int count = 0;

        int sizeH = alist.size();
        int sizeL = alist.get(0).size();

        try {

             // Open the connection and prepare the statement
            System.out.println("Connecting to database...");
            con = DriverManager.getConnection(DB_URL,USER,PASS);
            con.setAutoCommit(false);

            ps = con.prepareStatement(insertString);

            stmt = con.createStatement();

            for(int i=1;i < sizeH ;i++) // size going "down"
            {
                for(int j=0; j < sizeL;j++) // size going "right"
                {
                    ps.setString(j+1, alist.get(i).get(j));
                }

                ps.addBatch();

                if(++count % batchSize == 0){
                    ps.executeBatch();
                    con.commit();
                }
            }    

            ps.executeBatch();
            con.commit();

        }

The insert is much faster now (~230 s for 1.1 million rows).

Thank you guys ;-)

user3793935
    At least use a `StringBuilder`. – marstran Aug 08 '16 at 10:21
  • Take a look at this question about importing csv into mysql http://stackoverflow.com/questions/3635166/how-to-import-csv-file-to-mysql-table – Nazarii Bardiuk Aug 08 '16 at 10:24
  • To keep this logic of reading the file, I would use a PreparedStatement, this will be much faster to insert 1million of row. Execute the preparedStatement for each row. – AxelH Aug 08 '16 at 10:30

3 Answers


Since you want to load a CSV file into MySQL, you should use MySQL's own tools, e.g. LOAD DATA INFILE:

LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
 FIELDS TERMINATED BY ',' ENCLOSED BY '"'
 LINES TERMINATED BY '\r\n'
 IGNORE 1 LINES;

If you still want to do it yourself, you should use a JDBC batch insert. Also consider using a PreparedStatement with the JDBC batch. See using JDBC preparedStatement in a batch

René Link
Now, that I have the List in the format I need, I would like to create a Insert out of it, so we can import the data in our database.

Don't do that! Do a JDBC batch insert instead. (See this on how to do a PreparedStatement, which will save you a lot of trouble formatting the insert statements.)

Mike Thomsen

I'll try to answer your question: "the second step takes way too much time [...] Is there a smarter way to do what I want to do? (Insert all CSV automatically in our Database)"

Fastest solution

As stated on the MySQL Reference Manual (Speed of INSERT Statements) the fastest way is to use LOAD DATA INFILE.

Java solution

If you also want to try a Java solution, you may want to reconsider your approach. The insert produced by your code uses the multiple-VALUES syntax (the recommended one), but the resulting statement is far too long:

    1.1*10^6 rows x 11 cols x 10 B per col = 1.21*10^8 B = approx. 120 MB

So try to:

  • set your own batch-size variable and create multiple INSERT statements, each with at most that many value tuples
  • check that the batch size you choose keeps each statement below bulk_insert_buffer_size (see the reference in the same link above)
  • use a PreparedStatement: instead of a ready-made string to execute, you build a string with ? in place of the values plus an array of arguments (the values); see the Java Tutorial, JDBC Basics - Using Prepared Statements
  • if you use a PreparedStatement, you can avoid checking whether you are about to insert a number, a string or something else (use the setObject method), so you can drop those redundant checks on the content of the CSV fields (that is validation, and it belongs in the parsing step)
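The first two bullets can be sketched as follows. The table name, columns and values here are invented, and the values are assumed to be numeric or already quoted/escaped (with a PreparedStatement the driver handles that for you); the point is only to show one multi-VALUES statement emitted per chunk of batchSize rows, so no single statement grows unbounded:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedInsertBuilder {

    // Build one "REPLACE INTO ... VALUES (...),(...)" statement per chunk of batchSize rows
    static List<String> buildChunkedInserts(String table, List<String> cols,
                                            List<List<String>> rows, int batchSize) {
        List<String> statements = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            StringBuilder sb = new StringBuilder("REPLACE INTO ").append(table)
                    .append(" (").append(String.join(",", cols)).append(") VALUES ");
            for (int i = start; i < end; i++) {
                sb.append('(').append(String.join(",", rows.get(i))).append(')');
                if (i < end - 1) sb.append(',');
            }
            statements.add(sb.toString());
        }
        return statements;
    }
}
```

Each returned statement can then be executed on its own, keeping every individual statement comfortably below bulk_insert_buffer_size.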
Aris2World