
So here's my question. I have a giant text file of data, and I need to load all of it into a MySQL database quickly, obviously using a Java program. My only problem is that the data is identified by a certain ID. Some of these IDs are duplicates and contain exactly the same information as each other. I would like to remove all of the duplicates, for sorting purposes and clarity's sake.

What would be the best way to go about this? If anyone could help I'd appreciate it!

Thanks.


2 Answers


While reading the data, keep a HashMap or HashSet. Check whether the ID already exists in the map/set; if it does, skip the row. Otherwise, add the ID to the set/map and do the insert.

An aside: the difference between HashMap and HashSet is that a HashSet stores only values, while a HashMap stores key-value pairs. Internally, however, HashSet is backed by a HashMap and simply inserts a dummy object as the value. See: Differences between HashMap and Hashtable?

Example with hashset:

    HashSet<Integer> distinctIds = new HashSet<Integer>();

    MyRowData rowdata;
    int rowID;

    // or however you iterate over the rows using a reader etc.
    while ((rowdata = this.getRowData()) != null)
    {
        rowID = rowdata.getRowID();

        if (!distinctIds.contains(rowID))
        {
            distinctIds.add(rowID);
            insertDataInMysql(rowdata); // however you insert your data here
            System.out.println("Adding " + rowID);
        }
    }

You can speed things up further with batch inserts, which execute a single combined insert for many rows instead of one round trip per row.
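Here is a minimal sketch of combining the HashSet dedup with a JDBC batch insert. The `RowData` class, the `my_table` table, and its column names are assumptions for illustration; substitute your own row type and schema.

```java
import java.sql.*;
import java.util.*;

public class BatchInsertSketch {
    // Hypothetical row holder; the field names are assumptions.
    static class RowData {
        final int id;
        final String payload;
        RowData(int id, String payload) { this.id = id; this.payload = payload; }
    }

    // Deduplicate by ID, keeping the first occurrence (same idea as the HashSet loop above).
    // Set.add returns false when the ID was already present, so no separate contains() call is needed.
    static List<RowData> dedupeById(List<RowData> rows) {
        Set<Integer> seen = new HashSet<>();
        List<RowData> out = new ArrayList<>();
        for (RowData r : rows) {
            if (seen.add(r.id)) {
                out.add(r);
            }
        }
        return out;
    }

    // Insert all rows in one batch; table and column names are assumptions.
    static void insertBatch(Connection conn, List<RowData> rows) throws SQLException {
        String sql = "INSERT INTO my_table (id, payload) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (RowData r : rows) {
                ps.setInt(1, r.id);
                ps.setString(2, r.payload);
                ps.addBatch();  // queue the row client-side
            }
            ps.executeBatch();  // one round trip for the whole batch
        }
    }
}
```

With the MySQL Connector/J driver, adding `rewriteBatchedStatements=true` to the JDBC URL lets the driver rewrite the batch into a single multi-row INSERT, which typically helps most here.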


Just make your ID column the primary key when loading the data into the database. That way, rows with a repeated ID will not be added. Hope this helps.
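One caveat worth noting: with a plain INSERT, a duplicate primary key aborts the statement with a duplicate-key error rather than silently skipping the row. MySQL's `INSERT IGNORE` makes the database skip such rows instead. A minimal sketch, assuming a hypothetical `my_table` with `id` as the primary key:

```java
import java.sql.*;

public class UpsertSketch {
    // MySQL-specific: INSERT IGNORE skips rows that would violate the primary key
    // instead of failing with a duplicate-key error. Table/column names are assumptions.
    static final String INSERT_IGNORE_SQL =
        "INSERT IGNORE INTO my_table (id, payload) VALUES (?, ?)";

    static int insertRow(Connection conn, int id, String payload) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_IGNORE_SQL)) {
            ps.setInt(1, id);
            ps.setString(2, payload);
            // executeUpdate returns 0 when the row was ignored, 1 when it was inserted
            return ps.executeUpdate();
        }
    }
}
```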

  • The ID is already set as the primary key. However, I am trying to sort all of the data, with duplicates removed, before I place it in the database. Would the previous recommendation of a HashMap still be the best idea, or would you suggest something else? –  Apr 12 '13 at 07:25
  • Adding a primary key means that he will still be sending redundant data to the database and using up resources (network+cpu) on both server and client. – Menelaos Apr 12 '13 at 07:33