2

It seems a little awkward to ask this question, but I'm still struggling to find the answer myself.

I have an array of elements, several of which are duplicated. For example:

list = [ '1' , '2' , '3' , '1' , '4' , '5' , '3' ]

As can be seen, "1" and "3" each appear twice. Now I want to clean it up by removing the duplicates. This is what I always do:

# Create a new list
listCustomize = []

for element in list:
    # Check if element is already in listCustomize; if so, don't add it
    if element not in listCustomize:
        listCustomize.append(element)

This way, I get a new array with the duplicates removed:

listCustomize = [ '1' ,'2' , '3' , '4' , '5' ]

SO HERE IS THE PROBLEM: my original array contains hundreds of thousands of elements, so the program is extremely slow.

Can someone suggest a more sophisticated approach to this issue? I am thinking about using multi-threading, or using a database to store the original array....

Note: The choice of programming language is not a problem, but I'd prefer Perl/Python/Java/C++.

Thank you and best regards.

Alex

AlexPham
    If you don't require an in-place solution, then one option would be to sort the array (an `O(n*lgn)` penalty most-likely), and then simply walk down the array and keep track of the last value seen. When that changes, add the next value once and only once to a new array. – Tim Biegeleisen May 03 '16 at 05:56
    Learn about hash tables. – Henry May 03 '16 at 05:56
  • @Henry Of course he could just add the array into a set, but I think he wants a solution only involving the array. – Tim Biegeleisen May 03 '16 at 05:58
  • you can use c++ STL set instead of array to store data. STL set store unique data – Abu Hanifa May 03 '16 at 06:00
  • Possible duplicate of [How do you remove duplicates from a list in Python whilst preserving order?](http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order) – Paul Hankin May 03 '16 at 07:06
  • Why not split the array and use threads? But I think @TimBiegeleisen's answer is the right one: you should sort the array and then delete the duplicates. Another idea: maybe use a faster language... like Golang ;) – albttx May 03 '16 at 09:03

3 Answers

0

I think using hashing would help in this case, since lookups in a hash table are O(1).

If I were using Java, then every time I insert an element into the new list, I would also store it in a hashmap. The reason for doing this is that I can then check in O(1) time whether the element has already been inserted into the new list. Therefore, there would be no duplicates in the new list.

The key would be the Integer and the value a Boolean. The hashmap is declared as:

HashMap<Integer,Boolean> hm=new HashMap<Integer,Boolean>();

The algorithm for creating new list is as follows:

for (Integer e : originalList) {
    if (hm.get(e) == null) {   // check if the element is already in the new list
        newList.add(e);
        hm.put(e, true);       // record the new element in the hashmap
    }
}

This will definitely speed up your code, because we cancel out duplicates and do the job in linear time O(n), where n is the number of elements in the original array.

Whenever there is a search problem, hashing helps in most cases.
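The same idea translated to Python, as a sketch for comparison (not part of the original answer): a plain `dict` plays the role of the Java `HashMap`.

```python
def dedupe_hashed(original):
    # The dict plays the role of the HashMap: membership checks and
    # inserts are O(1) on average, so the whole pass is O(n).
    seen = {}
    result = []
    for e in original:
        if e not in seen:      # same check as hm.get(e) == null
            result.append(e)
            seen[e] = True     # same as hm.put(e, true)
    return result

print(dedupe_hashed(['1', '2', '3', '1', '4', '5', '3']))
# ['1', '2', '3', '4', '5']
```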

Sumeet
  • Hmm... hashing tends to temporarily use a lot of RAM when creating the hash table, especially in higher-level languages. "... my original array includes hundreds of thousands of elements... ". Did you try your solution on a large data set? – BJovke May 03 '16 at 09:27
  • @BJovke But that hashmap will only contain distinct elements. – Sumeet May 03 '16 at 09:32
  • @BJovke So only in the worst of the worst cases will all of your hundreds of thousands of elements be distinct. The point is that it does not matter how many elements there are, it only matters how many distinct elements there are. – Sumeet May 03 '16 at 09:33
  • Yes, you are right, your solution is valid. But a hash map can have a high impact on RAM for a large number of elements. It can also impact the CPU heavily. Hash creation takes most of the time, which enables fast access to elements afterwards. Here the total execution time needs to be as small as possible. The most effective solution is to use a sort algorithm. – BJovke May 03 '16 at 09:40
    Sorting will take O(n log n) and my solution is O(n). So how does sorting win here? – Sumeet May 03 '16 at 09:43
  • You're right. But a hash function takes time to execute, and there is a possibility of collisions. In this case the best way to know is to benchmark a few different approaches. The execution time of the hash approach is O(n) * time_to_exec_one_hash; the execution time of the sort approach is O(n log n) * time_to_exec_one_iteration. The second approach is faster up to a certain number of elements n. – BJovke May 03 '16 at 09:53
0

Here the question is not which programming language or function to use. Although your program would run much faster in, for example, C than in Java, that is not so critical.

What you need is to sort your array. Once it is sorted, you only need one pass through the array, and you can easily discard duplicate values since they appear one after another.

You can use any Quicksort implementation in any language. The problem with Quicksort is that it becomes slower when over 50% of the array is already sorted. Its best-case complexity is O(n log n) and its worst case is O(n^2).

Many sorting implementations first pass through the array to determine how sorted it already is, and then choose Quicksort if the array is sorted below a certain degree and, for example, Straight Insertion Sort otherwise.

Hm, yes, there is also the question of whether the hash approach is faster or not, since the hash approach has a complexity of O(n). But a hash function takes time to execute, and there is a possibility of collisions. In this case the best way to know is to benchmark a few different approaches. The execution time of the hash approach is O(n) * time_to_exec_one_hash; the execution time of the sort approach is O(n log n) * time_to_exec_one_iteration. The second approach is faster up to a certain number of elements n.

Also, depending on the language and method used, there could be additional hidden costs in how the addition of new elements to the hash/sort result is implemented.
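A minimal sketch of the sort-then-scan approach in Python (note that, unlike the order-preserving approaches, it does not keep the elements in their original order):

```python
def dedupe_sorted(items):
    # After sorting (O(n log n)), duplicates sit next to each other,
    # so one linear pass keeps each value only when it differs
    # from the previous one.
    result = []
    last = object()  # sentinel that compares unequal to every element
    for x in sorted(items):
        if x != last:
            result.append(x)
            last = x
    return result

print(dedupe_sorted(['1', '2', '3', '1', '4', '5', '3']))
# ['1', '2', '3', '4', '5']
```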

BJovke
-1

In Python, the built-in set data structure (e.g. a = set()) can solve this problem.
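For completeness, a short sketch of what this looks like (a plain `set` discards the original order; `dict.fromkeys` keeps first-seen order, since dicts preserve insertion order from Python 3.7 on):

```python
data = ['1', '2', '3', '1', '4', '5', '3']

# set() keeps only unique elements but discards the original order:
print(sorted(set(data)))            # ['1', '2', '3', '4', '5']

# dict.fromkeys() deduplicates while keeping first-seen order:
print(list(dict.fromkeys(data)))    # ['1', '2', '3', '4', '5']
```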

liyj144
  • An answer without any explanation. Alex is interested in speeding up the program; you need to explain what speed improvements your answer brings. The core problem is not which function to use but which algorithm to use. – BJovke May 03 '16 at 09:26