I'm still on the learning curve with Python 3, so please advise. I have a very long array that looks like the one below. How do I check whether a pair of values (the string in the second position and the date in the last position of each element) appears in more than one element of the array?

Array:

[
('1','200','300','500','2015-04-25 7:00:00'),
('1','200','500','500','2015-04-26 8:00:00'),
('1','200','500','500','2015-04-26 8:00:00'), # Repeated
('1','200','900','500','2015-04-27 9:00:00'),
('1','200','300','500','2015-04-28 17:00:00'),
('1','200','300','500','2015-04-28 17:00:00'), # Repeated
...
...
]
Eric T
  • 1
    Here's a link that might help: http://stackoverflow.com/questions/11236006/identify-duplicate-values-in-a-list-in-python – Mark Hill Jun 25 '15 at 04:07

3 Answers

I'd recommend using pandas. If your array (actually called a list in Python) is named A, you can load it into a DataFrame with:

import pandas as pd
df = pd.DataFrame(A)
df
   0    1    2    3                    4
0  1  200  300  500   2015-04-25 7:00:00
1  1  200  500  500   2015-04-26 8:00:00
2  1  200  500  500   2015-04-26 8:00:00
3  1  200  900  500   2015-04-27 9:00:00
4  1  200  300  500  2015-04-28 17:00:00

Then you can flag the rows whose values in columns 1 and 4 repeat an earlier row like this:

df['Repeated'] = df.duplicated(subset=[1, 4])
df

Out[463]: 
   0    1    2    3                    4 Repeated
0  1  200  300  500   2015-04-25 7:00:00    False
1  1  200  500  500   2015-04-26 8:00:00    False
2  1  200  500  500   2015-04-26 8:00:00     True
3  1  200  900  500   2015-04-27 9:00:00    False
4  1  200  300  500  2015-04-28 17:00:00    False
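
If you want the matching rows themselves rather than a flag column, a minimal sketch along the same lines follows. It assumes the key columns are 1 and 4 (the string and the date from the question) and relies on the `keep=False` option of `duplicated`, which marks every member of a repeated group instead of only the later copies. The list `A` is copied from the question, and the names `mask`, `repeated_rows` and `repeated_positions` are made up for illustration.

```python
import pandas as pd

A = [
    ('1', '200', '300', '500', '2015-04-25 7:00:00'),
    ('1', '200', '500', '500', '2015-04-26 8:00:00'),
    ('1', '200', '500', '500', '2015-04-26 8:00:00'),
    ('1', '200', '900', '500', '2015-04-27 9:00:00'),
    ('1', '200', '300', '500', '2015-04-28 17:00:00'),
    ('1', '200', '300', '500', '2015-04-28 17:00:00'),
]

df = pd.DataFrame(A)

# keep=False flags every row of a repeated (column 1, column 4) group,
# not just the second and later copies
mask = df.duplicated(subset=[1, 4], keep=False)
repeated_rows = df[mask]

# positions of the rows that belong to a repeated group
repeated_positions = repeated_rows.index.tolist()
print(repeated_positions)  # [1, 2, 4, 5]
```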
maxymoo

Some approaches that don't require using an external library are:

long_array = [
    ('1','200','300','500','2015-04-25 7:00:00'),
    ('1','200','500','500','2015-04-26 8:00:00'),
    ('1','200','500','500','2015-04-26 8:00:00'), # Repeated
    ('1','200','900','500','2015-04-27 9:00:00'),
    ('1','200','300','500','2015-04-28 17:00:00'),
    ('1','200','300','500','2015-04-28 17:00:00'), # Repeated
    # ...
]

Use a set:

values = set()
for entry in long_array:
    # key on the string at index 1 and the date at index 4
    value = (entry[1], entry[4])
    if value in values:
        print("Duplicate " + str(entry))
    else:
        values.add(value)

Or use collections.Counter:

from collections import Counter

# count how often each (index 1, index 4) pair occurs
values = Counter((entry[1], entry[4]) for entry in long_array)
for value, count in values.items():
    if count > 1:
        print(str(count) + " occurrences of " + str(value))

The size of the array matters here. Both approaches keep every distinct key in memory, so they may cause problems for really big arrays.
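
One way to soften the size concern, sketched under the assumption that the rows can be read one at a time (from a file or a database cursor, say), is to wrap the set approach in a generator so the full list never has to sit in memory. The names `iter_duplicates` and `key_indices` are hypothetical. The set still holds one key per distinct pair, so this reduces memory use but does not remove the limit.

```python
def iter_duplicates(rows, key_indices=(1, 4)):
    """Yield (position, row) for each row whose key was already seen."""
    seen = set()
    for position, row in enumerate(rows):
        key = tuple(row[i] for i in key_indices)
        if key in seen:
            yield position, row
        else:
            seen.add(key)


sample = [
    ('1', '200', '300', '500', '2015-04-25 7:00:00'),
    ('1', '200', '500', '500', '2015-04-26 8:00:00'),
    ('1', '200', '500', '500', '2015-04-26 8:00:00'),
    ('1', '200', '900', '500', '2015-04-27 9:00:00'),
    ('1', '200', '300', '500', '2015-04-28 17:00:00'),
    ('1', '200', '300', '500', '2015-04-28 17:00:00'),
]

# iter(sample) stands in for any source that produces rows lazily
found = list(iter_duplicates(iter(sample)))
for position, row in found:
    print(position, row)
```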

demented hedgehog
  • 1
    I like your set solution, as it does not require any external library and it is portable. Thank you. – Eric T Jun 25 '15 at 07:09
  • 1
    I think the set implementation is better too. The Counter implementation goes through the values twice (once to count, once to report), while the set implementation goes through them just once. I was going to suggest storing only the hash values in the set if the size got too big, but that could lead to problems with hash collisions (maybe, if you're doing a huge number of them). The Counter is good if you've already got a list of things. So they can both be handy from time to time. – demented hedgehog Jun 25 '15 at 11:07
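
The hash-only idea from the comment above could look roughly like the sketch below. It is an illustration, not something from the answer: `key_digest` is a hypothetical helper, the 16-byte BLAKE2b digest is an arbitrary choice, and the `"\x1f"` separator is assumed never to occur in the data. As the comment warns, two different keys could in principle share a digest, so a reported duplicate is very likely but not guaranteed to be a true one.

```python
import hashlib


def key_digest(row, key_indices=(1, 4)):
    """Return a 16-byte digest of the key fields of one row."""
    # "\x1f" (unit separator) is assumed not to occur in the data
    raw = "\x1f".join(row[i] for i in key_indices).encode("utf-8")
    return hashlib.blake2b(raw, digest_size=16).digest()


rows = [
    ('1', '200', '300', '500', '2015-04-25 7:00:00'),
    ('1', '200', '500', '500', '2015-04-26 8:00:00'),
    ('1', '200', '500', '500', '2015-04-26 8:00:00'),
    ('1', '200', '900', '500', '2015-04-27 9:00:00'),
    ('1', '200', '300', '500', '2015-04-28 17:00:00'),
    ('1', '200', '300', '500', '2015-04-28 17:00:00'),
]

seen_digests = set()
duplicate_positions = []
for position, row in enumerate(rows):
    digest = key_digest(row)
    if digest in seen_digests:
        duplicate_positions.append(position)
    else:
        seen_digests.add(digest)

print(duplicate_positions)  # [2, 5]
```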

If you want to actually code up a solution in Python to get practice, here's one way:

# the indices in the tuples to be used as keys for determining repeats
# set this to whatever indices you would like (or even all of them)!
key_indices = [1, 4]

# for a given tuple tpl, construct a key consisting of the values in tpl
# that are found at the indices given in ki
def make_key(tpl, ki):
    key_elements = []
    for i in ki:
        key_elements.append(tpl[i])

    # need to return a tuple, as you cannot use a list as a key for a dict
    return tuple(key_elements)

data = [
('1','200','300','500','2015-04-25 7:00:00'),
('1','200','500','500','2015-04-26 8:00:00'),
('1','200','500','500','2015-04-26 8:00:00'), # Repeated
('1','200','900','500','2015-04-27 9:00:00'),
('1','200','300','500','2015-04-28 17:00:00'),
('1','200','300','500','2015-04-28 17:00:00') # Repeated
]

# the data structure that we'll use to remember where we've seen keys before
memory = dict()
duplicates = set()

for i, row in enumerate(data):
    # make the key for comparison
    k = make_key(row, key_indices)

    # find out where we've seen this before
    # if nowhere else, start with an empty list
    previous_locations = memory.get(k, [])

    # note that we have now seen this key at location i
    previous_locations.append(i)

    if len(previous_locations) > 1:
        duplicates.add(i)

    # update the dict with the new location
    memory[k] = previous_locations

# sort the positions because a set has no guaranteed order
print("Duplicate values found at: {}".format(sorted(duplicates)))


# and if you want to know which keys were duplicated where?
for k, locs in memory.items():
    if len(locs) > 1:
        print("{}: {}".format(k, locs))

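Output (on Python 3.7+, where dicts preserve insertion order; older versions may print the keys in a different order):

Duplicate values found at: [2, 5]
('200', '2015-04-26 8:00:00'): [1, 2]
('200', '2015-04-28 17:00:00'): [4, 5]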
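
Once the step-by-step version makes sense, the same bookkeeping can be written more compactly with `collections.defaultdict`. This is only a sketch of one possible shortening; `find_repeats` is a hypothetical name, and the default `key_indices=(1, 4)` assumes the same key positions as above.

```python
from collections import defaultdict


def find_repeats(rows, key_indices=(1, 4)):
    """Map each repeated key to the list of positions where it occurs."""
    locations = defaultdict(list)
    for position, row in enumerate(rows):
        key = tuple(row[i] for i in key_indices)
        locations[key].append(position)
    # keep only the keys that were seen more than once
    return {key: locs for key, locs in locations.items() if len(locs) > 1}


data = [
    ('1', '200', '300', '500', '2015-04-25 7:00:00'),
    ('1', '200', '500', '500', '2015-04-26 8:00:00'),
    ('1', '200', '500', '500', '2015-04-26 8:00:00'),
    ('1', '200', '900', '500', '2015-04-27 9:00:00'),
    ('1', '200', '300', '500', '2015-04-28 17:00:00'),
    ('1', '200', '300', '500', '2015-04-28 17:00:00'),
]

repeats = find_repeats(data)
for key, locs in repeats.items():
    print("{}: {}".format(key, locs))
```

Passing a different `key_indices` tuple, for example `(0, 1, 2, 3, 4)`, would compare whole rows instead of just the string and the date.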
Aaron Johnson