1

I have a boolean matrix in python and need to find out which are duplicate rows. The representation can also be a list of bitarrays as I am using this for other purposes anyways. Comparing all rows with all rows is not an option as this would yield 12500^2 comparisons and I can only do about 500 per second. Also converting each row into an integer is not possible as each row is about 5000 bits long. Still it seems to me that the best way would be to sort the list of bitarrays and then compare only consecutive rows. Anyone has an idea how to map bitarrays to sortable values or how to sort a list of bitarrays in the first place? Or is there a different approach that is more promising? Also, since I only have to do this once, I prefer less code over efficiency.

user1850980
  • 991
  • 1
  • 10
  • 15
  • What is the size of the array ? – Roberto Mar 16 '14 at 09:33
  • 12500 bitstrings of length 5000 – user1850980 Mar 16 '14 at 09:42
  • I am currently checking whether sort() might actually be able to do the job, so stand by... – user1850980 Mar 16 '14 at 09:43
  • 1
    Two good ideas for such a problem are (1) hash the values and (2) sort them. For (1), you need a hashing function for a bit array. For (2), you need ordering function. If `bitarray` module does not provide any of these (test by putting in dict and comparing two bitarrays, respectively), you will probably have to implement one of them. – Gassa Mar 16 '14 at 10:06

1 Answers1

0

Ok, so a list of bitarrays is quickly sortable by sort() or sorted(). Furthermore, probably better way to solve this problem is indicated in Find unique rows in numpy.array.

Community
  • 1
  • 1
user1850980
  • 991
  • 1
  • 10
  • 15