0

I am working with two big lists of data and I need to efficiently check for matches between the two. This is the scenario:

  • Reading from a file line by line (this file has 1 million lines)
  • For each line, check within an ArrayList of strings whether it has a match (this ArrayList also has a huge number of elements)
  • If a match is found, replace the line from the file with a new value

Any ideas what would be a good way to tackle this problem in terms of efficiency? Obviously looping through that number of records is hopelessly inefficient and process heavy.

Thanks for any help!

UPDATE It's worth noting, I'm not specifically saying I need to use an ArrayList, that is just something I was using for testing. Any suggestions of more efficient Collections would be welcome.

colmulhall
  • 1,548
  • 3
  • 18
  • 33
  • Not a duplicate or anything, but [this answer](https://stackoverflow.com/a/559056/2375907) could be helpful – S.V. Dec 08 '17 at 08:35
  • 1
    Why do you use an ArrayList to check for matches? Wouldn't it be better to use a HashSet instead? – Ralf Renz Dec 08 '17 at 08:39
  • What does "match" mean in your context? `fileLine.equals(collectionElement)` ? – Ralf Kleberhoff Dec 08 '17 at 09:06
  • As Ralf Renz says use a hashMap for the second list. Then it will be O(N) time to check all matches. Also if the first list is a set(no repeated values) then you could remove the values in hashMap that you already matched. I am also assuming you are not checking matches for the new value, are you? – Pumpkin Dec 08 '17 at 09:33
  • It depends whether you can parse the lines into words or you want a generic approach supporting any string (not necessarily limited by word borders). The first is best done with a word index (Bloomfilter, Hashmap, Trie), the second with a multistring matching algorithm (e.g. Aho-Corasick). – CoronA Dec 08 '17 at 09:42
  • The file line will be someting like "item1" and I will be checking an ArrayList (or any other type of list) against each line looking for the match. – colmulhall Dec 08 '17 at 11:18
  • But does "match" mean equality, i.e. do you just test if the input line is exactly equal to some member of your ArrayList? If so, then a `HashSet` trivially solves it. If you need a case-insensitive match, it's still simple. If you match by common words or alike, then it may get much more complicated. Pls, add this information to your question. – maaartinus Dec 09 '17 at 04:02

3 Answers3

0

You may consider reading the file partially by different threads. Similar issue is discussed here.

You may process the text in chunks (say x bytes or one line) , each chunk can be executed by different threads , ie one thread per chunk.

Addy
  • 147
  • 1
  • 1
  • 7
0

Without more details (such as the nature of the keys) it is difficult to be certain but you may find using a Bloom filter useful to minimise the number of times you do check within an ArrayList of strings whether it has a match.

Obviously this would not help much if the lookup list changes over time.

You would use the Bloom filter to do a pre-check before searching the list because it can very quickly give you a straight no answer if the key does not exist in the list. You will still need to search you list if the bloom filter says maybe.

OldCurmudgeon
  • 64,482
  • 16
  • 119
  • 213
0

you should use HashMap it's approximately O(1), or if your strings have a lot of collisions than you need to use TreeSet O(logN), or Bloom filter.