You need to keep track of the case where the same record might appear more than once in the files. For example, if a record appears twice in file A and once in file B, then you need to record that as an extra record.
Since we have to keep track of the number of occurrences, you need one of:
- A Multiset
- A Map from record to Integer e.g. Map
With a Multiset, you can add and remove records and it will keep track of the number of times the record has been added (a Set doesn't do that - it rejects an add of a record that is already there). With the Map approach, you have to do a little bit of work so that the integer tracks the number of occurrences. let's consider that approach (the MultiSet is simpler).
With the map, when we talk about 'adding' a record, you look to see if there is an entry for that String in the Map. if there is, replace the value with value+1 for that key. If there isn't, create an entry with the value of 1. When we talk about 'removing an entry', look for an entry for that key. If you find it, replace the value with value-1. If that reduces the value to 0, remove the entry.
- Create a Map for each file.
- Read a record for one of the files
- Check to see if that record exists in the other Map.
- If it exists in the other Map, remove that entry (see above for what that means)
- If it doesn't exist, add it to the Map for this file (see above)
- Repeat until end, alternating files.
The contents of the two Maps will give you the records that appeared in that file but not the other.
Doing this as we go along, rather than building the Maps up front, keeps the memory usage down, but probably doesn't have a big impact on performance.