Efficient Intersection and Union of Lists of Strings

Question

I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.

The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.

Here's what I have so far:

protected double uuEdgeWeight(UVertex u1, UVertex u2) {
    Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
    List<String> u2Tokens = u2.getTokenlist();

    int intersection = 0;
    int union = u1Tokens.size();
    for (String s:u2Tokens) {
        if (u1Tokens.contains(s)) {
            intersection++;
        } else {
            union++;
        }
    }
    return ((double) intersection / union);

My question is, is there anything I can do to improve this, given that I'm working with Strings which may be more time consuming to check equality than other data types.

I think because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u2 into a HashSet outside of the loop (which isn't shown -- meaning I'd pass in the HashSet instead of the object from which I could pull the list and then clone into a set)

Anything else I can do to squeak out even a small improvement here?

Thanks in advance!

Update

I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt at this used the clone the set and then retainAll the items in the other set approach to find the intersection, and then shortcuts out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade of between it being a slower algorithm overall versus being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.

Thanks in advance!

This question appears to be off-topic because it is about a code review and belongs on [codereview.stackexchange.com](http://codereview.stackexchange.com) not here. — , Jan 06 '14 at 19:19
Is there the possibility of your list containing any duplicates? I also think this isn't so much a code review but an interesting algorithm question about how to find fast intersections and unions of string sets. — Daniel Williams, Jan 06 '14 at 19:22
I think it's possible for the lists to contain duplicates, but 99% of the time they will not. — PurpleVermont, Jan 06 '14 at 19:38
@JarrodRoberson Can you clarify what belongs here versus codereview.stackexchange.com? Most of the questions I have seen on here include code snippets to illustrate the asker's initial approach. — PurpleVermont, Jan 06 '14 at 19:47
If you have **working** code, that you want advice on, it doesn't belong here. This site is for non-working code and how to fix it. — , Jan 06 '14 at 19:59
I guess the counter argument would be that this is non-working in that it isn't fast enough. There is some overlap between here and code review and since he has a specific question about this code I think it's ok. — Tim B, Jan 06 '14 at 20:14
@Jarrod Roberson: All the "off-topic because of another site" is a big non-sense, especially when speaking about performance! Moreover, CR is beta and there's hardly anything usable there. CR should be shut down, that's it. — maaartinus, Jan 06 '14 at 22:51
I see that this post has been put on hold as "off topic". Would it be on-topic if I edited it to exclude the code snippet and just ask a general question about how to do this most efficiently in Java? I'm a bit confused because the community standards page states that good questions usually should include a bit of code. They don't say that it has to be broken code. The code I posted computes the right function but doesn't do it efficiently enough for me to use it, so as @TimB says, it's not "working" for my purposes. — PurpleVermont, Jan 07 '14 at 01:49
Or, alternately, is there a way to move this question with its replies so far to codereview, if that is indeed the best place for it? — PurpleVermont, Jan 07 '14 at 01:56
Remove the code and it is off-topic because it doesn't include code. This also falls foul of *"too localized"* even if that doesn't exist as a pre-made selection anymore either. — , Jan 07 '14 at 04:51
It already has 3 re-open votes. I guess it's one of those borderline cases where a good number of people think it should be closed and equally a good number think it should be open. — Tim B, Jan 07 '14 at 08:23
@JarrodRoberson can you explain what you mean by "too localized"? Do you mean that you think I'm the only person who would care about the answer? — PurpleVermont, Jan 07 '14 at 15:44
@TimB is there anything I can/should do to improve the question, or should I just wait it out and see how the voting goes? I've been a reader for many years, but am a new poster, and I want to be a "good citizen". — PurpleVermont, Jan 07 '14 at 15:45
You could post a question on meta asking for feedback on whether the question should be closed or open and suggestions to improve it. — Tim B, Jan 07 '14 at 16:33
The on topicness of this question is [under discussion on meta](http://meta.stackexchange.com/questions/215282/is-stack-overflow-the-right-place-to-ask-a-question-about-code-efficiency). Current opinion appears to be; best on code-review but acceptable on stack overflow — Richard Tingle, Jan 07 '14 at 17:32
@JarrodRoberson Just because a question is on-topic elsewhere doesn't mean it's off-topic here. In particular, just because a question asks about how to improve working code doesn't mean it is off-topic on StackOverflow. See [this Meta thread](http://meta.stackexchange.com/questions/214787) about precisely that scenario. — Mark Amery, Jan 07 '14 at 17:34
I know this is old, and so probably won't get seen, but would using TreeSets be out of the question for efficiency? If not, then intersections could be done using 2 iterators and 2 while loops, minimizing the number of comparisons that would need to be done. — Robert Benson, Jun 24 '16 at 19:01

Tim B · Answer 1 · 2014-01-06T20:16:52.927

1

You would get a large improvement by moving the HashSet outside of the loop.

If the HashSet really has only got a few entries in it then you are probably actually just as fast to use an Array - since traversing an array is much simpler/faster. I'm not sure where the threshold would lie but I'd measure both - and be sure that you do the measurements correctly. (i.e. warm up loops before timed loops, etc).

One thing to try might be using a sorted array for the things to compare against. Scan until you go past current and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.

edited Jan 06 '14 at 20:16

answered Jan 06 '14 at 19:14

Tim B

40,716
16
83
128

Thanks! I'll try keeping it in the Array and see what happens there. I am (surprisingly) not getting noticeable gain by moving the HashSet outside of the loop (on a benchmark part of my dataset that takes about 15 minutes to run). I was hoping I'd get a fairly big win by doing that. :-/ I'm not sure how to do the measurements correctly wrt warm-up loops etc. I'll look that up. I have not been doing anything but running my code on a chunk of my data and comparting how long it takes. – PurpleVermont Jan 06 '14 at 19:41
1

Here is a (very short) answer I wrote a while ago on benchmarking http://stackoverflow.com/questions/20655963/java-benchmarking-why-is-the-second-loop-faster/20655996#20655996 – Tim B Jan 06 '14 at 20:11
Added an improvement to the array idea. – Tim B Jan 06 '14 at 20:17

score 0 · Answer 2 · answered Jan 06 '14 at 19:18

0

If you want to optimize for this function (not sure if it actually works in your context) you could assign each unique String an Int value, when the String is added to the UVertex set that Int as a bit in a BitSet.

This function should then become a set.or(otherset) and a set.and(otherset). Depending on the number of unique Strings that could be efficient.

answered Jan 06 '14 at 19:18

Rickard

7,239
1
18
11

There will be probably on the order of a million unique strings. Would this still be a reasonable option to try? To assign unique integers, would it make sense just to put all the strings into a Hash of some kind and use the hash values? – PurpleVermont Jan 06 '14 at 19:43

Efficient Intersection and Union of Lists of Strings

2 Answers2