Extract efficiently a list of Files from List with a list of names

Question

I have a HUGE list of Files List<File> (1 Giga) and a small list of file names List<String>, around 20 file names. Now I want to extract the sublist of Files that contains only the files from the other list - the list of file names - by checking their names (File.name).

For Example: Huge List<File> [1A, 2A, 3A, 4A,1B, 2B, 3B, 4B, 5B, 6B, 1C, 2C, 3C, 4C] and file names list List<String> is [1A, 2B, 3C]. So I want all the files from the huge List<File> that their names are 1A and 2B and 3C.

What would be the most elegant and efficient way to do it in Kotlin in a Kotlin way.

When you say ‘sublist’, do you mean the first (or alternatively the longest) subsequence of _consecutive_ files with a matching filename? Or do you want to extract _every_ matching file? — gidds, Jan 15 '21 at 23:25
I want to extract the "Files" from the huge list. The second list with the names is only to help what files I should extract from the huge list. — Deksterious, Jan 16 '21 at 10:26
Yes, but that doesn't answer my question. You want to create a new list, containing _some_ of the files from your huge list, by checking their filenames. But do you want _every_ matching file from the huge list? Or only some sublist of _consecutive_ matching files from it? For example, if your huge list had files `[1A, 2B, 3A, 4B, 5B, 6A, 7B]`, and you want to extract those ending in `B`, would you want the result to be `[2B, 4B, 5B, 7B]`? Or just `[2B]` (the first such sublist)? Or `[4B, 5B]` (the longest such sublist)? Or something else? — gidds, Jan 16 '21 at 11:28
No, I want all the files from the huge list that their names are in the second list of names. For Example: Huge List `[1A, 2A, 3A, 4A,1B, 2B, 3B, 4B, 5B, 6B, 1C, 2C, 3C, 4C]` and file names list List is `[1A, 2B, 3C]`. So I want all the files from the huge List that their names are `1A` and `2B` and `3C`. I am looking only for the most efficient way in Kotlin, as I am a java developer, not kotlin. Btw, the names are always exact names in both lists, not prefixes nor suffixes. — Deksterious, Jan 16 '21 at 11:41

score 1 · Accepted Answer · answered Jan 16 '21 at 12:14

Based on your clarifications, you're not looking for a sublist at all. (That's a consecutive portion of a list, as returned by e.g. java.util.List.subList().)

Instead, what you want is to filter the list. And the usual way to do this is with filter(), which is very simple and clear:

val result = hugeList.filter{ it.name in filenameList }

That will do a linear scan through your huge list; for each file, it will scan through your list of filenames until it finds a match (or runs out of filenames). Then it appends each match to the resulting list.

In most cases, that would be perfectly good. But in this case, where performance is clearly an issue, that has two issues:

First, the need to scan through your list of filenames each time. If there are very few filenames, that's probably fine. Otherwise, it might be more efficient to first convert it to a set (which can be checked in constant time):

val filenameSet = filenameList.toSet()
val result = hugeList.filter{ it.name in filenameSet }

And second, it'll initially allocate only a small amount of space for the resulting list; as it grows and grows, it may need to keep reallocating a larger space and copying the elements over. (For example, ArrayList grows by 50% each time.) So if you have a rough idea of how large it might grow, you might want to pre-allocate a list of a suitable capacity, and then filterTo() it, to reduce the reallocation and copying. For example, if you expect roughly half the files to match:

val filenameSet = filenameList.toSet()
val result: MutableList<File> = ArrayList(hugeList.size / 2)
hugeList.filterTo(result){ it.name in filenameSet }

(I think that's about as efficient as you're likely to get, without either trying to parallelise it (see this question, though that could be a lot harder and might not give much improvement here), or using any opportunities the bigger picture gives you to change the problem.)

Wow, thanks for that answer. Maybe deleting the already found files from original huge list would also help? — Deksterious, Jan 16 '21 at 13:03
@Deksterious I wouldn't recommend updating the original huge list at all. If it's implemented with an array, then deleting an item would mean copying every subsequent item up one, which would need to update most of the array; it would also spoil the locality of reference and lose the benefit of memory caching. (And since the filter scans from the first item, deleting them would turn it from an O(n) process into an O(n²) one!)… — gidds, Jan 16 '21 at 13:11
…(If it was a linked list, instead, then it wouldn't have those disadvantages; but deletion would still involve unnecessary work.) If you don't want the huge list afterwards, then just ensure you don't keep a reference to it, and let the garbage collector take care of it. — gidds, Jan 16 '21 at 13:11

Extract efficiently a list of Files from List with a list of names

1 Answers1