4

Assume you have something like

class Person {
  LocalDate bornOn;
  LocalDate diedOn;
}

Let's say you have a bunch of Person instances that you can store any way you like.

What's the best way of writing an efficient function that can list all people that were alive at a given time?

The data structure should be efficiently mutable as well, in particular in terms of adding new elements.

I.e. conceptually something like

List<Person> alive(List<Person> people, LocalDate date) {
  return people.stream()
      .filter(x -> x.bornOn.compareTo(date) <= 0 && x.diedOn.compareTo(date) > 0)
      .collect(Collectors.toList());
}

Only more efficient.

My first gut feeling would be having two NavigableMaps

NavigableMap<LocalDate, Person> peopleSortedByBornOn;
NavigableMap<LocalDate, Person> peopleSortedByDiedOn;

each could be queried with headMap() / tailMap() of the given date, and the intersection of those queries would be the result.
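In code, that gut feeling might look something like this (assuming several people can share a date, so each date maps to a list of people; a plain HashSet works for the intersection because Person uses identity equality):

import java.time.LocalDate;
import java.util.*;

class AliveIndex {
  private final NavigableMap<LocalDate, List<Person>> byBornOn = new TreeMap<>();
  private final NavigableMap<LocalDate, List<Person>> byDiedOn = new TreeMap<>();

  void add(Person p) {
    byBornOn.computeIfAbsent(p.bornOn, k -> new ArrayList<>()).add(p);
    byDiedOn.computeIfAbsent(p.diedOn, k -> new ArrayList<>()).add(p);
  }

  // Intersect "born on or before date" with "died strictly after date".
  List<Person> alive(LocalDate date) {
    Set<Person> bornByThen = new HashSet<>();
    byBornOn.headMap(date, true).values().forEach(bornByThen::addAll);
    List<Person> result = new ArrayList<>();
    for (List<Person> batch : byDiedOn.tailMap(date, false).values())
      for (Person p : batch)
        if (bornByThen.contains(p)) result.add(p);
    return result;
  }
}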

Is there any faster, or significantly more convenient solution though? Maybe even some sort of widely used Java collection/map type that'd support such an operation?

Bogey
  • If you can use any data structure you like and the only aim is a fast query, then you could use a (sorted) hash map with all dates from the first born date to the last died date as keys and a Set/List of Person as values. This is the fastest way for the query. – Tobias Otto Oct 15 '19 at 13:20
  • @TobiasOtto Thanks, absolutely indeed - I should have mentioned, the structure should be efficiently mutable as well. – Bogey Oct 15 '19 at 13:24
  • For every optimization that I can think of, you would have to put pressure on another constraint. Some questions that may help... Do you care about stressing memory? When you say mutable, do you mean changing the born/died values or adding new persons? Should the algorithm be thread-safe, or do you always want to filter using a single thread? Do you care about initial-load time (creating the structure might be slow)? Do you care about remove/insert performance? – Ioannis Deligiannis Oct 15 '19 at 13:46
  • @IoannisDeligiannis : 1) Adding new people, assume bornOn / diedOn will never change for a given person. 2) single-threaded 3) Assume you start without any data, and you receive Person objects on the go in a fast, but unpredictable stream of information 4) Removing isn't needed; insertion is as per the previous point – Bogey Oct 15 '19 at 13:51
  • @Bogey What is the maximum number of People you would have? – nice_dev Oct 15 '19 at 13:52
  • @vivek_23 Let's say no more than 10 million – Bogey Oct 17 '19 at 12:29

3 Answers

7

For theoretical purposes, I would like to mention geometric data structures, like quad trees. Treat each person as a point with (born, died) coordinates, where died >= born.

    d         b=d
    |    | - /
    | +  |  /
    |    | /
  D |____|/
    |   /:
    |- / :
    | /  :
    |/___:_____ b
         D

The points are all located in the upper triangle, and + marks the rectangular area containing the people alive at date D. The rectangle is open-ended to the left and at the top.

A geometric data structure could do the job, and there are databases that can handle such geometric queries.

I would love to see an implementation, though I would not bet on a speed advantage; maybe with huge numbers of people.
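For illustration, here is a rough, untuned sketch of a point quad tree over (born, died) coordinates in epoch days; all names are made up, and the root's bounding box must cover all data:

import java.util.*;

class QuadTree {
  private static final int CAPACITY = 16;
  private final long x0, y0, x1, y1;                      // inclusive bounding box
  private final List<long[]> points = new ArrayList<>();  // {born, died, personId}
  private QuadTree[] children;                            // null until split

  QuadTree(long x0, long y0, long x1, long y1) {
    this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
  }

  void insert(long born, long died, long personId) {
    if (children != null) { child(born, died).insert(born, died, personId); return; }
    points.add(new long[]{born, died, personId});
    if (points.size() > CAPACITY && x1 > x0 && y1 > y0) split();
  }

  private void split() {
    long mx = x0 + (x1 - x0) / 2, my = y0 + (y1 - y0) / 2;
    children = new QuadTree[]{
        new QuadTree(x0, y0, mx, my),     new QuadTree(mx + 1, y0, x1, my),
        new QuadTree(x0, my + 1, mx, y1), new QuadTree(mx + 1, my + 1, x1, y1)};
    for (long[] p : points) child(p[0], p[1]).insert(p[0], p[1], p[2]);
    points.clear();
  }

  private QuadTree child(long x, long y) {
    long mx = x0 + (x1 - x0) / 2, my = y0 + (y1 - y0) / 2;
    return children[(x <= mx ? 0 : 1) + (y <= my ? 0 : 2)];
  }

  // Collect every point with born <= d && died > d (the + rectangle above).
  void query(long d, List<long[]> out) {
    if (x0 > d || y1 <= d) return;                        // subtree misses the rectangle
    if (children == null) {
      for (long[] p : points)
        if (p[0] <= d && p[1] > d) out.add(p);
    } else {
      for (QuadTree c : children) c.query(d, out);
    }
  }
}

LocalDate.toEpochDay() provides the coordinates, and the personId slot can index into whatever collection holds the actual Person instances.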

Joop Eggen
  • Based on an [old answer](https://stackoverflow.com/questions/10170544/getting-results-between-two-dates-in-postgresql), even a straightforward implementation should be pretty decent, and with [PostGIS](https://postgis.net/workshops/postgis-intro/indexing.html) you get spatial indexing and, I'd suppose, even more performance. – Kayaman Oct 16 '19 at 06:27
1

Given the clarified constraints, I would keep it simple and use a map that keeps references to the people alive on each given day, effectively creating an index.

Map<LocalDate,LinkedList<Person>> aliveMap;

Each put costs O(1) for the map plus O(1) for the LinkedList append; note that inserting one person performs one such put per day of their lifespan. A get, on the other hand, is as good as it gets: O(1), assuming a good hash function.

Memory-wise, you would incur the cost of the extra references, and this can be significant (~80 years × 365 days × 8 bytes per reference on a 64-bit VM, i.e. roughly 233,600 bytes per person).

This approach yields optimal performance on get operations, probably the worst memory footprint, and average performance on put operations.
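A minimal sketch of the full per-day index, assuming the Person class from the question (class and method names here are illustrative):

import java.time.LocalDate;
import java.util.*;

class DailyIndex {
  private final Map<LocalDate, List<Person>> aliveMap = new HashMap<>();

  // One O(1) put per day of the lifespan (bornOn inclusive, diedOn exclusive).
  void add(Person p) {
    for (LocalDate d = p.bornOn; d.isBefore(p.diedOn); d = d.plusDays(1))
      aliveMap.computeIfAbsent(d, k -> new LinkedList<>()).add(p);
  }

  // O(1): the list for the requested day is already materialised.
  List<Person> alive(LocalDate date) {
    return aliveMap.getOrDefault(date, Collections.emptyList());
  }
}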

Variation: Instead of creating the full index, you can create coarser buckets, e.g. yearly ones, where you first fetch everyone alive at some point during a given year and then filter out the dead (see the sketch at the end of this answer).

Map<Integer,LinkedList<Person>> aliveMap;

NB: I assume that your data span hundreds of years and do not cover the whole world population (7.5 billion). If you were only looking at a 50-100 year window, then there may be more efficient specialisations.
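A sketch of the yearly-bucket variation, again with illustrative names, trading a little filtering work on get for far less memory:

import java.time.LocalDate;
import java.util.*;

class YearlyIndex {
  private final Map<Integer, List<Person>> byYear = new HashMap<>();

  // Register the person under every calendar year their lifespan touches.
  void add(Person p) {
    for (int y = p.bornOn.getYear(); y <= p.diedOn.getYear(); y++)
      byYear.computeIfAbsent(y, k -> new ArrayList<>()).add(p);
  }

  // Fetch the year's bucket, then filter down to the exact date.
  List<Person> alive(LocalDate date) {
    List<Person> result = new ArrayList<>();
    for (Person p : byYear.getOrDefault(date.getYear(), Collections.emptyList()))
      if (!p.bornOn.isAfter(date) && p.diedOn.isAfter(date))
        result.add(p);
    return result;
  }
}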

Ioannis Deligiannis
  • Thanks for your suggestion! If I understand your approach right, the following could be an issue: assume no person is born or dies on, say, Jan 1st 2020. You might still want to know who's alive on this date. You wouldn't have a map entry for it though - unless you add entries for every single day from now until all eternity (or the next 100-something years), which would be highly memory intensive. At the moment I'm using something similar though with a navigable map, storing only dates on which the overall state changes (any person is born or dies), and looking up the closest less-or-equal date – Bogey Oct 16 '19 at 08:22
  • It is actually the latter, i.e. store for every day. Given that the average life expectancy is ~80 years, you would have approx. 80 entries per person (hence the 80x in the above calculations). The idea is to store the same object instance, so you only incur the cost of the pointer. As I mentioned in the comment above, the only way that I believe you can optimize this is to stress another resource, in this case memory. However, the pointer footprint should be negligible compared to the data inside the object. – Ioannis Deligiannis Oct 16 '19 at 08:30
  • Ah. Assuming you query survival status for days (rather than years), I'd think we'd need 80 × 365 entries though; if covering the maximum lifespan rather than the expected one, probably more like 125 × 365 or so. Plus potentially a lot more if including people born in the past (but yes, agreed - it's certainly a tradeoff between memory vs CPU stress) – Bogey Oct 16 '19 at 08:33
  • You are right... It was early morning, so my brain had not yet booted properly. The algorithm and its characteristics are sensitive to the average 80-90 years, not the maximum. I'll see if I can give a better code example later. – Ioannis Deligiannis Oct 16 '19 at 09:04
0

The only way I can think of to make this more efficient is to create your own custom data structure. For example, create your own HashMap in Java in which you override the put method. This way, when you insert a Person object into the map, you know from the moment of insertion whether that person is alive or dead.

Here is an example of how to create a custom HashMap.
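Read one way, that suggestion could be sketched as follows, assuming the alive/dead classification is made relative to a reference date fixed at construction time (all names here are invented for illustration):

import java.time.LocalDate;
import java.util.HashMap;

class StatusMap extends HashMap<Person, Boolean> {
  private final LocalDate referenceDate;

  StatusMap(LocalDate referenceDate) { this.referenceDate = referenceDate; }

  // Classify the person the moment it is inserted: alive at referenceDate?
  @Override
  public Boolean put(Person p, Boolean ignored) {
    boolean alive = !p.bornOn.isAfter(referenceDate) && p.diedOn.isAfter(referenceDate);
    return super.put(p, alive);
  }
}

A call like statusMap.put(somePerson, null) would then store the alive/dead flag computed at insertion time rather than whatever value the caller passed.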

Andrei Tigau