0

Question is really wider than the title allows me to specify. I have a big file representing unordered numbered packets in the order they were received and the timestamp that corresponds to it, such as (arrows included for clarity, not really in file):

seq_1 ----> timestamp

seq_2 ----> timestamp

seq_3 ----> timestamp

seq_2 ----> timestamp

seq_5 ----> timestamp

seq_4 ----> timestamp

...

Timestamps always increase, but I might duplicate packets, packets out of order, etc. I have parsed the file to a list of strings, and must now decide the appropriate data structure to save it, taking into account that I need to:

  1. Remove all duplicated sequence numbers, only keeping the first one which arrived.
  2. Obtain an ordered iterable structure sorted by the sequence number.

The idea is that I could plot (not really going to do it, though) a graph bar, x axis being the sequence numbers and y axis being the timestamp. I need to manually find local maxima and minima, so I should be able to access the adjacent entries of any entry.

I have thought of parsing the list of lines to a dictionary of (sequence_number, timestamp), carefully not overwriting existing entries (condition 1), then turning it into a list of tuples and finally sorting the list by key. list should allow me to access adjacent entries, thus fulfilling condition 2. The parsed file is quite big, so I was wondering if there is a solution which would scale better (not requiring conversion between two data structures + posterior sorting).

Community
  • 1
  • 1
user2891462
  • 3,033
  • 2
  • 32
  • 60
  • A ["Tree Map"](https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html) would be appropriate. The Java standard library has such a collection. Here's an older question looking for one in Python: http://stackoverflow.com/q/6886294/1172714 – dsh Oct 30 '15 at 19:36

1 Answers1

1

The easiest bet is to just dump things into a dictionary and sort the keys at the end. The d.get call ensures that it keeps the first encountered value if one exists, or inserts a new value if it doesn't.

In [23]: s = """seq_1 ----> timestamp1
   ....: seq_2 ----> timestamp2
   ....: seq_3 ----> timestamp3
   ....: seq_2 ----> timestamp4
   ....: seq_5 ----> timestamp5
   ....: seq_4 ----> timestamp6
   ....: seq_9 ----> timestamp7
   ....: seq_10 ----> timestamp8
   ....: seq_6 ----> timestamp9
   ....: seq_7 ----> timestamp10
   ....: seq_2 ----> timestamp11
   ....: seq_4 ----> timestamp12"""

In [24]: d = {}

In [25]: for line in s.split("\n"):
    seq, ts = map(str.strip, line.split("---->"))
    d[seq] = d.get(seq, ts)
   ....:

In [26]: sorted(d.items(), key=lambda x: int(x[0][4:]))
Out[26]:
[('seq_1', 'timestamp1'),
 ('seq_2', 'timestamp2'),
 ('seq_3', 'timestamp3'),
 ('seq_4', 'timestamp6'),
 ('seq_5', 'timestamp5'),
 ('seq_6', 'timestamp9'),
 ('seq_7', 'timestamp10'),
 ('seq_9', 'timestamp7'),
 ('seq_10', 'timestamp8')]
Randy
  • 14,349
  • 2
  • 36
  • 42
  • I ended up doing this, but unless I'm missing something this is the idea I originally proposed and for which I was looking alternatives (parse into a dictionary, discarding duplicates; convert the dictionary to an unsorted list with `items()`; and sort it at the end with `sorted()`). – user2891462 Nov 02 '15 at 17:54
  • Yes, reading back through, this is basically what you proposed. Since it's a 4-line approach to the problem that doesn't need any external libraries, it's worth trying first. If it works quickly enough, you're done, and if not, you've got a benchmark to compare to. Trying to come up with alternatives before doing so smells like premature optimization. – Randy Nov 02 '15 at 18:32
  • Yes, you are right, I tried this method and only took a couple of seconds to run on a file with hundreds of thousands of lines, so I was optimising prematurely :) Just thought I could use the chance to learn some more about data structures. – user2891462 Nov 03 '15 at 11:05