Parse list of strings for speed

Question

Background

I have a function called get_player_path that takes in a list of strings player_file_list and a int value total_players. For the sake of example i have reduced the list of strings and also set the int value to a very small number.

Each string in the player_file_list either has a year-date/player_id/some_random_file.file_extension or year-date/player_id/IDATs/some_random_number/some_random_file.file_extension

Issue

What i am essentially trying to achieve here is go through this list and store all unique year-date/player_id path in a set until it's length reaches the value of total_players

My current approach does not seem the most efficient to me and i am wondering if i can speed up my function get_player_path in anyway??

Code

def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)


player_file_list = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]

print(get_player_path(player_file_list, 2))

Output

['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']

***"Each string in the player_file_list has a `year-date/player_id/some_random_file.file_extension`"*** - The following value doesn't match that pattern `2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat`, should it also count as "_unique year-date/player_id path_"? — Pedro Lobito, Nov 30 '20 at 20:45
@KarlKnechtel that can be eliminated but essentially i am trying to return a set that equals to that number because my other function will take the set and process all those players. On a given day i might want to process 1000 players and on another day i might want to process 500..etc. — Sluna, Nov 30 '20 at 20:48
@PedroLobito I have updated my question to highlight that as well. — Sluna, Nov 30 '20 at 20:50
One slight optimization would be not computing the set’s length at each iteration. — L. B., Nov 30 '20 at 20:57
@L.B. I thought about it, but was lazy :) will create a new version and post the link. — Pedro Lobito, Nov 30 '20 at 20:58
@PedroLobito your solution seems neat, do you mind explaining how it is faster than my current solution? I would love to learn the difference so i can become a more mindful programmer. You can also post it as answer so i can upvote it for future readers. — Sluna, Nov 30 '20 at 21:07
But in your example it is not sorted. If it's sorted, you should add to the description because it changes everything. — janluke, Nov 30 '20 at 21:41
They are sorted by date @sboby, does that make a difference? — Sluna, Nov 30 '20 at 21:41
If they are sorted only by date, no. If they were sorted by date/id, it made a huge difference because you would not need the sorting at the end which has the dominant cost of the entire function. — janluke, Nov 30 '20 at 21:43
@sboby yes id should have fixed length , by fix length you mean fixed number of digits in the player id right? — Sluna, Nov 30 '20 at 22:39
@Sluna Yes. I mean that len(id) is the same for all IDs in the list. — janluke, Nov 30 '20 at 23:04

janluke · Answer 1 · 2020-12-01T01:20:07.487

Let's analyze your function first:

your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.

Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.

Your approach is fine if you're just writing a Python script. If you really needed perfomances with huge lists, you would probably be using some other language. Nonetheless, if you really care about performances (or just to learn new stuff), you could try one of the following approaches:

replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice.

Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:

def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)       # <--- len(date) + len('/') + 1
        file_path = player_file[:end]         # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)

If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:

LENGTH = DATE_LENGTH + ID_LENGTH + 1   # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
...

EDIT: fixed the LENGTH initialization, I had forgotten to add 1

Thanks for the insightful write up. I believe it not possible to use `trie` in python unless i write my own or use a 3rd party library right? That is a path i want to avoid going down as i think built in libraries are optimized quite a bit. — Sluna, Nov 30 '20 at 23:05
You're correct. This should be fast: https://github.com/pytries/datrie. I'm just not sure if it'll be faster. Also, make sure you can iterate the the string in order (I've never used this package) if you intend to give it a try. — janluke, Nov 30 '20 at 23:30

Pedro Lobito · Accepted Answer · 2020-12-10T09:56:07.950

0

I'll leave this solution here which can be further improved, hope it helps.

player_file_list = (
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)

def get_player_path(l, n):
  pfl = set()
  for i in l:
    i = "/".join(i.split("/")[0:2])
    if i not in pfl:
      pfl.add(i)
    if len(pfl) == n:
      return pfl

  
  if n > len(pfl):
    print("not enough matches")
    return

print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}

Python Demo

edited Dec 10 '20 at 09:56

answered Nov 30 '20 at 21:08

Pedro Lobito

94,083
31
258
268

Thank you so much, can you explain why this is faster? I can see some things that make it faster like you do `if i not in plf` which i guess is better than directly adding everything to the set ? – Sluna Nov 30 '20 at 21:39
The code isn't much different from yours, but it uses `sets` instead of a `lists`, making it faster, specially if you've large `sets`. – Pedro Lobito Nov 30 '20 at 21:42
I am using a set as well `player_files_to_process = set()` in my code, just wondering how is your use of set different from mine? – Sluna Nov 30 '20 at 21:48
The `set` can be specifically handy to store a long list of items on `player_file_list`, but I don't see room for much improvement regarding performance on your code. – Pedro Lobito Nov 30 '20 at 22:04
This is faster because you are not returning a sorted list of paths, you are returning the set. You're basically skipping the most expensive step of the original code. This code is not equivalent to the original one by @Sluna. – janluke Dec 03 '20 at 14:39
Besides of that, this is identical to the original code but much worse, because it adds a try block that catches any exception (which is the epitome of a bad practice) and undesired behavior. The pass instruction there is completely useless (since there's an instruction just after it) and in case of exceptions the function prints a stack trace to stdout and then "not enough files". If the input is malformed, you want to raise a ValueError, not printing a stacktrace to stdout and continuing. – janluke Dec 03 '20 at 15:09
@PedroLobito Catching any exception is pretty much always wrong. I'm sorry, but you do seem confused about the pass instruction. The pass instruction is simply an instruction that does nothing and would be needed if it was the only instruction in the block: without it you would get a syntax error because a block can't be empty in Python. Try to insert a `raise Exception()` inside the try block. It will run the same with or without the "pass". – janluke Dec 09 '20 at 16:11
As I explained, this answer is not useful anyway, because the only difference with the original code (besides of worsening it with harmful code that does nothing for performances) is that it removes the sorting. – janluke Dec 09 '20 at 16:17
@PedroLobito I posted my answer on Dec 1st btw. – janluke Dec 09 '20 at 16:47
@janluke You're right about `pass`. I've updated my answer and demo. – Pedro Lobito Dec 10 '20 at 09:56

score 0 · Answer 3 · answered Nov 30 '20 at 21:19

0

Use dict so that you don't have to sort it since your list is already sorted. If you still need to sort you can always use sorted in the return statement. Add import re and replace your function as follows:

def get_player_path(player_file_list, total_players):
    dct = {re.search('^\w+-\w+-\w+/\w+',pf).group(): 1 for pf in player_file_list}
    return [k for i,k in enumerate(dct.keys()) if i < total_players]

answered Nov 30 '20 at 21:19

Aaj Kaal

1,205
1
9
8

OP wants to "_speed up_" the function , based on that, `re.search` isn't the best approach imo. – Pedro Lobito Nov 30 '20 at 21:21
re is faster than split. https://stackoverflow.com/questions/9602856/most-efficient-way-to-split-strings-in-python – Aaj Kaal Nov 30 '20 at 21:33
It depends on which regex you're referring to. Your assumption is quite abstract since certain regex can be very complex, making them extremely slow, while `split()` is normally faster. – Pedro Lobito Nov 30 '20 at 21:45
1

This code takes [`0.004031909629702568`](https://trinket.io/python3/f6f15f365c) to execute, while with split [`0.00005268678069114685`](https://trinket.io/python3/5e1578466d) – Pedro Lobito Nov 30 '20 at 21:56

Parse list of strings for speed

3 Answers3

Linked