I am running a statistics service for a video game: once per hour, I mine the game's leaderboard and create graphs etc. to track the players' progress.
Initially, I used a MultiIndex DataFrame, which looks similar to this:
import pandas as pd
import numpy as np

index_arrays = [np.array(["01.01.22"]*4 + ["02.01.22"]*4),
                np.array(["Name", "Level", "Money", "Guild"]*2)]
df = pd.DataFrame([["a", "b", "c"], [4, 5, 6], [7, 8, 9],
                   [10, 11, 12], ["a", "b", "d"], [16, 1, 17],
                   [18, 19, 20], [21, 22, 23]],
                  index=index_arrays, columns=["1", "2", "3"])
Output:

                 1   2   3
01.01.22 Name    a   b   c
         Level   4   5   6
         Money   7   8   9
         Guild  10  11  12
02.01.22 Name    a   b   d
         Level  16   1  17
         Money  18  19  20
         Guild  21  22  23
The structure is based on how the leaderboard is mined (the columns represent the rank). After updating this DataFrame, it has always been saved as a .csv file.
However, after a year or so, iterating through this DataFrame to collect the data for a specific player (which could not easily be filtered by either column or index) became really slow.
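For illustration, this is roughly what that per-player lookup amounted to (a simplified sketch based on the df from above; the helper name is made up). For every date block, the Name row has to be scanned first just to find the player's rank column:

def player_history(df, player):
    rows = []
    for date in df.index.get_level_values(0).unique():
        day = df.loc[date]                    # sub-frame with Name/Level/Money/Guild rows
        match = day.loc["Name"] == player     # boolean mask over the rank columns
        if match.any():
            rank = match.idxmax()             # first rank column holding the player's name
            rows.append({"Date": date,
                         "Level": day.at["Level", rank],
                         "Money": day.at["Money", rank],
                         "Guild": day.at["Guild", rank]})
    return pd.DataFrame(rows).set_index("Date")

player_history(df, "a")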
This was fixed by re-organizing the data in a more player-oriented way. Each player got their own .csv file (hundreds of files in total), which made accessing the data much easier, but at the same time significantly increased the required disk space.
An example for player "a" looks like this:
          Level  Money  Guild
01.01.22      4      7     10
02.01.22     16     18     21
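For context, this is a simplified sketch of how those per-player files are maintained (the folder and helper name are placeholders): one .csv per player, with one row appended after every mining run.

import os
import pandas as pd

def append_snapshot(player, date, level, money, guild, folder="players"):
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{player}.csv")
    row = pd.DataFrame([[level, money, guild]],
                       index=[date], columns=["Level", "Money", "Guild"])
    # write the header only when the file is created for the first time
    row.to_csv(path, mode="a", header=not os.path.exists(path))

# Reading one player's full history is then a single small file read:
# pd.read_csv("players/a.csv", index_col=0)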
This works for now, but I am wondering what the most efficient way to continue would be, both in terms of access speed and storage space. I am thinking about switching from .csv storage to an SQL solution. Would that be faster or require less disk space? Instead of having numerous .csv files, I could use numerous tables in a single database file. Another option I saw in the context of stock data is one big table in which each row represents the entry of one player at a certain time. It would look something like this and adds up to millions of rows:
   Name  Rank  Level  Money  Guild      Date
0     a     1      4      7     10  01.01.22
1     b     2      5      8     11  01.01.22
2     c     3      6      9     12  01.01.22
3     a     1     16     18     21  02.01.22
4     b     2      1     19     22  02.01.22
5     d     3     17     20     23  02.01.22
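If I went the SQL route with one big table, I imagine it would look roughly like this (a sketch using SQLite; the file, table and index names are just placeholders), where an index on (name, date) is supposed to keep per-player queries fast even with millions of rows:

import sqlite3
import pandas as pd

con = sqlite3.connect("leaderboard.db")
con.execute("""CREATE TABLE IF NOT EXISTS entries
               (name TEXT, rank INTEGER, level INTEGER,
                money INTEGER, guild INTEGER, date TEXT)""")
con.execute("CREATE INDEX IF NOT EXISTS idx_name_date ON entries (name, date)")

# one mining run appended as a batch (values taken from the first date above)
snapshot = pd.DataFrame({"name": ["a", "b", "c"], "rank": [1, 2, 3],
                         "level": [4, 5, 6], "money": [7, 8, 9],
                         "guild": [10, 11, 12], "date": ["01.01.22"] * 3})
snapshot.to_sql("entries", con, if_exists="append", index=False)

# a single player's history is then one indexed query
history_a = pd.read_sql_query(
    "SELECT date, level, money, guild FROM entries WHERE name = ? ORDER BY date",
    con, params=("a",))
con.commit()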
I realized that this problem may be similar to collecting historical stock market data, where one has time-series data of several properties (price, volume, ...) for hundreds of stocks, so I am sure some people have already found an elegant and fast solution to this kind of problem. Does anyone have experience with this, and maybe even some measured benchmarks?