
I am providing a statistics service for a video game: I mine the game's leaderboard once per hour and create graphs etc. to track the players' progress. Initially, I used a MultiIndex DataFrame, which looks similar to this:

import pandas as pd
import numpy as np
index_arrays = [np.array(["01.01.22"]*4+["02.01.22"]*4), 
                np.array(["Name","Level","Money","Guild"]*2)]

df = pd.DataFrame([["a","b","c"],[4,5,6],[7,8,9],
                   [10,11,12],["a","b","d"],[16,1,17],
                   [18,19,20],[21,22,23]], index=index_arrays, columns=["1","2","3"])

Output:

                 1   2   3
01.01.22 Name    a   b   c
         Level   4   5   6
         Money   7   8   9
         Guild  10  11  12
02.01.22 Name    a   b   d
         Level  16   1  17
         Money  18  19  20
         Guild  21  22  23

The structure is based on how the leaderboard is mined (columns represent the rank). After each update, the DataFrame has always been saved as a .csv file.

However, after a year or so, iterating through this DataFrame and collecting the data for a specific player (who is neither easily filterable by column nor by index) became really slow. I fixed this by re-organizing the data in a more player-oriented way: each player got their own .csv file (=> hundreds of files), which made accessing the data easier, but at the same time increased the required disk space significantly. An example for player "a" looks like this:

          Level  Money  Guild
01.01.22      4      7     10
02.01.22     16     18     21
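
To make the restructuring concrete, here is a rough sketch (not the original script) of how such a per-player frame could be extracted from the MultiIndex DataFrame above; the function name is only illustrative.

import pandas as pd

def player_history(df: pd.DataFrame, name: str) -> pd.DataFrame:
    rows = {}
    for date in df.index.get_level_values(0).unique():
        snapshot = df.loc[date]               # rows: Name/Level/Money/Guild, columns: ranks
        match = snapshot.loc["Name"] == name  # which rank column holds this player
        if match.any():
            rank = match.idxmax()             # first matching rank column
            rows[date] = snapshot.loc[["Level", "Money", "Guild"], rank]
    return pd.DataFrame(rows).T               # one row per date

player_history(df, "a").to_csv("player_a.csv")  # one file per player, as described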

This works for now, but I am wondering what would be the most efficient way to continue, both in terms of access speed and storage space. I am considering switching from .csv storage to an SQL solution. Would that be faster or require less disk space? Instead of numerous .csv files, I could use numerous tables in a single database file. Another option I saw in the context of stock data is one big table in which each row represents the entry of one player at a certain time. It would look something like this and adds up to millions of rows:

  Name  Rank  Level  Money  Guild      Date
0    a     1      4      7     10  01.01.22
1    b     2      5      8     11  01.01.22
2    c     3      6      9     12  01.01.22
3    a     1     16     18     21  02.01.22
4    b     2      1     19     22  02.01.22
5    d     3     17     20     23  02.01.22
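
For comparison, here is a minimal sketch of this "one big table" variant using SQLite from the standard library; the database file, table name, and index are only illustrative.

import sqlite3
import pandas as pd

long_df = pd.DataFrame({
    "Name":  ["a", "b", "c", "a", "b", "d"],
    "Rank":  [1, 2, 3, 1, 2, 3],
    "Level": [4, 5, 6, 16, 1, 17],
    "Money": [7, 8, 9, 18, 19, 20],
    "Guild": [10, 11, 12, 21, 22, 23],
    "Date":  ["01.01.22"] * 3 + ["02.01.22"] * 3,
})

con = sqlite3.connect("leaderboard.db")
long_df.to_sql("snapshots", con, if_exists="append", index=False)
# An index on (Name, Date) keeps per-player lookups fast even with millions of rows.
con.execute("CREATE INDEX IF NOT EXISTS idx_name_date ON snapshots (Name, Date)")

player_a = pd.read_sql("SELECT * FROM snapshots WHERE Name = ?", con, params=("a",))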

I realized that this problem may be similar to collecting historical stock market data in which one has time series data of several properties (price, volume, ...) for hundreds of stocks. So I am sure that some people have found the most elegant and fast solution to this kind of problem. Does anyone of you have experience with this and maybe even measured some benchmarks?

BenB

2 Answers


In terms of storage, there are other file types that are efficient for DataFrame storage and also give large improvements in parsing speed for big data sets (where CSV, for example, is slow):

  1. feather is good for short-term storage, preserves data types, and is fast to read. We use it as a form of "caching": an SQL database stores the data and supports efficient querying, and the resulting DataFrame is cached as a feather file (see the sketch at the end of this answer).
  2. parquet is better suited for longer-term storage (the linked article has a good explanation of the difference between the two).
  3. hdf5 is another good file format (more explanation in the link).
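
As a rough illustration (not from the original answer), writing the same frame in these formats could look like this; it assumes pyarrow (for feather/parquet) and PyTables (for HDF5) are installed, and the file names are made up.

import pandas as pd

df = pd.DataFrame({
    "Date":  ["01.01.22", "02.01.22"],
    "Level": [4, 16],
    "Money": [7, 18],
    "Guild": [10, 21],
})

df.to_feather("player_a.feather")            # fast read/write, keeps dtypes
df.to_parquet("player_a.parquet")            # compressed, good for long-term storage
df.to_hdf("player_a.h5", key="player_a")     # HDF5 container, can hold many tables

same = pd.read_feather("player_a.feather")   # round-trip back into a DataFrame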

These are just a couple of options, but hopefully they give a good introduction to different file types; as your data sets get larger, you will find CSV to be a bit slow. For an example post with some performance comparisons (including against CSV), see the link.

What the file formats above won't help with is flexibility and performance when querying the data before it goes into a DataFrame: with those files, any further querying implies reading the whole file into memory as a DataFrame and then filtering/querying it. If your data sets get large, you need persistence, and you want to query down to smaller data sets that pandas can handle, then SQL is not a bad solution (it is also nice because it can enforce data consistency, typing, etc.). Reading from SQL into a DataFrame is generally slower than reading feather etc., but if the queries make the data small enough, that difference may be negligible for you.
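
A minimal sketch of the caching pattern mentioned in point 1, assuming the SQLite table from the question's third layout; all names are illustrative.

import sqlite3
import pandas as pd

con = sqlite3.connect("leaderboard.db")
# Let SQL do the filtering, so only a small subset ends up in memory.
subset = pd.read_sql("SELECT * FROM snapshots WHERE Name = ?", con, params=("a",))

subset.to_feather("cache_player_a.feather")   # fast short-term cache of the query result
cached = pd.read_feather("cache_player_a.feather")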

kamster

You might consider polars. It is orders of magnitude faster than pandas. I also recommend storing your data in a binary format such as parquet or the Arrow IPC format (feather).

If you write to parquet, polars can even use the parquet file statistics in your queries, which might improve query performance.
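
A small sketch of what that could look like, assuming the long-format data has been written to a parquet file (the file name is made up); scan_parquet builds a lazy query, so the Name filter can be pushed down to the parquet reader instead of loading the whole file.

import polars as pl

player_a = (
    pl.scan_parquet("leaderboard.parquet")    # lazy: nothing is read yet
      .filter(pl.col("Name") == "a")          # predicate can use parquet statistics
      .select(["Date", "Level", "Money", "Guild"])
      .collect()                              # execute the query
)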

ritchie46
  • Curious about using polars: does it work nicely with other libraries that expect pandas DataFrames? Is it a simple replacement, or do you lose some of the ecosystem of packages that have nice integrations with pandas DataFrames? – kamster Feb 09 '22 at 01:00