
The problem

I have a csv file called data.csv. On each row I have:

timestamp: int
account_id: int
value: float

for instance:

timestamp,account_id,value
10,0,0.262
10,0,0.111
13,1,0.787
14,0,0.990
  • This file is ordered by timestamp.
  • The number of rows is too big to store them all in memory.
  • Order of magnitude: 100 M rows, 5 M accounts.

How can I quickly get all rows for a given account_id? What would be the best way to make the data accessible by account_id?

Things I tried

To generate a sample:

import os
import random
import shutil

import tqdm

N_ROW = 10**6
N_ACCOUNT = 10**5

# Generate data to split
with open('./data.csv', 'w') as csv_file:
  csv_file.write('timestamp,account_id,value\n')
  for timestamp in tqdm.tqdm(range(N_ROW), desc='writing csv file to split'):
    account_id = random.randint(1, N_ACCOUNT)
    data = random.random()
    csv_file.write(f'{timestamp},{account_id},{data}\n')

# Clean the result folder
if os.path.isdir('./result'):
  shutil.rmtree('./result')
os.mkdir('./result')

Solution 1

Write a script that creates a file for each account, reads rows one by one from the original csv, and writes each row to the file that corresponds to its account (opening and closing a file for each row).

Code:

# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
  next(data_file) # skip header
  for row in data_file:
    account_id = row.split(',')[1]
    account_file_path = f'result/{account_id}.csv'
    file_opening_mode = 'a' if os.path.isfile(account_file_path) else 'w'
    with open(account_file_path, file_opening_mode) as account_file:
      account_file.write(row)
    p_bar.update(1)

Issues:

It is quite slow (I think it is inefficient to open and close a file on each row). It takes around 4 minutes for 1 M rows. Even if it works, will reading be fast? Given an account_id I know the name of the file I should read, but the system has to look through 5 M files to find it. Should I create some kind of binary tree of folders, with the files as the leaves?
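One way to avoid having 5 M files in a single folder is to shard the output into subfolders keyed by a prefix or hash of the account_id, so each folder stays small and lookups go straight to the right file. This is only a minimal sketch of that idea (N_SHARDS and the folder layout are assumptions, not part of the original setup):

import os

N_SHARDS = 1000  # hypothetical fan-out; tune so each folder holds a few thousand files

def shard_dir(account_id):
  # e.g. account 33725 -> ./result/725
  return f'./result/{int(account_id) % N_SHARDS:03d}'

def account_file_path(account_id):
  return f'{shard_dir(account_id)}/{account_id}.csv'

# Create the shard folders once, before splitting
for shard in range(N_SHARDS):
  os.makedirs(f'./result/{shard:03d}', exist_ok=True)

Given an account_id, the reader computes the same shard and opens the file directly, so no directory with millions of entries ever has to be scanned.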

Solution 2 (works on a small example, not on the large csv file)

Same idea as solution 1, but instead of opening / closing a file for each row, keep the open files in a dictionary.

Code:

# A dict that will contain all open files
account_file_dict = {}

# Given an account id, return the file to write to (creating a new file if it does not exist yet)
def get_account_file(account_id):
  file = account_file_dict.get(account_id, None)
  if file is None:
    file = open(f'./result/{account_id}.csv', 'w')
    account_file_dict[account_id] = file
  return file

# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
  next(data_file) # skip header
  for row in data_file:
    account_id = row.split(',')[1]
    account_file = get_account_file(account_id)
    account_file.write(row)
    p_bar.update(1)

Issues:

I am not sure it is actually faster, and I have to keep 5 M files open simultaneously (one per account). I get an error: OSError: [Errno 24] Too many open files: './result/33725.csv'.
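A middle ground between solutions 1 and 2 is to cap the number of simultaneously open files and evict the least recently used handle when the cap is reached, reopening evicted files in append mode later. This is a minimal sketch of that idea (MAX_OPEN_FILES is a made-up limit; keep it below the OS descriptor limit reported by `ulimit -n`):

from collections import OrderedDict

MAX_OPEN_FILES = 500  # hypothetical cap on simultaneously open files
open_files = OrderedDict()  # account_id -> file handle, ordered by last use

def get_account_file(account_id):
  file = open_files.get(account_id)
  if file is not None:
    open_files.move_to_end(account_id)  # mark as most recently used
    return file
  if len(open_files) >= MAX_OPEN_FILES:
    _, oldest = open_files.popitem(last=False)  # evict the least recently used handle
    oldest.close()
  # 'a' so a reopened file keeps the rows that were already written
  file = open(f'./result/{account_id}.csv', 'a')
  open_files[account_id] = file
  return file

# After the split loop, close whatever is still open:
# for file in open_files.values():
#   file.close()

The split loop itself stays the same as in solution 2; only get_account_file changes.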

Solution 3 (works on a small example, not on the large csv file)

Use an awk command; solution from: split large csv text file based on column value

Code:

After generating the file, run:

awk -F, 'NR==1 {h=$0; next} {f="./result/"$2".csv"} !($2 in p) {p[$2]; print h > f} {print >> f}' ./data.csv

Issues:

I get the following error: input record number 28229, file ./data.csv source line number 1 (28229 is just an example; it usually fails around 28 k records). I assume it is also because I am opening too many files.
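Another option, in the spirit of the batching suggestion from the comments, is to buffer rows per account in a dictionary and flush all buffers to disk once a size threshold is reached, so each account file is opened once per flush instead of once per row. A minimal sketch (FLUSH_EVERY is an assumed threshold, to be tuned to the available memory):

from collections import defaultdict

FLUSH_EVERY = 1_000_000  # hypothetical number of buffered rows before flushing

def flush(buffers):
  # One open/write/close per account per flush, instead of per row
  for account_id, rows in buffers.items():
    with open(f'./result/{account_id}.csv', 'a') as account_file:
      account_file.writelines(rows)
  buffers.clear()

buffers = defaultdict(list)
buffered_rows = 0
with open('./data.csv') as data_file:
  next(data_file)  # skip header
  for row in data_file:
    buffers[row.split(',')[1]].append(row)
    buffered_rows += 1
    if buffered_rows >= FLUSH_EVERY:
      flush(buffers)
      buffered_rows = 0
flush(buffers)  # write out whatever is left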

Vince M
  • Can you share the input you are passing and the output you are expecting? – codeholic24 Sep 16 '22 at 14:59
  • As far as I know, to make the process faster, split the process into batches – codeholic24 Sep 16 '22 at 15:00
  • @codeholic24 The code that generates the file is given just under `Things I tried`; it just creates a csv file with random data `timestamp, account_id, value`. The output of each solution is a `result` folder that contains multiple csv files; the name of each csv file is `<account_id>.csv` and it stores all lines from the original csv file that belong to that account. – Vince M Sep 16 '22 at 15:06
  • @codeholic24 I did not try batching; it might indeed speed up the process (not sure it will speed it up a lot, but I will try, thank you). – Vince M Sep 16 '22 at 15:08
  • @VinceM : how big is the total file ? within `3 GB` `mawk2` and `gawk` probably can handle it all in memory and do it in one shot. Or use `gnu-parallel` or something to run 9 `awk`s, each given a single digit as their own customized input, and have each `awk` instance only handle `account_id`s starting with that custom digit – RARE Kpop Manifesto Sep 16 '22 at 15:42
  • @RAREKpopManifesto the file is around 15 GB (the real `data.csv` does not contain only 3 columns, it was a simplified example). I am not an expert with `awk` (I copied and pasted the solution from `split large csv text file based on column value` and adapted it to my data). I do not know what `mawk2`, `gawk`, `gnu-parallel` are but I will try to look into them. – Vince M Sep 16 '22 at 15:48
  • Have you considered importing the CSV into a database and using queries and views to select (and process: sum, etc.) data by account? – Dennis Williamson Sep 16 '22 at 15:53
  • @DennisWilliamson Yes, I did consider this solution (I was tempted to add it to the list of solutions in the post). It would work fine but it has a downside: I need to install, set up and populate the db. I am not alone on the project I am working on, so all the other developers would have to install new dev dependencies. Worst case, I could choose this solution, but if possible I would like to avoid it and stay with just python code. – Vince M Sep 16 '22 at 15:59
  • @VinceM: if you're cool with calling `awk` from within `python`, then my response below can be a rough framework for the approach I was mentioning – RARE Kpop Manifesto Sep 16 '22 at 16:14
  • @VinceM have a look at https://docs.python.org/3/library/sqlite3.html, it doesn't require any dependencies – westandskif Sep 19 '22 at 15:59

2 Answers


@VinceM :

While not quite 15 GB, I do have a 7.6 GB one with 3 columns :

-- 148 mn prime numbers, their base-2 log, and their hex

  in0: 7.59GiB 0:00:09 [ 841MiB/s] [ 841MiB/s] [========>] 100%
  148,156,631 lines 7773.641 MB (  8151253694)  /dev/stdin

f="$( grealpath -ePq ~/master_primelist_19d.txt )"

( time ( for __ in '12' '34' '56' '78' '9'; do
     ( gawk -v ___="${__}" -Mbe 'BEGIN {
               ___="^["(___%((_+=_^=FS=OFS="=")+_*_*_)^_)"]"
       }  ($_)~___ && ($NF = int(($_)^_))^!_' "${f}" & ) done |
  gcat - ) ) | pvE9 > "${DT}/test_primes_squared_00000002.txt"

 out9: 13.2GiB 0:02:06 [98.4MiB/s] [ 106MiB/s] [ <=> ]
  • Using only 5 instances of gawk with the big-integer package gnu-GMP, each with a designated subset of leading digit(s) of the prime number, it managed to calculate the full-precision squaring of those primes in just 2 minutes 6 seconds, yielding an unsorted 13.2 GB output file.

If it can square that quickly, then merely grouping by account_id should be a walk in the park.
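A rough Python-only equivalent of this partitioning idea (not the gawk pipeline above) is to run a few workers in parallel, each scanning the whole csv but only writing the accounts of its own partition, so two workers never touch the same output file. A minimal sketch, assuming the data.csv layout from the question:

import multiprocessing

N_WORKERS = 5  # hypothetical: one worker per partition of account_ids

def split_partition(worker_id):
  # Each worker reads the full file but keeps only its own accounts
  with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
      account_id = row.split(',')[1]
      if int(account_id) % N_WORKERS == worker_id:
        # Inside a worker you could reuse the buffered or LRU writer
        # from the question instead of opening a file per row
        with open(f'./result/{account_id}.csv', 'a') as account_file:
          account_file.write(row)

if __name__ == '__main__':
  with multiprocessing.Pool(N_WORKERS) as pool:
    pool.map(split_partition, range(N_WORKERS))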

RARE Kpop Manifesto
  1. Have a look at https://docs.python.org/3/library/sqlite3.html. You could import the data, create the required indexes and then run queries normally. No dependencies except for Python itself; a minimal sketch is given after the code below.

  2. https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html

  3. If you have to query the raw data every time and you are limited to plain python only, then you can either write code to read it manually and yield matched rows, or use a helper like this:

from convtools.contrib.tables import Table
from convtools import conversion as c

iterable_of_matched_rows = (
    Table.from_csv("tmp/in.csv", header=True)
    .filter(c.col("account_id") == "1")
    .into_iter_rows(dict)
)

However, this won't be faster than reading a 100 M row csv file with csv.reader.
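For option 1, a minimal sqlite3 sketch could look like the following (the database file name, the table name and the example account_id are placeholders, not part of the question):

import csv
import sqlite3

conn = sqlite3.connect('./data.db')  # hypothetical database file next to data.csv
conn.execute('CREATE TABLE IF NOT EXISTS rows (timestamp INTEGER, account_id INTEGER, value REAL)')

# One-time import, streamed so the csv never has to fit in memory
with open('./data.csv') as data_file:
  reader = csv.reader(data_file)
  next(reader)  # skip header
  conn.executemany('INSERT INTO rows VALUES (?, ?, ?)', reader)
conn.commit()

# The index on account_id is what makes per-account lookups fast afterwards
conn.execute('CREATE INDEX IF NOT EXISTS idx_account ON rows(account_id)')
conn.commit()

# Query all rows of one account
for row in conn.execute('SELECT timestamp, account_id, value FROM rows WHERE account_id = ?', (42,)):
  print(row)
conn.close()

After the one-time import, each per-account query only touches the index and the matching rows instead of scanning the whole file.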

westandskif