
I am seeking general guidance on how to quickly process a large file (it will have millions of rows) and advice on how to approach this, as well as on how to improve the code below. I assume the last part, where the output is written to the file line by line, is not optimal, but I am not sure how else to approach it, i.e. what the fastest way would be and what the mechanisms behind it are. A rough sketch of the chunked write I have in mind follows the code.

Any help is greatly appreciated.

import csv
import json
import pandas as pd

filename = 'filename.csv'

def getstuff():
    # Stream the pipe-delimited file row by row and yield only the
    # column that holds the embedded JSON (index 8).
    with open(filename, "rt") as csvfile:
        datareader = csv.reader(csvfile, delimiter="|")
        for row in datareader:
            yield row[8]

def get_json():
    # Parse each candidate field as JSON; unparseable records are
    # logged to a side file instead of aborting the run.
    with open('bad_records.txt', 'w') as f:
        for i in getstuff():
            if '"metrics"' in i:
                try:
                    yield json.loads(i)
                except json.JSONDecodeError:
                    f.write(f"{i}\n")

def make_dataframe():
    # Flatten the nested "metrics" object into a one-row DataFrame.
    for i in get_json():
        yield pd.json_normalize(i["metrics"])

with open('name.txt', mode="a") as f:
    for df in make_dataframe():
        # One to_csv call per record; this is the part I suspect is slow.
        df.to_csv(f, index=False, header=False, sep='\t', escapechar='\\')
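
To make that last part concrete, here is roughly the chunked version I imagine, where `to_csv` runs once per batch instead of once per record. This is an untested sketch: `CHUNK_SIZE` is a placeholder, and it assumes every `metrics` object flattens to the same columns (otherwise `pd.concat` would union them).

CHUNK_SIZE = 10_000  # placeholder; tune to available memory

with open('name.txt', mode="a") as f:
    batch = []
    for df in make_dataframe():
        batch.append(df)
        if len(batch) >= CHUNK_SIZE:
            # One concat and one to_csv call per chunk instead of per record.
            pd.concat(batch).to_csv(f, index=False, header=False, sep='\t', escapechar='\\')
            batch = []
    if batch:
        # Flush whatever is left after the loop.
        pd.concat(batch).to_csv(f, index=False, header=False, sep='\t', escapechar='\\')

Is something like this the right direction, and if so, why is it faster?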
Yami Mahō
  • Did you profile the individual steps? – jarmod Apr 22 '23 at 16:26
  • *I assume the last part where line by line is written to the file is not optimal* - you can try skipping it entirely for a test (so the `df.to_csv` would be a `pass`). My bet is on reading JSON-embedded-in-kinda-CSV being the slow part. – tevemadar Apr 22 '23 at 16:38
  • @jarmod I have done it for the first functions, and that one at least quickly offers some results with `cProfile.run()` (all zeroes, but then again I might be doing something wrong). The last one takes too much time, though, so I must say I have not profiled it to the fullest. Any advice on profiling that I missed? – Yami Mahō Apr 22 '23 at 16:42
  • Some ideas for [profiling Python generator functions](https://stackoverflow.com/questions/3570335/profiling-python-generators). Alternatively, try what @tevemadar suggested: replace the actual functionality with `pass` to determine the delta. – jarmod Apr 22 '23 at 17:09
  • I think this could be parallelized using [Dask delayed](https://examples.dask.org/applications/embarrassingly-parallel.html#Use-Dask-Delayed-to-make-our-function-lazy). If you provide examples of your input data files I could have a go at it. – Bill Apr 22 '23 at 17:19
  • @jarmod thanks for the post about profiling! It is really helpful! The first two functions ran and I got the statistics. However, the one where the dataframe is created takes a lot of time, so I will assume that is where the improvement needs to happen. – Yami Mahō Apr 22 '23 at 21:50
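
Following up on the profiling suggestions in the comments above, here is a minimal sketch that drives the whole pipeline under `cProfile` (the `pipeline.prof` filename is arbitrary). Because the functions are generators, profiling them in isolation reports near-zero time; the pipeline has to be consumed end to end before the cost of each stage shows up in the stats.

import cProfile
import pstats

def run_pipeline():
    # Consuming the final generator forces every upstream stage to run,
    # so cProfile can attribute time to getstuff/get_json/make_dataframe.
    with open('name.txt', mode="a") as f:
        for df in make_dataframe():
            df.to_csv(f, index=False, header=False, sep='\t', escapechar='\\')

cProfile.run('run_pipeline()', 'pipeline.prof')
pstats.Stats('pipeline.prof').sort_stats('cumulative').print_stats(10)

Replacing the `df.to_csv` line with `pass`, as tevemadar suggests, then isolates the cost of the write step.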

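For completeness, a rough sketch of the Dask delayed direction Bill mentions. It assumes `dask` is installed; the batch size and `scheduler="processes"` are illustrative choices, and building all tasks up front holds every extracted field in memory, so this trades memory for parallelism.

import dask
from dask import delayed

def batches(iterable, size=10_000):
    # Group the streamed JSON strings into fixed-size batches.
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

@delayed
def process_batch(fields):
    # Parse and flatten one batch; bad records are skipped here for brevity.
    frames = []
    for s in fields:
        if '"metrics"' in s:
            try:
                frames.append(pd.json_normalize(json.loads(s)["metrics"]))
            except json.JSONDecodeError:
                pass
    return pd.concat(frames) if frames else pd.DataFrame()

tasks = [process_batch(batch) for batch in batches(getstuff())]
results = dask.compute(*tasks, scheduler="processes")  # processes sidestep the GIL

with open('name.txt', mode="a") as f:
    for df in results:
        df.to_csv(f, index=False, header=False, sep='\t', escapechar='\\')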