
I have a huge MySQL table with millions of records. Based on a date column, I want to split the results and dump them into different text files.

For example:

col_1 col_2 date_col
a     b      20-12-2021 
a     b      20-12-2021 
a2    b2     21-12-2021
..    ..     ...

So row 1 and row 2 will go to a CSV called 20_12_2021.csv, and row 3 will go to 21_12_2021.csv. What I am doing right now is using Python to fetch the results and then looping over them row by row to determine which file each row should go to, but it is taking a huge amount of time.

Is there a better, more optimized way to do this?

Atif
  • How are you reading the tables from `MySQL` into pandas, or have you already dumped the data into a CSV and are now just parsing it? There is a nice SO post: [MySQL : retrieve a large select by chunks](https://stackoverflow.com/questions/34180448/mysql-retrieve-a-large-select-by-chunks) – Karn Kumar Aug 20 '21 at 07:10
  • Right now, I am bringing in the results in chunks of 1000 rows and iterating over them. – Atif Aug 20 '21 at 08:22
  • 1
    you should post your code if it doesn't have sensitive information, to get right advice or solution as i see even using `chunksize ` doesn't help to reduce memory, i've mention some references you may check them. – Karn Kumar Aug 20 '21 at 08:39
  • 1
    @KarnKumar , the codes are on my office laptop and i can't copy paste. I will try to create a sample code . Allow me sometime. – Atif Aug 20 '21 at 11:50
  • Atif, I understood. – Karn Kumar Aug 20 '21 at 11:51

1 Answer


I like the solution where you use df.groupby on the date_col column, which groups the DataFrame using a mapper or by a Series of columns along axis 0 (the default), i.e. by rows.

Solution:

# write one CSV per unique value in date_col
for name, g in df.groupby('date_col'):
    g.to_csv(f'{name}.csv', index=False)

Result:

$ cat 20-12-2021.csv  <-- CSV file created

col_1,col_2,date_col
a,b,20-12-2021
a,b,20-12-2021

$ cat 21-12-2021.csv   <-- CSV file created

col_1,col_2,date_col
a2,b2,21-12-2021

OR

df.groupby('date_col').apply(lambda x: x.to_csv('{}.csv'.format(x.name), index=False))

Note: if you are curious about reading millions of rows from MySQL efficiently, there is a nice answer here: Pandas now has built-in support for chunked loading, by @ThePhysicist.

Also look at Reading table with chunksize still pumps the memory.
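If the full table does not fit in memory, the chunked loading mentioned above can be combined with the same groupby split. Below is a minimal sketch of that combination, assuming a SQLAlchemy engine; the connection string, query, and chunk size are placeholders, not from the question:

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string -- replace with your own credentials
engine = create_engine('mysql+pymysql://user:password@host/db_name')
query = 'SELECT col_1, col_2, date_col FROM my_table'

seen_dates = set()  # dates whose CSV already has a header row

# stream the table in chunks instead of loading it all at once
for chunk in pd.read_sql(query, engine, chunksize=100_000):
    for name, g in chunk.groupby('date_col'):
        first_time = name not in seen_dates
        # append each group to its date's CSV, writing the header only once
        g.to_csv(f'{name}.csv', mode='a', header=first_time, index=False)
        seen_dates.add(name)

Appending per chunk keeps memory bounded by the chunk size rather than the full table, at the cost of reopening the CSV files repeatedly.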

Karn Kumar