
I have a huge MySQL table with millions of records. Based on a date column, I want to split the results and dump them into different text files.

For example:

col_1 col_2 date_col
a     b      20-12-2021 
a     b      20-12-2021 
a2    b2     21-12-2021
..    ..     ...

So row 1 and row 2 will go to a CSV called 20_12_2021.csv, and row 3 will go to 21_12_2021.csv. What I am doing right now is using Python to fetch the results and then looping over them row by row to determine which file each row should go to, but it is taking a huge amount of time.

Is there a better, more optimized way to do this?

Atif
  • How are you reading the tables from `MySQL` into pandas, or have you already dumped the data into a CSV and are now just parsing it? There is a nice SO post: [MySQL : retrieve a large select by chunks](https://stackoverflow.com/questions/34180448/mysql-retrieve-a-large-select-by-chunks) – Karn Kumar Aug 20 '21 at 07:10
  • Right now, I am bringing in the results in chunks of 1000 rows and iterating over them. – Atif Aug 20 '21 at 08:22
  • 1
    you should post your code if it doesn't have sensitive information, to get right advice or solution as i see even using `chunksize ` doesn't help to reduce memory, i've mention some references you may check them. – Karn Kumar Aug 20 '21 at 08:39
  • 1
    @KarnKumar , the codes are on my office laptop and i can't copy paste. I will try to create a sample code . Allow me sometime. – Atif Aug 20 '21 at 11:50
  • Atif, I understood. – Karn Kumar Aug 20 '21 at 11:51

1 Answer


I like the solution where you use df.groupby on the date_col column, which groups the DataFrame using a mapper or by a Series of columns along axis 0 (the default), i.e. by rows.

Solution:

# write one CSV per unique value in date_col
for name, g in df.groupby('date_col'):
    g.to_csv(f'{name}.csv', index=False)

Result:

$ cat 20-12-2021.csv  <-- CSV file created

col_1,col_2,date_col
a,b,20-12-2021
a,b,20-12-2021

$ cat 21-12-2021.csv   <-- CSV file created

col_1,col_2,date_col
a2,b2,21-12-2021

OR

df.groupby('date_col').apply(lambda x: x.to_csv('{}.csv'.format(x.name), index=False))

Note: if you are curious about reading millions of rows from MySQL efficiently, there is a nice answer here: Pandas now has built-in support for chunked loading, by @ThePhysicist.

Also look at Reading table with chunksize still pumps the memory.
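If the full table does not fit in memory, the chunked loading mentioned above can be combined with the same groupby split. Below is a minimal sketch of that combination, assuming a SQLAlchemy engine; the connection string, query, and chunk size are placeholders, not from the question:

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string -- replace with your own credentials
engine = create_engine('mysql+pymysql://user:password@host/db_name')
query = 'SELECT col_1, col_2, date_col FROM my_table'

seen_dates = set()  # dates whose CSV already has a header row

# stream the table in chunks instead of loading it all at once
for chunk in pd.read_sql(query, engine, chunksize=100_000):
    for name, g in chunk.groupby('date_col'):
        first_time = name not in seen_dates
        # append each group to its date's CSV, writing the header only once
        g.to_csv(f'{name}.csv', mode='a', header=first_time, index=False)
        seen_dates.add(name)

Appending per chunk keeps memory bounded by the chunk size rather than the full table, at the cost of reopening the CSV files repeatedly.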

Karn Kumar