Python: how can I read and process an 18 GB CSV file?

Question

I have an 18 GB CSV file of measurement data and want to do some calculations on it. I tried pandas, but it seems to take forever just to read the file.

Here is the code I used:

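import math as mt

import numpy as np
import pandas as pd

# Read only the two measurement columns, in chunks of 1,000,000 rows
df=pd.read_csv('/Users/gaoyingqiang/Desktop/D989_Leistung.csv',usecols=[1,2],sep=';',encoding='gbk',iterator=True,chunksize=1000000)
# Concatenate all chunks into a single DataFrame
df=pd.concat(df,ignore_index=True)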

U1=df['Kanal 1-1 [V]']
I1=df['Kanal 1-2 [V]']

# Phase shift in degrees for each window of 333340 samples
c=[]
for num in range(0,16333660,333340):
    lu=sum(U1[num:num+333340]*U1[num:num+333340])/333340    # mean of U^2
    li=sum(I1[num:num+333340]*I1[num:num+333340])/333340    # mean of I^2
    lui=sum(I1[num:num+333340]*U1[num:num+333340])/333340   # mean of U*I
    c.append(180*mt.acos(2*lui/mt.sqrt(4*lu*li))/np.pi)

phase=pd.DataFrame(c)
phase.to_excel('/Users/gaoyingqiang/Desktop/Phaseverschiebung_1.xlsx',sheet_name='Sheet1')

Is there any way to speed up the process?

Does it work with a smaller file? – Jean-François Fabre Aug 01 '17 at 07:56
I tried with a 2 GB file, but the same error occurs. – Yingqiang Gao Aug 01 '17 at 08:00

score 3 · Accepted Answer · answered Aug 01 '17 at 07:56

`df` is a `TextFileReader`, not a `DataFrame`, so you need `concat`:

df = pd.concat(df, ignore_index=True)

Sample:

import pandas as pd
# pandas.compat.StringIO is not available in newer pandas versions
from io import StringIO

temp=u"""id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213"""
df = pd.read_csv(StringIO(temp), chunksize=3)
print (df)
<pandas.io.parsers.TextFileReader object at 0x000000000D6E2EF0>

df = pd.concat(df, ignore_index=True)
print (df)
    id  col1  col2  col3
0    1    13    15    14
1    1    13    15    14
2    1    12    15    13
3    2    18    15    13
4    2    18    15    13
5    2    18    15    13
6    2    18    15    13
7    2    18    15    13
8    2    18    15    13
9    3    14    15    13
10   3    14    15    13
11   3    14   185   213
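
If building the full concatenated frame is too slow or too large, one option is to consume the reader chunk by chunk and keep only running aggregates. The following is a minimal sketch under that assumption, reusing the sample data above; `chunked_column_sums` is a hypothetical helper name, not a pandas function, and whether this fits depends on the calculation being expressible as per-chunk aggregates.

```python
from io import StringIO

import pandas as pd

temp = """id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213"""


def chunked_column_sums(source, columns, chunksize):
    """Sum the given columns chunk by chunk, never holding the whole file.

    Hypothetical helper for illustration only.
    """
    totals = pd.Series(0.0, index=columns)
    n_rows = 0
    # With chunksize set, read_csv returns an iterator of DataFrames
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        totals = totals + chunk[columns].sum()
        n_rows += len(chunk)
    return totals, n_rows


totals, n_rows = chunked_column_sums(StringIO(temp), ["col1", "col2"], chunksize=3)
print(totals)
print(n_rows)
```

Only one chunk is in memory at a time, so peak memory is governed by `chunksize` rather than by the file size.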

answered by jezrael

Thanks, it works, but it still takes forever to read the file... Is there any way to make the process quicker? – Yingqiang Gao Aug 01 '17 at 08:22
If the file is really big, that is the problem. An alternative for working with big files, such as `dask`, may help. – jezrael Aug 01 '17 at 08:25
Why not use Spark for this? – Dat Tran Aug 01 '17 at 08:38
@DatTran - Yes, that would be an alternative. I believe there are more possible solutions for working with big files. – jezrael Aug 01 '17 at 08:41
@YingqiangGao - Thank you for accepting. Now I am a bit confused: does everything work perfectly? If you create a big `df` after concat and then use loops, that is very slow in pandas. Is it possible to create a new question about processing the data, with sample data, your code and the desired output? I believe you will get some nice solutions in `numpy` or `pandas`. Thanks. – jezrael Aug 01 '17 at 08:48
Well, I can only create a new question after 90 min, but I'm in a hurry to solve the problem... so I just edited the former question. – Yingqiang Gao Aug 01 '17 at 08:50
Hmmm, I understand. It may help to create a small data sample as a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) with the desired output, so that a solution is easier to verify and can then be applied to the big df. [How to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) may also help. – jezrael Aug 01 '17 at 08:58
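
On the point raised in the comments that Python loops over a big `df` are slow: the per-window calculation from the question (mean of U², mean of I², mean of U·I, then the arccosine of their ratio) can be expressed with NumPy array operations. The sketch below is only an illustration under stated assumptions: `phase_shift_degrees` is a hypothetical name, a trailing partial window is dropped, the ratio is clipped to [-1, 1] to guard against rounding, and the demo signals are synthetic sine waves rather than the asker's data.

```python
import numpy as np


def phase_shift_degrees(u, i, window):
    """Phase shift in degrees between u and i for each full window.

    Hypothetical helper; mirrors the formula in the question, where
    2*lui/sqrt(4*lu*li) simplifies to lui/sqrt(lu*li).
    """
    u = np.asarray(u, dtype=float)
    i = np.asarray(i, dtype=float)
    n_windows = min(len(u), len(i)) // window
    # Assumption: samples beyond the last full window are ignored
    u = u[: n_windows * window].reshape(n_windows, window)
    i = i[: n_windows * window].reshape(n_windows, window)
    mean_uu = (u * u).mean(axis=1)
    mean_ii = (i * i).mean(axis=1)
    mean_ui = (u * i).mean(axis=1)
    # Clip so that rounding error cannot push the ratio outside arccos's domain
    ratio = np.clip(mean_ui / np.sqrt(mean_uu * mean_ii), -1.0, 1.0)
    return np.degrees(np.arccos(ratio))


# Synthetic demo: three windows of 1000 samples, 10 full periods per window,
# with the second signal lagging the first by 30 degrees
window = 1000
t = 2 * np.pi * 10 * np.arange(3 * window) / window
voltage = np.sin(t)
current = np.sin(t - np.pi / 6)
print(phase_shift_degrees(voltage, current, window))  # each entry is close to 30
```

With pandas columns, the same function could be called as `phase_shift_degrees(df['Kanal 1-1 [V]'].to_numpy(), df['Kanal 1-2 [V]'].to_numpy(), 333340)`, or applied to one chunk at a time if `chunksize` is set to the window length.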