Pandas + Python: More efficient code

Question

This is my code:

import pandas as pd
import os    
import glob as g


archivos = g.glob('C:\Users\Desktop\*.csv')

for archiv in archivos:
    nombre = os.path.splitext(archiv)[0] 
    df = pd.read_csv(archiv, sep=",")
    d = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
    df['FECHA_LECTURA'] = d.dt.date
    del df['DATA_LEITURA']
    df['CONSUMO']=""
    df['DIAS']=""
    df["SUMDIAS"]=""
    df["SUMCONS"]=""
    df["CONSANUAL"] = ""            

    ordenado = df.sort_values(['NR_CPE','FECHA_LECTURA', 'HORA_LEITURA'], ascending=True)

    ##Agrupamos por el CPE
    agrupado = ordenado.groupby('NR_CPE')  

    for name, group in agrupado: #Recorremos el grupo    
          indice = group.index.values      
          inicio = indice[0] 
          fin = indice[-1]       
#Llenamos la primeras lectura de cada CPE, con esa lectura (porque no hay una lectura anterior)      
          ordenado.CONSUMO.loc[inicio] = 0
          ordenado.DIAS.loc[inicio] = 0
          cont=0
          for i in indice: #Recorremos lo que hay dentro de los grupos, dentro de los CPES(lecturas)
              if i > inicio and i <= fin :
                  cont=cont+1
                  consumo = ordenado.VALOR_LEITURA[indice[cont]] - ordenado.VALOR_LEITURA[indice[cont-1]]
                  dias = (ordenado.FECHA_LECTURA[indice[cont]] - ordenado.FECHA_LECTURA[indice[cont-1]]).days
                  ordenado.CONSUMO.loc[i] = consumo
                  ordenado.DIAS.loc[i] = dias   

# Hago las sumatorias, el resultado es un objeto DataFrame                 
    dias = agrupado['DIAS'].sum()  
    consu = agrupado['CONSUMO'].sum()    
    canu = (consu/dias) * 365  
#Contador con el numero de courrencias de los campos A,B y C
    conta=0
    contb=0
    contc=0  

#Como es un DF, para recorrerlo tengo que iterar sobre ellos para hacer la comparacion     
    print "Grupos:"    
    for ind, sumdias in dias.iteritems():
        if sumdias <= 180:
            grupo = "A"
            conta=conta+1
        elif sumdias > 180 and sumdias <= 365:
            grupo = "B"
            contb=contb+1
        elif sumdias > 365:
            grupo = "C"
            contc=contc+1


    print "grupo A: " , conta
    print "grupo B: " , contb
    print "grupo C: " , contc

    #Formateamos los campos para no mostrar todos los decimales    
    Fdias = dias.map('{:.0f}'.format)
    Fcanu = canu.map('{:.2f}'.format)

    frames = [Fdias, consu, Fcanu]
    concat = pd.concat(frames,axis=1).replace(['inf','nan'],[0,0])

    with open('C:\Users\Documents\RPE_PORTUGAL\Datos.csv','a') as f:
        concat.to_csv(f,header=False,columns=['CPE','DIAS','CONSUMO','CONSUMO_ANUAL'])

    try:         
        ordenado.to_excel(nombre+'.xls', columns=["NOME_DISTRITO",
        "NR_CPE","MARCA_EQUIPAMENTO","NR_EQUIPAMENTO","VALOR_LEITURA","REGISTADOR","TIPO_REGISTADOR",
        "TIPO_DADOS_RECOLHIDOS","FACTOR_MULTIPLICATIVO_FINAL","NR_DIGITOS_INTEIRO","UNIDADE_MEDIDA",
        "TIPO_LEITURA","MOTIVO_LEITURA","ESTADO_LEITURA","HORA_LEITURA","FECHA_LECTURA","CONSUMO","DIAS"],
        index=False) 
        print (archiv)        
        print ("===============================================")
        print ("*****Se ha creado el archivo correctamente*****")
        print ("===============================================")
    except IOError: 
        print ("===================================================")
        print ("¡¡¡¡¡Hubo un error en la escritura del archivo!!!!!")
        print ("===================================================")

This takes a file where I have lectures of energy consumption from different dates for every light meter('NR_CPE') and do some calculations:

Calculate the energy consumption for every 'NR_CPE' by substracting the previous reading with the next one and the result put in a new column named 'CONSUMO'.
Calculate the number of days where I'v got a reading and sum up the number of days
Add the consumption for every 'NR_CPE' and calculate the anual consumption.
Finally I want to classify by number of days that every light meter('NR_CPE') has a lecture. A if it has less than 180 days, B between 180 and 1 year and C more than a year.

And finally write this result in two differents files. Any idea of how should I re-code this to have the same output and be faster? Thank you all.

BTW this is my dataset:

,NOME_DISTRITO,NR_CPE,MARCA_EQUIPAMENTO,NR_EQUIPAMENTO,VALOR_LEITURA,REGISTADOR,TIPO_REGISTADOR,TIPO_DADOS_RECOLHIDOS,FACTOR_MULTIPLICATIVO_FINAL,NR_DIGITOS_INTEIRO,UNIDADE_MEDIDA,TIPO_LEITURA,MOTIVO_LEITURA,ESTADO_LEITURA,DATA_LEITURA,HORA_LEITURA
0,GUARDA,A002000642VW,101,1865411,4834,001,S,1,1,4,kWh,1,1,A,20150629,205600
1,GUARDA,A002000642VW,101,1865411,4834,001,S,1,1,4,kWh,2,2,A,20160218,123300
2,GUARDA,A002000642VJ,122,204534,25083,001,S,1,1,5,kWh,1,1,A,20150629,205700
3,GUARDA,A002000642VJ,122,204534,27536,001,S,1,1,5,kWh,2,2,A,20160218,123200
4,GUARDA,A002000642HR,101,1383899,11734,001,S,1,1,5,kWh,1,1,A,20150629,205600
5,GUARDA,A002000642HR,101,1383899,11800,001,S,1,1,5,kWh,2,2,A,20160218,123000
6,GUARDA,A002000995VM,101,97706436,12158,001,S,1,1,5,kWh,1,3,A,20150713,155300
7,GUARDA,A002000995VM,101,97706436,12163,001,S,1,1,5,kWh,2,2,A,20160129,162300
8,GUARDA,A002000995VM,101,97706436,12163,001,S,1,1,5,kWh,2,2,A,20160202,195800
9,GUARDA,A2000995VM,101,97706436,12163,001,S,1,1,5,kWh,1,3,A,20160404,145200
10,GUARDA,A002000996LV,168,5011703276,3567,001,V,1,1,6,kWh,1,1,A,20150528,205900
11,GUARDA,A02000996LV,168,5011703276,3697,001,V,1,1,6,kWh,2,2,A,20150929,163500
12,GUARDA,A02000996LV,168,5011703276,1287,002,P,1,1,6,kWh,1,1,A,20150528,205900

Seems like this is a question for [Code Review](http://codereview.stackexchange.com) and not for Stackoverflow. — nostradamus, Sep 14 '16 at 13:10
By the way, a very common language in software development is english, especially when you post your code to some platform like SO. Most people here will not understand the comments in your code for example, but it's more general a good practice to write code including comments and variable/function/class names in english. — linusg, Sep 14 '16 at 13:14

score 1 · Answer 1 · edited May 23 '17 at 10:27

Generally you want to avoid for loops in pandas. For example, the first loop where you calculate total consumption and days could be rewritten as a groupby apply something like:

def last_minus_first(df):
    columns_of_interest = df[['VALOR_LEITURA', 'days']]
    diff = columns_of_interest.iloc[-1] - columns_of_interest.iloc[0]
    return diff
df['date'] = pd.to_datetime(df['DATA_LEITURA'], format="%Y%m%d")
df['days'] = (df['date'] - pd.datetime(1970,1,1)).dt.days # create days column
df.groupby('NR_CPE').apply(last_minus_first)

(btw I don't understand why you are subtracting each entry from the previous, surely for meter readings this is the same as last-first?)

Then given the result of the above as consumption, you can replace your second for loop (for ind, sumdias in dias.iteritems()) with something like:

pd.cut(consumption.days, [-1, 180, 365, np.inf], labels=['a', 'b', 'c']).value_counts()

Thanks @danio . I'm substracting each entry from the previous cuz as you can see in the data set, there is a meter-reading called `(VALOR_LEITURA)` for each date so the consumption is the result of substracting the last reading with the actual reading. Eg: the consumption for 1 month for `A002000642VW` is the consumption (`VALOR_LEITURA`) for dates 2015/06/29 and 2016/02/18. — Juliana Rivera, Sep 15 '16 at 07:36

Pandas + Python: More efficient code

1 Answers1