1

I'm coding a little prog to compare two csv files over python 2.7.

I use the csv module and doing something like:

for row in file1:
"looking if row in file1 is in file2"

The prog works good but is very slow, because it has to check row by row if a string is in both files. To go from the first to the last row is maybe 2 minutes, but if I search a string using Notepad or any find tool in any text edit program, it finds immediately.

Why does this occur?
Is there any method or any module to implement a quick search in a file?

import csv
import os

f1 = file('tarifa_ek_tot.csv', 'r') #general
f2 = file('ek.csv', 'r') #filtro
f3 = file('sort_ek_tar.csv', 'w') #archivo salida
f4 = file('informe.csv', 'w') #archivo informe salida

#configuracion
campo_clave = 0 #campo clave
#campo_comp = 5  #campo a comparar
modo_prueba = True
fila = True
num_campos = 4
num_filas_filtro = 601
num_filas_general = 5175
ultima_posicion_encontrada = 0

#cond = 1 #copiar si ha subido
cond = 2 #copiar si ha cambiado
#cond = 3 #copiar si ha bajado

#codi,desc,tar,dte

#declaracion archivos
general = csv.reader(f1)
filtro_nostre = csv.reader(f2)
archivo_salida = csv.writer(f3)
salida_informe = csv.writer(f4)

#variables
filtro = list(filtro_nostre)
list_general = list(general)
fila_filtro = 1
found = False
num_comp_filtro= 0
encontrado = 0
num_coincidencias = 0
num_variaciones = 0
n = 0
num_no_encontrados = 0

#por cada fila en el archivo filtro
for row_filtro in filtro:
    filtro_nostre = csv.reader(f2)
    filtro = list(filtro_nostre)
    num_comp_filtro = num_comp_filtro + 1


    num_comp_general = 0


    #for index,row_general in enumerate(general):
    for i in range(n,num_filas_general):
    #for row_general in range(n,num_filas_general):

        os.system ("cls")
        print "comparando reg general num: "+str(num_comp_general)+" n="+str(n)+" reg filtro a comparar:"+str(num_filas_filtro)
        print " num reg comprobados: "+str(num_comp_filtro)+" num reg coincidentes: "+str(num_coincidencias)
        print "ultima posicion encontrada: "+str(ultima_posicion_encontrada)
        print "comprobando registro:"+str(row_filtro[campo_clave])#+" cod_gen:"+str(list_general[campo_clave])
        print "numero registros no encontrado:"+str(num_no_encontrados)     

        #print "comparem general:"+str(row_general[campo_clave])+" amb filtre:"+str(row_filtro[campo_clave])
        #print "index: "+str(index)#+" num fila: "+str(row_general)


        if list_general[n][campo_clave] == row_filtro[campo_clave]:
            #print "comparem:"+str(row_general[campo_clave])+" amb:"+str(row_filtro[campo_clave])
            num_coincidencias = num_coincidencias + 1       
            i = 0
            fila_copiar = ""
            while i < num_campos:
            #while i < 1:
                print str(i)
                if i == 0:
                    fila_copiar = list_general[n][i]
                    #print str(fila_copiar)
                else:               
                    fila_copiar = fila_copiar+","+list_general[n][i]
                    #print str(fila_copiar)
                i = i + 1


            print "fila a copiar: "+str(fila_copiar)
            archivo_salida.writerow([fila_copiar])
            encontrado = 1
            ultima_posicion_encontrada = n
            break   #salimos del if si lo encuentra
        else:
            encontrado = 0


        num_comp_general = num_comp_general+1            

        n = n + 1 
        #ultima_posicion = n#guardema la ultima posicio     

    if encontrado <> 1 and n == num_filas_general:
        n = ultima_posicion_encontrada
        num_no_encontrados = num_no_encontrados + 1
        #copiamos el campo clave del registro no encontrado del filtro
        codigo_no_encontrado = str(row_filtro[campo_clave])
        salida_informe.writerow(codigo_no_encontrado)

    print ""    
    print "*****************informe resultados *******************************"  
    print "n: "+str(n)
    print "numero registros comparados filtro: "+str(num_comp_filtro)
    print "numero registros coincidentes: "+str(num_coincidencias)     
    print "numero registros no encontrados: " +str(num_no_encontrados)
    print "archivo de salida: "+str(f3.name)
    print "archivo de informe: "+str(f4.name)

    fila_filtro = fila_filtro + 1
    fila_general = 1
    #print 'fila_general:'+str(fila_general)+'  fila_filtro:'+str(fila_filtro)    

#mensajes de alerta
if num_coincidencias < num_comp_filtro:
    print "atencion: algunas filas no se encontraron !!!!!!!!!!!!!!!!!!!!!"

f1.close()
f2.close()
f3.close()
f4.close()

Edit:
I see now that my question is not adequate.

I have been trying a simple algo to find a string in a big text file, it has to go line by line to find the string, but if I use notepad or the windows simple text editor, it finds the string immediately.

Which kind of algo uses this programs?

Is because they use the c++ language or is because use original windows dll's? or is because python is an interpretated language?

This is what I really want to know

GitaarLAB
  • 14,536
  • 11
  • 60
  • 80
  • hello, and thanks to freek, the link you showed gives me another way to solve the issue, I modified my algo using a find method as the file was a string, but I think:THE PROBLE WAS , and I think this was the cause of the slow running. – Francesc Furroy Jun 10 '15 at 14:01

1 Answers1

0

Your algo is O(N^2) because it iterated the lines in file1 and for each line iterates the lines in file2 to find a match.

If you speed up the search operation (in your post: "looking if row in file1 is in file2") to O(log(N)) your overall complexity would be O(N*log(N)).

I suggest this post as a starting point: Most efficient way for a lookup/search in a huge list (python)

The alternative is to sort the files before the line matching. Finding matches in two sorted files would be O(N). But a good sort algo is O(N*log(N)) itself, so a fast search would save just as much.

Community
  • 1
  • 1
Freek Wiekmeijer
  • 4,556
  • 30
  • 37