For loop with dataframe takes too long

Question

I am trying to optimize a for loop because it takes too long (about 1.30 minutes). I have tried mapply without success. The loop compares a column in each of two data frames and corrects the data where the values match. The big data frame (datapmi) has about 200000 rows; the correction data frame has about 400. My approach is to loop over the small one, find the positions of each of its elements in the big data frame, and substitute the corrected values there. Any help would be much appreciated. Thanks in advance.

  for (j in 1:nrow(correccionpmis)) {
    # rows of datapmi whose key equals the j-th correction key
    corr <- which(datapmi$PMIGTDTipo == correccionpmis$PMIGTDTipo[j])
    datapmi$PMI[corr] <- correccionpmis$PMIcorregido[j]
    datapmi$Inspection_Procedure[corr] <- correccionpmis$Tipocorregido[j]
  }

correccionpmis is the correction dataframe with correct names and datapmi is the original dataframe.

They look like this (simplified):

> head(datapmi)
                         PMI Inspection_Procedure                                                            PMIGTDTipo
1          Presión Autoclave       Single Numeric Presión Autoclave Presión del autoclave tras 7 minutos Single Numeric
2       Tiempo Presurización       Single Numeric              Tiempo Presurización Tiempo hasta 6 bares Single Numeric
3                Videoscopio              Boolean        Videoscopio Control videoscopio OK s/ficha de producto Boolean
4        Hora Arranque Ciclo       Single Numeric                      Hora Arranque Ciclo Hora de ciclo Single Numeric
5 Tiempo de despresurización       Single Numeric  Tiempo de despresurización Tiempo de despresurización Single Numeric
6                 Peso Molde       Single Numeric                              Peso Molde Peso del Molde Single Numeric

 
> head(correccionpmis)
                                          PMIcorregido
1  Aceptabilidad Geometría Interna del Vano 2 con P/NP
4                               (2) DROSS TEST INICIAL
10                          Aceptabilidad de Inclusión
16                            Aceptabilidad de Rechupe
17                  Adjuntar las placas en SAP (SI/NO)
19                TIEMPO HASTA 1450ºC (PRUEBA PIRO TC)
                                                                                                                                                                                           PMITipo
1                                                                          Aceptabilidad Geometría Interna del Vano 2 con P/NP Indicar Aceptabilidad Geometría Interna del Vano 2 con P/NP Boolean
4                                                                                                          (2) TEST ESCORIAS INICIAL (2) Anotar Valor Dross Test (ANTES DE AFINADO) Single Numeric
10                                                                                                                            Aceptabilidad de inclusiones Indicar Aceptabilidad Inclusión Boolean
16                                                                                                                                     Aceptabilidad Rechupe Indicar Aceptabilidad Rechupe Boolean
17                                                                                                            Adjuntar las placas en SAP (SI/NO) Adjuntar las placas en SAP (SI/NO) Single Numeric
19 BACKUP CALIENTE TIEMPO HASTA 1450ºC (PRUEBA PIRO TC) BACKUP CALIENTE *SI SE HACE PRUEBA Piro-Tc Tiempo desde inicio de fusión hasta que el caldo alcance los 1450ºC (en minutos) Single Numeric
                                                                                                                                                                                        PMIGTDTipo
1                                                                          Aceptabilidad Geometría Interna del Vano 2 con P/NP Indicar Aceptabilidad Geometría Interna del Vano 2 con P/NP Boolean
4                                                                                                          (2) TEST ESCORIAS INICIAL (2) Anotar Valor Dross Test (ANTES DE AFINADO) Single Numeric
10                                                                                                                            Aceptabilidad de inclusiones Indicar Aceptabilidad Inclusión Boolean
16                                                                                                                                     Aceptabilidad Rechupe Indicar Aceptabilidad Rechupe Boolean
17                                                                                                            Adjuntar las placas en SAP (SI/NO) Adjuntar las placas en SAP (SI/NO) Single Numeric
19 BACKUP CALIENTE TIEMPO HASTA 1450ºC (PRUEBA PIRO TC) BACKUP CALIENTE *SI SE HACE PRUEBA Piro-Tc Tiempo desde inicio de fusión hasta que el caldo alcance los 1450ºC (en minutos) Single Numeric

This is what I have tried as an mapply-style alternative (using sapply), but it doesn't work, and it also takes very long:

mi_fun <- function(i) {
   corr<-which(datapmi$PMIGTDTipo==correccionpmis$PMIGTDTipo[i])
  datapmi$PMI[corr]<-correccionpmis$PMIcorregido[i]
  datapmi$Inspection_Procedure[corr]<-correccionpmis$Tipocorregido[i]
  print(difftime(Sys.time(),t0))
}

sapply(1:nrow(correccionpmis), mi_fun)

I am adding the dput output, as suggested by Jon Spring in the comments:

> dput(head(correccionpmis))
structure(list(Linea = c("LINEA NGVS", "FUNDICION", "LINEA NGVS", 
"LINEA NGVS", "RX DIGITAL", "FUNDICION"), PMI = c(" Aceptabilidad Geometría Interna del Vano 2 con P/NP", 
"(2) TEST ESCORIAS INICIAL", "Aceptabilidad de inclusiones", 
"Aceptabilidad Rechupe", "Adjuntar las placas en SAP (SI/NO)", 
"BACKUP CALIENTE TIEMPO HASTA 1450ºC (PRUEBA PIRO TC)"), GTD = c("Indicar Aceptabilidad Geometría Interna del Vano 2 con P/NP", 
"(2) Anotar Valor Dross Test (ANTES DE AFINADO)", "Indicar Aceptabilidad Inclusión", 
"Indicar Aceptabilidad Rechupe", "Adjuntar las placas en SAP (SI/NO)", 
"BACKUP CALIENTE *SI SE HACE PRUEBA Piro-Tc Tiempo desde inicio de fusión hasta que el caldo alcance los 1450ºC (en minutos)"
), Tipo = c("Boolean", "Single Numeric", "Boolean", "Boolean", 
"Single Numeric", "Single Numeric"), PMIcorregido = c("Aceptabilidad Geometría Interna del Vano 2 con P/NP", 
"(2) DROSS TEST INICIAL", "Aceptabilidad de Inclusión", "Aceptabilidad de Rechupe", 
"Adjuntar las placas en SAP (SI/NO)", "TIEMPO HASTA 1450ºC (PRUEBA PIRO TC)"
), Tipocorregido = c("Boolean", "Single Numeric", "Boolean", 
"Boolean", "Boolean", "Single Numeric"), PMIGTDTipo = c(" Aceptabilidad Geometría Interna del Vano 2 con P/NP Indicar Aceptabilidad Geometría Interna del Vano 2 con P/NP Boolean", 
"(2) TEST ESCORIAS INICIAL (2) Anotar Valor Dross Test (ANTES DE AFINADO) Single Numeric", 
"Aceptabilidad de inclusiones Indicar Aceptabilidad Inclusión Boolean", 
"Aceptabilidad Rechupe Indicar Aceptabilidad Rechupe Boolean", 
"Adjuntar las placas en SAP (SI/NO) Adjuntar las placas en SAP (SI/NO) Single Numeric", 
"BACKUP CALIENTE TIEMPO HASTA 1450ºC (PRUEBA PIRO TC) BACKUP CALIENTE *SI SE HACE PRUEBA Piro-Tc Tiempo desde inicio de fusión hasta que el caldo alcance los 1450ºC (en minutos) Single Numeric"
)), row.names = c(1L, 4L, 10L, 16L, 17L, 19L), class = "data.frame")

> dput(head(datapmi))
structure(list(Linea = c("PREPARACION DE MOLDE", "PREPARACION DE MOLDE", 
"PREPARACION DE MOLDE", "PREPARACION DE MOLDE", "PREPARACION DE MOLDE", 
"CERAMICAS"), Nodo = c("MOLDE_PREPARACION DE MOLDE", "MOLDE_PREPARACION DE MOLDE", 
"MOLDE_PREPARACION DE MOLDE", "MOLDE_PREPARACION DE MOLDE", "MOLDE_PREPARACION DE MOLDE", 
"MOLDE_CERAMICAS"), PuestoID = c("PP009C", "PP009C", "PP009C", 
"PP009C", "PP009C", "PP058C"), Puesto = c("Descerado + Quemado", 
"Descerado + Quemado", "Descerado + Quemado", "Descerado + Quemado", 
"Descerado + Quemado", "Baños ceramicos Agua 2 - R4"), PartNumber = c("11013340", 
"11013340", "11013340", "11013340", "11013340", "11013340"), 
    SerialNumber = c("16817", "16817", "16817", "16817", "16817", 
    "16895"), OrdenFab = c("226489169", "226489169", "226489169", 
    "226489169", "226489169", "226489725"), OP = c(600L, 600L, 
    600L, 600L, 600L, 500L), OF_OP = c("226489169_600", "226489169_600", 
    "226489169_600", "226489169_600", "226489169_600", "226489725_500"
    ), FechaFin_OP = structure(c(NA, NA, NA, NA, NA, 1591602784
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Status_OP = c("Partial", 
    "Partial", "Partial", "Partial", "Partial", "Complete"), 
    Zone = c("", "", "", "", "", ""), PMI = c("Presión Autoclave", 
    "Tiempo Presurización", "Videoscopio", "Hora Arranque Ciclo", 
    "Tiempo de despresurización", "Peso Molde"), GTD = c("Presión del autoclave tras 7 minutos", 
    "Tiempo hasta 6 bares", "Control videoscopio OK s/ficha de producto", 
    "Hora de ciclo", "Tiempo de despresurización", "Peso del Molde"
    ), Dimension_Type = c("", "", "", "", "", ""), Metodo_Medicion = c("Display microtol", 
    "Cronómetro", "Videoscopio", "Reloj", "Cronómetro", "Báscula"
    ), Frecuencia = c("100%", "100%", "100%", "100%", "100%", 
    "100%"), PMI_Descripcion = c("Presión Autoclave Presión del autoclave tras 7 minutos-Nominal:9.750000 Lim Sup:10.000000 Lim Inf:8.500000", 
    "Tiempo Presurización Tiempo hasta 6 bares-Nominal:4.000000 Lim Sup:4.400000 Lim Inf:0.000000", 
    "Videoscopio Control videoscopio OK s/ficha de producto", 
    "Hora Arranque Ciclo Hora de ciclo-Nominal:0.000000 Lim Sup:2400.000000 Lim Inf:0.000000", 
    "Tiempo de despresurización Tiempo de despresurización-Nominal:3.000000 Lim Sup:8.000000 Lim Inf:0.000000", 
    "Peso Molde Peso del Molde-Nominal:59.900000 Lim Sup:64.400000 Lim Inf:55.400000"
    ), PMI_Rango = c("9.750 (8.500-10.00)", "4.000 (0.000-4.400)", 
    "", "0.000 (0.000- 2400)", "3.000 (0.000-8.000)", "59.90 (55.40-64.40)"
    ), Inspection_Procedure = c("Single Numeric", "Single Numeric", 
    "Boolean", "Single Numeric", "Single Numeric", "Single Numeric"
    ), Criticidad = c("MINOR", "MINOR", "MINOR", "MINOR", "MINOR", 
    "MINOR"), LimInf = c(8.5, 0, 0, 0, 0, 55.4000015258789), 
    LimSup = c(10, 4.40000009536743, 0, 2400, 8, 64.4000015258789
    ), Nominal = c(9.75, 4, 0, 0, 3, 59.9000015258789), Es_Obligatoria = c("1", 
    "1", "1", "1", "1", "1"), Valor = c("9,5", "3,61", "1", "10", 
    "7", "63,6"), Estatus_PMI = c("Complete", "Complete", "Complete", 
    "Complete", "Complete", "Complete"), Usuario = c("82300", 
    "82300", "82300", "82300", "82300", "82127"), Comentarios = c("", 
    "", "", "", "", ""), Fecha_Registro = structure(c(1591269448, 
    1591269267, 1591269303, 1591269285, 1591269458, 1591595580
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Responsable = c("Operator", 
    "Operator", "Operator", "Operator", "Operator", "Operator"
    ), NCP_ID = c("NCP_2020_9881", "NCP_2020_9881", "NCP_2020_9881", 
    "NCP_2020_9881", "NCP_2020_9881", NA), GrupoDefecto = c("Genérico", 
    "Genérico", "Genérico", "Genérico", "Genérico", NA), Defecto = c("Incorrecto", 
    "Incorrecto", "Incorrecto", "Incorrecto", "Incorrecto", NA
    ), NCP_status = c("APROBADO", "APROBADO", "APROBADO", "APROBADO", 
    "APROBADO", NA), año = structure(c(1577836800, 1577836800, 
    1577836800, 1577836800, 1577836800, 1577836800), class = c("POSIXct", 
    "POSIXt"), tzone = "UTC"), cuatri = structure(c(1585699200, 
    1585699200, 1585699200, 1585699200, 1585699200, 1585699200
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), mes = structure(c(1590969600, 
    1590969600, 1590969600, 1590969600, 1590969600, 1590969600
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), semana = structure(c(1590969600, 
    1590969600, 1590969600, 1590969600, 1590969600, 1591574400
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), dia = structure(c(1591228800, 
    1591228800, 1591228800, 1591228800, 1591228800, 1591574400
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), PMIGTDTipo = c("Presión Autoclave Presión del autoclave tras 7 minutos Single Numeric", 
    "Tiempo Presurización Tiempo hasta 6 bares Single Numeric", 
    "Videoscopio Control videoscopio OK s/ficha de producto Boolean", 
    "Hora Arranque Ciclo Hora de ciclo Single Numeric", "Tiempo de despresurización Tiempo de despresurización Single Numeric", 
    "Peso Molde Peso del Molde Single Numeric")), row.names = c(NA, 
6L), class = "data.frame")

It will be much easier if you can provide a reproducible example, with example data we can load. It's often easiest to use the `dput` function to create a code "recipe" for recreating data, for example `dput(head(correccionpmis))`. You can paste the output of that into the body of your question. — Jon Spring, Jul 20 '22 at 16:10
I expect something like this should work much faster: `library(dplyr); correccionpmis %>% left_join(datapmi, by = "PMIGTDTipo") %>% mutate(PMI = coalesce(PMIcorregido, PMI), Inspection = coalesce(Tipocorregido, Inspection_Procedure))`. If you can include example data it will be possible for me to test & confirm. — Jon Spring, Jul 20 '22 at 16:17
Hello Jon, thanks for your answer. I've tried your solution but I can't make it work. I'm adding the dput output to the body; I hope it makes things clearer. — AeroProgrammer, Jul 20 '22 at 16:28
You can use the `dplyr` function `rows_update`: `datapmi = rows_update( datapmi, select(correccionpmis, PMIGTDTipo, PMI = PMIcorregido, Inspection_Procedure = Tipocorregido), by = "PMIGTDTipo")`. It will only work if all `PMIGTDTipo` values in `correccionpmis` are also in `datapmi` (so it won't work on the small sample of data), but it should work on your whole problem if that assumption is met. — Gregor Thomas, Jul 20 '22 at 18:37
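The `rows_update` suggestion above can be sketched on toy data as follows. This is only an illustration under some assumptions: the key values ("key A", "key B", ...) and the toy rows are invented, the correction columns are renamed to the `datapmi` columns that the question's loop overwrites (`PMI` and `Inspection_Procedure`), and `unmatched = "ignore"` assumes dplyr 1.1.0 or later. With that argument, correction keys that are absent from `datapmi` are skipped instead of raising an error.

```r
library(dplyr)

# Toy stand-ins for the real tables (values are invented for illustration).
datapmi <- data.frame(
  PMI = c("Peso Molde", "Aceptabilidad Rechupe", "Videoscopio"),
  Inspection_Procedure = c("Single Numeric", "Single Numeric", "Boolean"),
  PMIGTDTipo = c("key A", "key B", "key C"),
  stringsAsFactors = FALSE
)
correccionpmis <- data.frame(
  PMIcorregido = c("Aceptabilidad de Rechupe", "Sin uso"),
  Tipocorregido = c("Boolean", "Boolean"),
  PMIGTDTipo = c("key B", "key Z"),  # "key Z" has no match in datapmi
  stringsAsFactors = FALSE
)

# Rename the correction columns to the datapmi columns they should overwrite.
correcciones <- correccionpmis %>%
  select(PMIGTDTipo, PMI = PMIcorregido, Inspection_Procedure = Tipocorregido)

# unmatched = "ignore" (dplyr >= 1.1.0) skips correction keys missing from datapmi.
resultado <- rows_update(datapmi, correcciones,
                         by = "PMIGTDTipo", unmatched = "ignore")
```

Note that `rows_update` expects the key values in the correction table to be unique, so a duplicated `PMIGTDTipo` in `correccionpmis` would need to be resolved first.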

score 2 · Accepted Answer · answered Jul 20 '22 at 18:38

I had it backwards in my comment, I think you want:

library(dplyr)
datapmi %>%
  left_join(correccionpmis, by = "PMIGTDTipo") %>%
  mutate(PMI = coalesce(PMI.y, PMI.x),
         Inspection_Procedure = coalesce(Tipocorregido, Inspection_Procedure))

This takes each row of datapmi, matches it on the PMIGTDTipo column to the corresponding row(s) of correccionpmis, and then replaces the PMI and Inspection_Procedure columns with the replacement values where they are available. Both tables have a PMI column, so after the join the original is renamed PMI.x and the joined one PMI.y; coalesce() looks at the replacement first and falls back to the original PMI when the replacement is missing (NA). Tipocorregido exists only in correccionpmis, so it keeps its name.
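A small self-contained sketch of the same join-and-coalesce idea, in case it helps to see it run. The toy rows and key values are invented, and this variation makes two assumptions that go beyond the code above: it takes the replacement values from PMIcorregido and Tipocorregido (the columns the question's loop uses), and it selects only the columns it needs from correccionpmis before joining, so no .x/.y suffixes appear and the result keeps the original column layout.

```r
library(dplyr)

# Toy stand-ins for the real tables (values are invented for illustration).
datapmi <- data.frame(
  PMI = c("Peso Molde", "Aceptabilidad Rechupe"),
  Inspection_Procedure = c("Single Numeric", "Single Numeric"),
  PMIGTDTipo = c("key A", "key B"),
  stringsAsFactors = FALSE
)
correccionpmis <- data.frame(
  PMI = "Aceptabilidad Rechupe",
  PMIcorregido = "Aceptabilidad de Rechupe",
  Tipocorregido = "Boolean",
  PMIGTDTipo = "key B",
  stringsAsFactors = FALSE
)

# A left join returns one output row per match, so duplicated correction keys
# would duplicate rows of datapmi; fail early if that is the case.
stopifnot(anyDuplicated(correccionpmis$PMIGTDTipo) == 0)

corregido <- datapmi %>%
  # Bring in only the key and the two replacement columns.
  left_join(select(correccionpmis, PMIGTDTipo, PMIcorregido, Tipocorregido),
            by = "PMIGTDTipo") %>%
  # Use the correction when present, otherwise keep the original value.
  mutate(PMI = coalesce(PMIcorregido, PMI),
         Inspection_Procedure = coalesce(Tipocorregido, Inspection_Procedure)) %>%
  # Drop the helper columns to get back the original layout.
  select(-PMIcorregido, -Tipocorregido)
```

The result has to be assigned (here to `corregido`); the pipeline does not modify datapmi in place.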

Thank you very much, Jon. I got it to work instantly with this method. Just a small comment: I actually needed to replace the PMI column with the PMIcorregido column, so my final code looks like this: `datapmi %>% left_join(correccionpmis, by = "PMIGTDTipo") %>% mutate(PMI = coalesce(PMIcorregido, PMI.x), Inspection_Procedure = coalesce(Tipocorregido, Inspection_Procedure))` — AeroProgrammer, Jul 21 '22 at 06:42
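For readers who would rather not depend on dplyr, the same lookup can probably be done in base R with `match()`, which replaces the 400 passes over datapmi with a single vectorised lookup. This is a different technique from the accepted answer, shown here only as a sketch on invented toy data. One behavioural difference from the original loop: if a key appeared more than once in correccionpmis, `match()` would use the first occurrence, whereas the loop ends up applying the last one.

```r
# Toy stand-ins for the real tables (values are invented for illustration).
datapmi <- data.frame(
  PMI = c("Peso Molde", "Aceptabilidad Rechupe", "Aceptabilidad Rechupe"),
  Inspection_Procedure = c("Single Numeric", "Single Numeric", "Single Numeric"),
  PMIGTDTipo = c("key A", "key B", "key B"),
  stringsAsFactors = FALSE
)
correccionpmis <- data.frame(
  PMIcorregido = "Aceptabilidad de Rechupe",
  Tipocorregido = "Boolean",
  PMIGTDTipo = "key B",
  stringsAsFactors = FALSE
)

# For every row of datapmi, the row number of its correction (NA when there is none).
idx <- match(datapmi$PMIGTDTipo, correccionpmis$PMIGTDTipo)
hit <- !is.na(idx)

# Overwrite only the matched rows, all at once.
datapmi$PMI[hit] <- correccionpmis$PMIcorregido[idx[hit]]
datapmi$Inspection_Procedure[hit] <- correccionpmis$Tipocorregido[idx[hit]]
```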
