1

I have 2 files,

1.File 1 DDA_File with data set of 12 columns and 124348 objects

  1. Reference sheet (Ref_File)having 4 columns and 30 objects including header.

My objective is to look for value in 12th column of DDA_File in 1st column of Ref_File. If match found, then pick value against the match from 2nd column of Ref_File and paste it in 2nd column of DDA_File.

Git link for sample data input,RS and sample output

I tried with nested loop traversing through each object of file 1 to pick for match and condition check for match.

#Base file for work

DDA_File=read.csv(file_location,header = TRUE)

# Reading reference file to enrich SAP
Ref_File=read.csv(file_location1,header = TRUE)

num_of_rows_DDA <- nrow(DDA_File)
num_of_rows_Ref <- nrow(Ref_File)

#Placeholder for data insertion
output <- data.frame(matrix(data = 0, nrow = num_of_rows_DDA, ncol=2, byrow=T))
no_entry<-data.frame( matrix(data=0) )

#For loop for traversing through each element of DDA file
system.time( for( i in 1:num_of_rows_DDA)
                {
  #For Loop for traversing through each row of Reference Sheet
                 for(j in 1:num_of_rows_Ref)
                    {
    #Condition check
                    if(DDA_File[i,12]==Ref_File[j,1])
                    {
                     output[i,1]<-paste(DDA_File[j,2],"-",Ref_File[j,2])
                     output[i,2]<-Ref_File[j,3]
                     break
                    }
                    else{
                      no_entry<-DDA_File[i,12]
                        }

                    }

})
print("No Match found in reference sheet for :")

Actual output is repetition of 6 entries in all objects.

                              X1 X2
1      I-DL-DLHI-ENB-A291 - Beta  2
2     I-DL-DLHI-ENB-A291 - Gamma  2
3     I-DL-DLHI-ENB-A291 - Gamma  5
4      I-DL-DLHI-ENB-A291 - Beta  4
5     I-DL-DLHI-ENB-A291 - Gamma  4
6     I-DL-DLHI-ENB-A291 - Alpha  4
7      I-DL-DLHI-ENB-3218 - Beta  6
8     I-DL-DLHI-ENB-A291 - Gamma  3
9     I-DL-DLHI-ENB-A291 - Alpha  2
10    I-DL-DLHI-ENB-3218 - Gamma  6
11    I-DL-DLHI-ENB-3218 - Alpha  1
12     I-DL-DLHI-ENB-3218 - Beta  1
13    I-DL-DLHI-ENB-3218 - Gamma  1
14    I-DL-DLHI-ENB-3218 - Alpha  6
24    I-DL-DLHI-ENB-A291 - Alpha  3
30    I-DL-DLHI-ENB-A291 - Alpha  5
89     I-DL-DLHI-ENB-A291 - Beta  5
94     I-DL-DLHI-ENB-A291 - Beta  3
4440  I-DL-DLHI-ENB-A291 - Gamma  7
9784   I-DL-DLHI-ENB-A291 - Beta  7
15856 I-DL-DLHI-ENB-A291 - Alpha  7
MPandey
  • 23
  • 5
  • 1
    Could you add a sample of your data with `dput` and expected output? Is there any specific reason you're using loops? – NelsonGon May 20 '19 at 04:14
  • 1
    Sounds like some form of join/merge (look into base R's `merge` or `dplyr`'s `left_join`); there shouldn't be a need for an explicit `for` loop here, and this should be a matter of a few lines of code. I second @NelsonGon's request to include [reproducible & minimal sample data along with your expected output.](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Maurits Evers May 20 '19 at 05:05
  • Makes no sense to have code that follows `break`. – IRTFM May 20 '19 at 05:28
  • I think merge and left join will give me all the columns to be merged with a common column.In this case it has matching value.I am not very specific to use loop,if there are other methods to achieve this would be great. Since it's easy and had worked in past hence used. – MPandey May 20 '19 at 05:34
  • @MPandey **Please include reproducible sample data in your post!** Otherwise any help we can provide is speculative. – Maurits Evers May 20 '19 at 05:54
  • @MPandey Let me expand on my previous comment: I really would suggest taking some time coming up with a good & minimal sample dataset. For example, do you really need all 12 columns here of `DDA_File`? Do you need column 3 of `Ref_File`? Good & minimal sample data means choosing data that is minimal yet representative of your actual problem. Then include that data in your main post using `dput` (many people are loath to download data from secondary sources). – Maurits Evers May 20 '19 at 06:03

1 Answers1

0

For future posts, please consider the comments above regarding what constitutes a "good" minimal & reproducible example.

Concerning your question, as I said in my comment above, this is in essence a simple join/merge; here is a tidydverse option

library(tidyverse)
DDA_File %>%
    left_join(Ref_File, by = c("cnum" = "Cell.Num")) %>%
    unite(SAP_Sec, NE_NAME.ACCESSIBILITY, Sector, sep = " - ") %>%
    select(SAP_Sec, starts_with("X"))
#                      SAP_Sec X18.Dec.18 X19.Dec.18 X20.Dec.18 X21.Dec.18
#1  I-DL-BADQ-ENB-I001 - Alpha      98.06      95.73      98.66      98.66
#2   I-DL-BADQ-ENB-I001 - Beta      98.06      95.73      98.66      98.66
#3   I-DL-BADQ-ENB-I001 - Beta      45.76      56.54      53.41      67.27
#4  I-DL-BADQ-ENB-I001 - Gamma      45.76      56.54      53.41      67.27
#5  I-DL-BADQ-ENB-I001 - Gamma      93.17      96.37      87.67      89.75
#6  I-DL-BADQ-ENB-I001 - Alpha      93.17      96.37      87.67      89.75
#7  I-DL-BADQ-ENB-I001 - Alpha      98.46      98.42      92.36      97.59
#8   I-DL-BADQ-ENB-I001 - Beta      92.21      98.03      89.21      93.04
#9  I-DL-BADQ-ENB-I001 - Gamma      98.61      96.69      98.64      98.65
#10 I-DL-BADQ-ENB-I001 - Alpha      98.72      97.54      78.89      95.15
#... 

Explanation: We left-join DDA_File and Ref_File by matching DDA_File$cnum and Ref_File$Cell.Num and concatenate DDA_File$NE_NAME.ACCESSIBILITY with the matching Ref_File$Sector values; we finally select those columns that you show in your expected output.


Or the same in base R using merge

df.new <- transform(
    merge(DDA_File, Ref_File, by.x = "cnum", by.y = "Cell.Num"),
    SAP_Sec = paste(NE_NAME.ACCESSIBILITY, Sector, sep = " - "))
df.new[, c("SAP_Sec", names(df.new)[grep("^X", names(df.new))])]

Explanation: In the first step we merge the two datasets and create the new column SAP_Sec; in the last step we select relevant columns to reproduce your expected output.


Either approaches should be considerably faster (and perhaps more importantly, much cleaner) than the for loop approach.

Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Thanks for the answer, I appreciate the effort in downloading the files for providing answer. For future questions will follow suggestion for good data samples.Being new to Stackoverflow was not aware. Once again thanks for the answer and efforts. – MPandey May 20 '19 at 06:59
  • Just checked with complete data set and still same issue which I had with the loop .In sample data set 40 observation has increased to 58 observations and thus duplicate entries. I think it's data but this is how all of my future files are going to be as well. – MPandey May 20 '19 at 07:43
  • @MPandey That means that somewhere you have one-to-many (or perhaps even many-to-many) mappings between `DDA_File$cnum` and `Ref_File$Cell.Num`; this is not an issue with the (any of the) code approaches, but something that requires *you* to devise an analysis strategy that makes sense (given the domain-specific context that you are working in) on how to deal with these duplicate entries . – Maurits Evers May 20 '19 at 07:56