0

I have a data frame as df1 which contains a column of the name of the university as University_name and has 500000 number of rows. Now I have another data frame as df2 which contains 2 columns as university_name and university_aliases and has 150 rows. Now I want to match each university alias present in the university_aliases column with university names present in university_name_new.

a sample of df1$university_name

university of auckland
the university of auckland
university of warwick - warwick business school
unv of warwick
seneca college of applied arts and technology
seneca college
univ of auckland

sample of df2

University_Alias                  Univeristy_Name_new

univ of auckland                  university of auckland
universiry of auckland            university of auckland
auckland university               university of auckland
university of auckland            university of auckland
warwick university                university of warwick
warwick univercity                university of warwick
university of warwick             university of warwick
seneca college                    seneca college
unv of warwick                    university of warwick

I am expecting output like this

university of auckland
university of auckland
university of warwick
seneca college
seneca college

and I am using the following code but it is not working

 df$university_name[ grepl(df$university_name,df2$university_alias)] <- df2$university_name_new
girijesh96
  • 455
  • 1
  • 4
  • 16
  • Are all values in `df1$university_name` present in `df2$University_Alias` (eg, `university of warwick - warwick business school`)? – pogibas Mar 17 '18 at 10:12
  • Yes, In df2$University_Alias all unique alias of df1$university_name is present – girijesh96 Mar 17 '18 at 10:15
  • Actually it is not exact matching, I have gone through this https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-rightgone but you can see I have to match with this type of input 'university of warwick - warwick business school' .grepl work good in pattern matching but I am not able to match and update value in df1$university_name at the same time. – girijesh96 Mar 17 '18 at 10:17
  • Then try this: https://stackoverflow.com/questions/2231993/merging-two-data-frames-using-fuzzy-approximate-string-matching-in-r – pogibas Mar 17 '18 at 10:19
  • sorry but I am not very clear with the example mentioned in the link. Can you please help me to figure out how I can execute it for my problem as I am new to R so I have not that much expertise in R – girijesh96 Mar 17 '18 at 10:27

2 Answers2

1

You can use sapply and str_extract to get the desired result.

 # create sample data
df1 <- data.frame(university_name = c('university of auckland',
                                      'the university of auckland',
                                      'university of warwick - warwick business school',
                                      'seneca college of applied arts and technology',
                                      'seneca college'), stringsAsFactors = F)

# these are values to match (from df2)
vals <- c('university of auckland','university of warwick','seneca college')

# get the output
df1$output <- sapply(df1$university_name, function(z)({

    f <- vals[complete.cases(str_extract(string = z, pattern = vals))]
    return(f)

}), USE.NAMES = F)

print(df1)

                                  university_name                 output
1                          university of auckland university of auckland
2                      the university of auckland university of auckland
3 university of warwick - warwick business school  university of warwick
4   seneca college of applied arts and technology         seneca college
5                                  seneca college         seneca college

Update:

Based on my understanding, df2 already has one to one mapping of university_alias with university_name_new, so that problem comes down to checking if a university_alias is not present in df1, we remove it.

# check values for university_alias in university_name
maps2 <- as.character(df2$university_alias[which(df2$university_alias %in% df1$university_name)])

# remove unmatched rows from df2
df3 <- df2[df2$university_alias %in% maps2,]

print(df3)
            university_alias    university_name_new
1           univ of auckland university of auckland
4     university of auckland university of auckland
8             seneca college         seneca college
9             unv of warwick  university of warwick
YOLO
  • 20,181
  • 5
  • 20
  • 40
  • thank you @Manish Saraswat If I am not wrong then you are taking the value of df2$df2$University_Name_new in vals and matching it with df1$university_name but instead of this I want to match df2$university_alias with df1$university_name and if alias value present in the row of df1$university_name then I want to replace it with df2$University_Name_new present in front of that alias. I am trying to do with match function and with for loop but not getting any success. Can you help me please – girijesh96 Mar 17 '18 at 16:10
  • sir could you help me to get an optimal or suboptimal solution, please. It would be greatly appreciated. – girijesh96 Mar 18 '18 at 15:43
  • just to be sure, you want to match university_alias with university_name, and if the match is found, you want to replace university_alias with university_new_name ? Is that what you want ? – YOLO Mar 18 '18 at 17:56
  • yes, I have written a code using for loop and grepl function then at some places multiple updations is happening that is why my code is not working very efficiently. – girijesh96 Mar 19 '18 at 03:19
  • but 'auckland university' is not in university_name, why does it have a university_new_name ? It's confusing. – YOLO Mar 19 '18 at 12:18
  • as my university_name column is very big in size, it is containing nearly 150k rows. So there is a possibility of 'auckland university' in university_name that's why it is in university_alias and if this alias is found in the university_name column then it should be replaced with corresponding name present in university_new_name – girijesh96 Mar 19 '18 at 13:34
  • i am using following code for it 'code' indx1 = 0 for(i in df2$university_alias){ indx1 = (indx1 +1) df1$university_name[(grepl(as.character(i), df1$university_name))] <- df2$university_new_name[indx1] – girijesh96 Mar 19 '18 at 13:38
  • thank you for your kind help. I would like to know one more thing what kind of machine learning approach I can use to build a model for such work. – girijesh96 Mar 20 '18 at 02:32
  • This is not machine learning. This is just data manipulation stuff. – YOLO Mar 20 '18 at 03:08
0

You could do this

df2$University_Name_new[which(is.element(df2$University_Alias, df1$university_name))]
### which returns the following ####
[1] "university of auckland" "seneca college" 

Now for instance, in the data you provided the university of auckland is in df1$university_name but not in df2$University_Alias, which is why we have the following:

> which(is.element(df2$University_Alias, df1$university_name))
[1] 4 8

Indeed, from df1$university_name, only university of auckland and seneca college are contained in df2$University_Alias.

niko
  • 5,253
  • 1
  • 12
  • 32
  • thank you @nate.edwinton for your response but my problem is like this. if univ of auckland is found anywhere in df1$university_name with some other text string it like (eg. univ of auckland, New Zealand) then it should be replaced by university of auckland which is present against each row of aliases – girijesh96 Mar 17 '18 at 14:22