Hi friends, I have asked a related question here. The problem is that txt (keywords) containing punctuation are not detected. I tried to make the answer generic but failed.

Basically, I have txt (keywords), some with punctuation and some without, which I need to search for in a file, toSearch.

For example, these are the contents of my file toSearch:

 [1]'Nokia. Okay. R: Samsung R: Samsung M: And you have? R: I have Micromax'
 [2]'M: Okay, you have taken car. R: I have (Mahindra Scorpio and Mahindra's) this Duro DZ.M: Okay.'
 [3]'M: What is your age ? R: 32 years R: My name is "Nitish". I have Interior designing business.'
 [4]'R: 3rd, Not extra spicy. R: 4th, Fresh. R: 5th, Variety. R: 6th, Hygienic environment'
 [5]'How you feel? How it should be? We will move forward, if there we have to make an ideal'
 [6]'What is the strength of your organisation? How many people a re working.'
 [7]'R: Read newspaper R:Had breakfast with family.'

and the txt (keywords) are below. I have used #@ to separate the keywords, since I cannot use a comma (,).

 txt <- "R: Samsung R: Samsung M:#@I have (Mahindra Scorpio and Mahindra's)#@R: 32 years R: My name is \"Nitish\"#@R: 4th, Fresh. R: 5th, Variety#@How you feel? How it should be?"
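
A minimal sketch of turning that delimited string into a vector of keywords, assuming the `#@` separator never occurs inside a keyword. Passing `fixed = TRUE` makes `strsplit` treat the separator literally rather than as a regular expression. The name `keywords` is only illustrative.

```r
# The inner double quotes around Nitish must be escaped inside an R string.
txt <- "R: Samsung R: Samsung M:#@I have (Mahindra Scorpio and Mahindra's)#@R: 32 years R: My name is \"Nitish\"#@R: 4th, Fresh. R: 5th, Variety#@How you feel? How it should be?"

# Split on the literal separator "#@" (no regex interpretation).
keywords <- strsplit(txt, "#@", fixed = TRUE)[[1]]

length(keywords)  # 5
keywords[2]       # "I have (Mahindra Scorpio and Mahindra's)"
```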

My expected output is to find each occurrence and replace the spaces within the keywords with an underscore (_):

 [1]'Nokia. Okay. R:_Samsung_R:_Samsung_M: And you have? R: I have Micromax'
 [2]'M: Okay, you have taken car. R: I_have_(Mahindra_Scorpio_and_Mahindra's) this Duro DZ.M: Okay.'
 [3]'M: What is your age ? R:_32_years_R:_My_name_is_"Nitish". I have Interior designing business.'
 [4]'R: 3rd, Not extra spicy. R:_4th,_Fresh._R:_5th,_Variety. R: 6th, Hygienic environment'
 [5]'How_you_feel?_How_it_should_be? We will move forward, if there we have to make an ideal'
 [6]'What is the strength of your organisation? How many people a re working.'
 [7]'R: Read newspaper R:Had breakfast with family.'

If that is unclear: it is simple Find And Replace Text (FART) functionality, where only the spaces are replaced by _.

I have tried the following regular-expression approach:

for(i in 1:length(txt))
{
    #finding the first word of the keyword 
    start <- head(strsplit(txt, split=" ")[[i]], 1)  
    n <- stri_stats_latex(txt[i])[4] 

    #all possible occurrences for the keywords in the text
    o<-unlist(regmatches(toSearch,gregexpr(paste0(start,"(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,",n-1,"}"),toSearch,ignore.case=TRUE)))  

    #exact match with the result
    p<-which(!is.na(pmatch(txt,o)))  

    #replace the keywords in the text file.
    text<-as.character(replace_all(text,txt[p],str_replace_all(txt[p]))) 
}
oguz ismail
OnkarK
  • I see only spaces are replaced, but depending on what? I could not understand your explanation – aelor May 27 '14 at 05:10
  • I don't think anyone has ever abbreviated "Find And Replace Text". Ever. –  May 27 '14 at 05:14
  • @LegoStormtroopr - maybe once before: http://fart-it.sourceforge.net/ – thelatemail May 27 '14 at 05:16
  • [FART](http://www.abbreviations.com/FART) gotta love it – MattSizzle May 27 '14 at 05:19
  • @aelor Basically I need to find the occurrences of the keywords in a list of files, so I am trying to search for the keywords (with punctuation) in the files. If a keyword exists, its spaces should be replaced with `_` in the file for every occurrence, so that I can find the frequency and index later. – OnkarK May 27 '14 at 05:26
  • @thelatemail I have tried using the same link. @LegoStormtroopr any other option in R? – OnkarK May 27 '14 at 05:32

2 Answers

You have to be very careful with punctuation when working with regular expressions. If you want an exact match, it's best not to use regular expressions at all and to set fixed=TRUE (as accepted by grep and gsub), so the keyword is matched literally. You can then do the find and replace using Reduce.

#input data
target<-c("Nokia. Okay. R: Samsung R: Samsung M: And you have? R: I have Micromax", 
"M: Okay, you have taken car. R: I have (Mahindra Scorpio and Mahindra's) this Duro DZ.M: Okay.", 
"M: What is your age ? R: 32 years R: My name is \"Nitish\". I have Interior designing business.", 
"R: 3rd, Not extra spicy. R: 4th, Fresh. R: 5th, Variety. R: 6th, Hygienic environment", 
"How you feel? How it should be? We will move forward, if there we have to make an ideal", 
"What is the strength of your organisation? How many people a re working.", 
"R: Read newspaper R:Had breakfast with family.")

kw<-c("R: Samsung R: Samsung M:", "I have (Mahindra Scorpio and Mahindra's)", 
"R: 32 years R: My name is \"Nitish\"", "R: 4th, Fresh. R: 5th, Variety", 
"How you feel? How it should be?")

Here we use Reduce to successively replace each of the keywords in the target text:

Reduce(function(text, k) gsub(k, gsub(" ", "_", k), text, fixed = TRUE),
    kw, init = target, accumulate = FALSE)

# [1] "Nokia. Okay. R:_Samsung_R:_Samsung_M: And you have? R: I have Micromax"                         
# [2] "M: Okay, you have taken car. R: I_have_(Mahindra_Scorpio_and_Mahindra's) this Duro DZ.M: Okay." 
# [3] "M: What is your age ? R:_32_years_R:_My_name_is_\"Nitish\". I have Interior designing business."
# [4] "R: 3rd, Not extra spicy. R:_4th,_Fresh._R:_5th,_Variety. R: 6th, Hygienic environment"          
# [5] "How_you_feel?_How_it_should_be? We will move forward, if there we have to make an ideal"        
# [6] "What is the strength of your organisation? How many people a re working."                       
# [7] "R: Read newspaper R:Had breakfast with family." 
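
To illustrate why the literal match matters, the small check below compares the default regex interpretation with `fixed = TRUE` for two of the keywords above. It is only a sketch; `line2`, `line5`, `kw2` and `kw5` are illustrative names, not part of the code above.

```r
line2 <- "M: Okay, you have taken car. R: I have (Mahindra Scorpio and Mahindra's) this Duro DZ.M: Okay."
kw2   <- "I have (Mahindra Scorpio and Mahindra's)"

# As a regex, "(" and ")" delimit a group, so the literal parentheses are never matched.
grepl(kw2, line2)                # FALSE
# With fixed = TRUE the keyword is compared character by character.
grepl(kw2, line2, fixed = TRUE)  # TRUE

line5 <- "How you feel? How it should be? We will move forward, if there we have to make an ideal"
kw5   <- "How you feel? How it should be?"

# As a regex, "?" makes the preceding character optional instead of matching a question mark.
grepl(kw5, line5)                # FALSE
grepl(kw5, line5, fixed = TRUE)  # TRUE
```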

I hope this helps your FART-ing.

MrFlick

A simplified example that should work for the larger problem.

toSearch <- c("this is some text","something else to search")
txt <- c("is some#@else to")
txt <- strsplit(txt,"#@")[[1]]
txtundsc <- gsub("\\s+","_",txt)

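# fixed = TRUE matches each keyword literally, so punctuation such as ( ) ? is not read as regex syntax
for(i in seq_along(txt)) { toSearch <- gsub(txt[i], txtundsc[i], toSearch, fixed = TRUE) }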
toSearch
# [1] "this is_some text"        "something else_to search"
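
For keywords that contain punctuation, the same loop could be wrapped in a small helper and followed by the frequency count mentioned in the comments. This is a sketch under assumptions: `underscore_keywords` is a hypothetical name, keywords are matched literally via `fixed = TRUE`, and they are applied in the order given.

```r
# Hypothetical helper: replace spaces by "_" inside every literal keyword match.
underscore_keywords <- function(text, keywords) {
  for (k in keywords) {
    text <- gsub(k, gsub(" ", "_", k, fixed = TRUE), text, fixed = TRUE)
  }
  text
}

toSearch <- c("R: 3rd, Not extra spicy. R: 4th, Fresh. R: 5th, Variety. R: 6th, Hygienic environment",
              "How you feel? How it should be? We will move forward")
txt <- "R: 4th, Fresh. R: 5th, Variety#@How you feel? How it should be?"
keywords <- strsplit(txt, "#@", fixed = TRUE)[[1]]

result <- underscore_keywords(toSearch, keywords)
result
# [1] "R: 3rd, Not extra spicy. R:_4th,_Fresh._R:_5th,_Variety. R: 6th, Hygienic environment"
# [2] "How_you_feel?_How_it_should_be? We will move forward"

# Count how often each underscored keyword occurs in each line
# (rows are lines of toSearch, columns are keywords).
underscored <- gsub(" ", "_", keywords, fixed = TRUE)
counts <- sapply(underscored, function(k) {
  lengths(regmatches(result, gregexpr(k, result, fixed = TRUE)))
})
```

Because the replacement happens one keyword at a time, a keyword that overlaps an earlier one will no longer match once the earlier one has been underscored; ordering the keywords from longest to shortest is one way to reduce that risk.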
thelatemail