0

I've got lots of data and one of the columns is a free text description. I'm trying to process this in SAS and as part of this I want to correct some spelling and remove some words that don't really add any value to what the text is saying.

I've noticed there's a quite a few googlemaps links that have just been copied into quite a few of these descriptions. And I'm trying to remove all of them.

I've got ways of removing complete words and phrases I define, but all these googlemaps links are slightly different so is there a way of removing all the different instances of these types of links? In the example below I've put three different ways that the google maps links have been copied into my data:

  1. www.google.co.uk/maps/@51.34735456-2.9327
  2. https://goo.gl/maps/jFh9RXXm
  3. https://www.google.com.br/maps/place/Howard+Rd

So is there a way for example of removing just the characters starting from "https://goo", "https://www.goo" and/or "www.goo" all the way up to the next space? And then replacing that with the word "googlemapslink"? Or a way of removing the entire string bound by spaces which contains the string "/maps/"?

Any thoughts would be greatly appreciated :)

Code below (which works, but isn't really practical as I'll have to go through the whole data to first get a list of all various forms of the google maps links):

data have;
  infile datalines dsd truncover;
  input ID Description :$500. Col3 $ Col4 Col5 Col6;
datalines;
1,bla bla lay bye my mybla,C1,0,100,0
2,got laybye me tear,C1,0,0,0
3,free mug text i google by,C1,10,100,0
4,house www.google.co.uk/maps/@51.34735456-2.9327 roof tree!?,C1,10,100,0
5,Mug house https://goo.gl/maps/jFh9RXXm mugg muggle,C1,10,0,0
6,mug sky** lay mug by by lay computer https://www.google.com.br/maps/place/Howard+Rd mug mug mugs,C3,0,20,1
;

/* change instances of google maps links to "googlemapslink"*/
data data_1;
set have;
Description_new = Description;
Description_new = tranwrd(Description_new," mug ", " cup ");
Description_new = tranwrd(Description_new," https://goo.gl/maps/jFh9RXXm ", " googlemapslink ");
Description_new = tranwrd(Description_new," https://www.google.com.br/maps/place/Howard+Rd ", " googlemapslink ");
Description_new = tranwrd(Description_new," www.google.co.uk/maps/@51.34735456-2.9327 ", " googlemapslink ");
run;
uwd
  • 45
  • 6
  • Use SCAN() like in your other post and test if the pulled out word starts with your string. `if lowcase(word)=:'https://www.google.com.br/maps/' then word='/maps/'` https://stackoverflow.com/questions/70381549/is-there-a-better-way-replace-several-words-in-a-string-with-another-word-sas/70383048?noredirect=1#comment124436351_70383048 – Tom Dec 17 '21 at 16:00
  • I'm struggling to see how to use that bit of code with SCAN(). This finds the rows where the description contains a word starting as the urls I'm looking at, but it then creates a flag in a new column. How can I get it to replace the word instead? `data want; set have; fixed=description; fixed=' '; do index=1 to countw(description,' '); if lowcase(scan(description,index,' '))=:'https://www.goo' then word='/maps/'; if lowcase(scan(description,index,' '))=:'https://goo' then word='/maps/'; if lowcase(scan(description,index,' '))=:'www.goo' then word='/maps/'; end; run;` – uwd Dec 20 '21 at 13:59
  • You left out the part from the other answer where the new string is built by re-combining the words pulled from the original string. – Tom Dec 20 '21 at 14:52

1 Answers1

0

It is a qusetion about how to find url in free text and replace it by specified string. I would recommend you to use prxchange function.

data data_1;
set have;
Description_new = prxchange('s/(https:\/\/)?(\w*[\.\-\+\@\/]\w*)+/googlemaplink/',-1,Description);
run;

I write a very simple pattern to capture url in your example text, if you wanna go further, see How can I check if a given string is a valid URL address.

whymath
  • 1,262
  • 9
  • 16
  • This also replaces any punctuation with the word too. So if I had the word "bla." instead of "bla" in the first line of that data then this code would replace "bla." with "googlemaplink" as well as the actual urls. Is there a way to adjust it to only pick up those words that start with "https://" and/or "www."? – uwd Dec 20 '21 at 13:11
  • Well, you would like to change the text pattern for this new situation: `Description_new = prxchange('s/(https?:\/\/|www\.)(\w*[\.\-\+\@\/]\w*)+/googlemaplink/',-1,Description); ` – whymath Dec 21 '21 at 01:00