2

I have sample texts like this:

EA SPORTS UFC  (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL

How do I remove URLs such as http://t.co/Wpwj0R1EQm and http://t.co/yPZXvNnugL from the text? I need to perform sentiment analysis and want clean words.

I am able to get rid of bad characters using a simple regex.

The pattern to remove is http://t.co/{Whatever-first-word}

Cannon
  • Are you trying to get rid of everything after https? Because then it is a simple regex. If not, how are you going to determine when to stop? E.g. `something I want something I want https://somethingIdontwant something I want`? – Jay Oct 23 '14 at 01:36

4 Answers

5

Regular expressions are your friend.

To simplify, treat the requirement as removing all URLs from a given string. If we accept that a URL is anything that starts with http and runs up to the next whitespace character (URLs cannot contain spaces), then something like the following should suffice. This regex finds any substring that starts with http (so it also catches https) and continues to the next whitespace or the end of the string, and replaces it with an empty string:

string text = "EA SPORTS UFC  (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL";

string cleanedText = Regex.Replace(text, @"http[^\s]+", "");

//cleanedText is now "EA SPORTS UFC  (Microsoft Xbox One, 2014) $40.00 via eBay  Tibet snake.... "
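Since the goal is clean words for sentiment analysis, the double space left behind by the replacement may also be unwanted. The following is a minimal sketch, not part of the original answer, that tightens the pattern to `\bhttps?://\S+` (so that ordinary words beginning with "http" are left alone) and then collapses leftover whitespace. The names `UrlCleaner` and `RemoveUrls` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class UrlCleaner
{
    // Removes http:// and https:// URLs (everything up to the next whitespace),
    // then collapses the runs of whitespace left behind and trims the ends.
    public static string RemoveUrls(string text)
    {
        string withoutUrls = Regex.Replace(text, @"\bhttps?://\S+", "");
        return Regex.Replace(withoutUrls, @"\s+", " ").Trim();
    }

    static void Main()
    {
        string text = "EA SPORTS UFC  (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL";
        // Prints: EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay Tibet snake....
        Console.WriteLine(RemoveUrls(text));
    }
}
```

Note that `\S+` also swallows any punctuation attached directly to the end of a URL, which is usually acceptable for this kind of cleanup.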
OJay
3
text = Regex.Replace(text, @"((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)", "");

The pattern above matches URLs like the ones you want to remove, for example

http://this.com/ah.aspx?id=1

in:

this is a url http://this.com/ah.aspx?id=1 sdfsdf

You can see this in action in a regex fiddle for it.
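To check locally what the pattern picks up, one option is to list the matches before replacing anything. This is a small sketch under the assumption that the pattern is used exactly as written in this answer; the names `UrlMatchDemo` and `FindUrls` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class UrlMatchDemo
{
    // Pattern copied unchanged from the answer above.
    const string UrlPattern =
        @"((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)";

    // Returns every substring of text that the pattern matches, in order.
    public static string[] FindUrls(string text)
    {
        MatchCollection matches = Regex.Matches(text, UrlPattern);
        var urls = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            urls[i] = matches[i].Value;
        }
        return urls;
    }

    static void Main()
    {
        // Prints: http://this.com/ah.aspx?id=1
        foreach (string url in FindUrls("this is a url http://this.com/ah.aspx?id=1 sdfsdf"))
        {
            Console.WriteLine(url);
        }
    }
}
```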

J0e3gan
Arun Ghosh
1

You can use the Substring(string, string) helper from this answer: https://stackoverflow.com/a/17253735/2577248

Step 1. sub = the substring between "http://" and the next " " (space).

Step 2. Replace "http://" + sub with "".

Step 3. Repeat until the string no longer contains "http://".

// A trailing space is appended so that a URL at the very end of the text still has a terminating space.
string str = @"EA SPORTS UFC  (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL" + " ";

while (str.Contains("http://")) {
    // Substring(string, string) is the extension method from the linked answer, not a built-in String method.
    string removedStr = str.Substring("http://", " ");
    str = str.Replace("http://" + removedStr, "");
}
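The same three steps can also be sketched without the helper from the linked answer, using only the built-in `String.IndexOf` and `String.Remove`. This is an illustrative variant rather than the answer's own code; the names `ManualUrlStripper` and `StripUrls` are hypothetical, and like the steps above it only handles the `http://` prefix.

```csharp
using System;

static class ManualUrlStripper
{
    // Repeatedly finds "http://" and removes everything from there up to
    // (but not including) the next space, or up to the end of the string.
    public static string StripUrls(string text)
    {
        const string prefix = "http://";
        int start;
        while ((start = text.IndexOf(prefix, StringComparison.Ordinal)) >= 0)
        {
            int end = text.IndexOf(' ', start);
            if (end < 0)
            {
                end = text.Length; // URL is the last token in the text
            }
            text = text.Remove(start, end - start);
        }
        return text;
    }

    static void Main()
    {
        string str = "EA SPORTS UFC  (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL";
        Console.WriteLine("[" + StripUrls(str) + "]");
    }
}
```

Because a URL at the end of the string is handled explicitly, this version does not need the extra trailing space.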
Community
0

Use Regex.Replace.

And I would try this pattern (written in .NET syntax and left unanchored so that it matches URLs inside longer text):

var regex_url_pattern = @"(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?";

Combined:

string output = Regex.Replace(input, regex_url_pattern, "", RegexOptions.IgnoreCase);

Darj