14

How can I remove the URLs present in a text, for example

String str="Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";

using a regular expression?

I want to remove all the URLs in the text, but my code is not working:

String pattern = "(http(.*?)\\s)";
Pattern pt = Pattern.compile(pattern);
Matcher namemacher = pt.matcher(input);
if (namemacher.find()) {
  str=input.replace(namemacher.group(0), "");
}
Serg M Ten
NLP JAVA

8 Answers

22

Pass in the String that contains the URLs:

private String removeUrl(String commentstr)
    {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        int i = 0;
        while (m.find()) {
            commentstr = commentstr.replaceAll(m.group(i),"").trim();
            i++;
        }
        return commentstr;
    }
Ev0oD
NLP JAVA
5

Well, you haven't provided any info about your text, so with the assumption of your text looking like this: "Some text here http://www.example.com some text there", you can do this:

String yourText = "blah-blah";
String cleartext = yourText.replaceAll("http.*?\\s", " ");

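This will replace every sequence that starts with "http" and runs up to and including the next whitespace character with a single space. A URL at the very end of the text, with no whitespace after it, is not matched.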

You should read the Javadoc on String class. It will make things clear for you.

Favonius
svz
4

How do you define a URL? You might want to filter not just http:// but also https:// and other protocols like ftp://, rss:// or custom protocols.

Maybe this regular expression would do the job:

[\S]+://[\S]+

Explanation:

  • one or more non-whitespaces
  • followed by the string "://"
  • followed by one or more non-whitespaces
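As a minimal sketch of how this expression could be applied in Java: the class and method names below are made up for illustration, and collapsing the leftover whitespace is an extra step that is not part of the answer. Note that anything glued to the URL without whitespace (an opening parenthesis, for instance) is removed along with it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the answer above.
class SchemeUrlStripper {
    // One or more non-whitespace chars, then "://", then one or more non-whitespace chars.
    private static final Pattern URL = Pattern.compile("\\S+://\\S+");

    static String strip(String text) {
        // Remove every match, then collapse the whitespace runs left behind.
        String withoutUrls = URL.matcher(text).replaceAll("");
        return withoutUrls.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String tweet = "Fear psychosis after #AssamRiots - "
                + "http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
        System.out.println(strip(tweet)); // Fear psychosis after #AssamRiots -
    }
}
```

Because the match ends at the next whitespace or at the end of the input, a URL in the last position is removed as well.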
Philipp
  • I have the string #AssamRiots: Situation calm in Dhubri; curfew relaxed for 2 hours - Daily Bhaskar http://t.co/ocq6RNFI – NLP JAVA Sep 11 '12 at 09:38
  • The regular expression I posted should also work when the URL is at the end of the message. When there are no whitespaces after the URL, it matches until the end of the message. At least it does on http://regexpal.com/ – Philipp Sep 11 '12 at 09:46
  • Why are you asking me when you went with the solution by svz? – Philipp Sep 11 '12 at 11:02
4

Note that if your URL contains characters like ? and & then the answers above may not work, because replaceAll interprets its first argument as a regular expression and special characters such as ? are not matched literally. What worked for me was to strip those characters from a copy of the string, strip them from each match returned by m.find() as well, and then call replaceAll on the copy.

private String removeUrl(String commentstr)
{
    // get rid of ? and & in URLs, since replaceAll treats its first argument as a regex
    String commentstr1 = commentstr.replaceAll("\\?", "").replaceAll("\\&", "");

    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    while (m.find()) {
        // group(0) is the whole matched URL; strip the same characters from it,
        // then remove it from the cleaned copy so earlier removals are kept
        commentstr1 = commentstr1.replaceAll(m.group(0).replaceAll("\\?", "").replaceAll("\\&", ""),"").trim();
    }
    return commentstr1;
}
John81
1

As @Ev0oD mentioned, the code works perfectly except on the following tweet I'm working with: RT @_Val83_: The cast of #ThorRagnarok playing "Ragnarok Paper Scissors" #TomHiddleston #MarkRuffalo (https://t.co /k9nYBu3QHu)

At the line where the token is removed: commentstr = commentstr.replaceAll(m.group(i),"").trim();

I have faced the following error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 22

where m.group(i) is https://t.co /k9nYBu3QHu)
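The exception arises because String.replaceAll compiles its first argument as a regular expression, so a matched URL that ends in ")" is not a valid pattern. A minimal sketch of two ways around this follows; the class name, method names, and sample text are invented for illustration. String.replace treats the token literally, and Pattern.quote makes replaceAll do the same.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of any answer in this thread.
class LiteralReplaceDemo {
    // String.replace takes the token literally, so ')' needs no escaping.
    static String removeLiteral(String text, String token) {
        return text.replace(token, "").trim();
    }

    // Same effect with replaceAll, after quoting the token as a literal pattern.
    static String removeQuoted(String text, String token) {
        return text.replaceAll(Pattern.quote(token), "").trim();
    }

    public static void main(String[] args) {
        String text = "see (https://t.co/abc) now"; // made-up sample text
        String token = "https://t.co/abc)";         // a match with a stray ')'
        try {
            text.replaceAll(token, "");             // token is compiled as a regex here
        } catch (PatternSyntaxException e) {
            System.out.println("replaceAll failed: " + e.getDescription());
        }
        System.out.println(removeLiteral(text, token));
        System.out.println(removeQuoted(text, token));
    }
}
```

Both helpers leave "see ( now" for the sample above; whether the stray parenthesis should also be removed depends on the URL pattern used to find the token.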

Mir Saman
0

m.group(0) should be replaced with an empty string rather than m.group(i) where i is incremented with every call to m.find() as mentioned in one of the answers above.

private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
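    StringBuffer sb = new StringBuffer(commentstr.length());
    while (m.find()) {
        // copies the text before the match into sb and drops the URL itself
        m.appendReplacement(sb, "");
    }
    // copy whatever follows the last match
    m.appendTail(sb);
    return sb.toString();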
}
0
"Hello https://www.google.com/hello - visit us here!".replaceAll("((https?|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)", "");

evaluates to:

Hello  - visit us here!

Optionally add a space before 'https' and 'http' in the regex to strip the space before the URL as well.
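A small sketch of that variant, assuming the space is made optional with " ?" so that a URL at the very start of the text still matches. The class and method names are hypothetical.

```java
// Hypothetical wrapper around the expression from this answer.
class UrlWithLeadingSpace {
    // Same expression as above, with an optional single space in front.
    private static final String URL_REGEX =
            " ?((https?|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";

    static String strip(String text) {
        return text.replaceAll(URL_REGEX, "");
    }

    public static void main(String[] args) {
        // prints: Hello - visit us here!
        System.out.println(strip("Hello https://www.google.com/hello - visit us here!"));
    }
}
```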

Oleg
-3

If you can move to Python, the same thing can be done with the following code:

import re
text = "<hello how are you ?> then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
result = re.sub(r"ftp\S+", "", text)
print(result)
Shubham Sharma