Removing the URL from text using Java

Question

How do I remove the URLs present in a text, for example

String str="Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";

using a regular expression?

I want to remove all the URLs in the text, but my code is not working:

String pattern = "(http(.*?)\\s)";
Pattern pt = Pattern.compile(pattern);
Matcher namemacher = pt.matcher(input);
if (namemacher.find()) {
  str=input.replace(namemacher.group(0), "");
}

You could maybe check this post - http://stackoverflow.com/questions/8694984/remove-part-of-string — Martin Rohwedder, Sep 11 '12 at 09:29
@Rohwedder This does not work if my text ends with a URL, because I don't have the index of the URL. — NLP JAVA, Sep 11 '12 at 09:32
@Philipp I have a string like #AssamRiots: Situation calm in Dhubri; curfew relaxed for 2 hours - Daily Bhaskar http://t.co/ocq6RNFI — NLP JAVA, Sep 11 '12 at 09:36

score 22 · Accepted Answer · edited Jun 19 '14 at 12:09

Pass in the String that contains the URL:

private String removeUrl(String commentstr)
    {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        int i = 0;
        while (m.find()) {
            commentstr = commentstr.replaceAll(m.group(i),"").trim();
            i++;
        }
        return commentstr;
    }

edited Jun 19 '14 at 12:09 by Ev0oD · answered Oct 18 '12 at 09:02 by NLP JAVA

Thanks! Really great solution. – Can Uludağ Feb 06 '16 at 08:39
After 3 to 4 hours I realized that your code is not working. – Shubham Sharma Sep 14 '17 at 10:43

score 5 · Answer 2 · edited Mar 24 '17 at 18:26

Well, you haven't provided any info about your text, so with the assumption of your text looking like this: "Some text here http://www.example.com some text there", you can do this:

String yourText = "blah-blah";
String cleartext = yourText.replaceAll("http.*?\\s", " ");

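This will replace every sequence that starts with "http" and runs up to and including the next whitespace character. A URL at the very end of the text has no whitespace after it, so it will not be matched.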
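Since the pattern above needs a whitespace character after the URL, a URL that ends the text (the case raised in the comments on the question) is left in place. A minimal sketch of one way around that, assuming every URL starts with http:// or https:// and runs to the next whitespace; the class and method names are hypothetical:

```java
class TrailingUrlDemo {
    // Hypothetical helper: removes each run that starts with http:// or https://
    // and continues up to (but not including) the next whitespace or the end of the text.
    static String stripHttpUrls(String text) {
        return text.replaceAll("https?://\\S+", "").trim();
    }

    public static void main(String[] args) {
        String tweet = "Situation calm in Dhubri - Daily Bhaskar http://t.co/ocq6RNFI";
        // prints: Situation calm in Dhubri - Daily Bhaskar
        System.out.println(stripHttpUrls(tweet));
    }
}
```

Because `\\S+` stops at whitespace or at the end of the input, no trailing space is required. A URL in the middle of the text leaves the spaces on both sides behind, so you may want to collapse double spaces afterwards.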

You should read the Javadoc on String class. It will make things clear for you.

edited Mar 24 '17 at 18:26 by Favonius · answered Sep 11 '12 at 09:29 by svz

It must be `yourText.replaceAll("http.*?\\s", "");` – Jaec Nov 13 '16 at 20:22

score 4 · Answer 3 · answered Sep 11 '12 at 09:34

How do you define URL? You might not just want to filter http:// but also https:// and other protocols like ftp://, rss:// or custom protocols.

Maybe this regular expression would do the job:

[\S]+://[\S]+

Explanation:

one or more non-whitespaces
followed by the string "://"
followed by one or more non-whitespaces
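In Java the backslashes have to be doubled inside the string literal. The following is only a sketch of how the expression could be applied; the class name, constant name and method name are assumptions made for illustration:

```java
import java.util.regex.Pattern;

class SchemeAgnosticDemo {
    // The expression from this answer, written as a Java string literal.
    private static final Pattern ANY_SCHEME = Pattern.compile("\\S+://\\S+");

    // Hypothetical helper: removes everything that looks like scheme://rest.
    static String stripUrls(String text) {
        return ANY_SCHEME.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String tweet = "#AssamRiots: curfew relaxed for 2 hours - Daily Bhaskar http://t.co/ocq6RNFI";
        // prints: #AssamRiots: curfew relaxed for 2 hours - Daily Bhaskar
        System.out.println(stripUrls(tweet));
    }
}
```

One thing to keep in mind: the leading `\\S+` also swallows any non-whitespace characters glued to the front of the scheme, so in "(https://example.com" the opening parenthesis is removed together with the URL.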

answered Sep 11 '12 at 09:34 by Philipp

I have the string #AssamRiots: Situation calm in Dhubri; curfew relaxed for 2 hours - Daily Bhaskar http://t.co/ocq6RNFI – NLP JAVA Sep 11 '12 at 09:38
The regular expression I posted should also work when the URL is at the end of the message. When there are no whitespaces after the URL, it matches until the end of the message. At least it does on http://regexpal.com/ – Philipp Sep 11 '12 at 09:46
Why are you asking me when you went with the solution by svz? – Philipp Sep 11 '12 at 11:02

score 4 · Answer 4 · answered Jan 19 '16 at 18:28

Note that if your URL contains characters like & and \ then the answers above will not work, because replaceAll compiles its first argument as a regular expression instead of treating it as literal text. What worked for me was to strip the problem characters (? and & in the code below) from a copy of the string, strip them from each match returned by m.find() as well, and then call replaceAll on the copy.

private String removeUrl(String commentstr)
{
    // rid of ? and & in urls since replaceAll can't deal with them
    // (note: this also strips ? and & from the rest of the text)
    String commentstr1 = commentstr.replaceAll("\\?", "").replaceAll("\\&", "");

    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    while (m.find()) {
        // group(0) is the whole matched URL; strip the same characters so it can be found in commentstr1
        String url = m.group(0).replaceAll("\\?", "").replaceAll("\\&", "");
        // keep the result in commentstr1 so that every URL is removed, not only the last one
        commentstr1 = commentstr1.replaceAll(url, "").trim();
    }
    return commentstr1;
}

Easily call `replace` instead of multiple `replaceAll`. – Tooraj Jam Jun 06 '22 at 08:20

score 1 · Answer 5 · answered Sep 09 '18 at 13:27

As @Ev0oD mentioned, the code works perfectly, except for the following tweet I'm working on: RT @_Val83_: The cast of #ThorRagnarok playing "Ragnarok Paper Scissors" #TomHiddleston #MarkRuffalo (https://t.co /k9nYBu3QHu)

At the line where the token is removed: commentstr = commentstr.replaceAll(m.group(i),"").trim();

I got the following error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 22

where m.group(i) is https://t.co /k9nYBu3QHu)
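The exception comes from replaceAll compiling its first argument as a regular expression, so a matched token that ends in ")" is not a valid pattern. A minimal sketch of two ways to remove the token as literal text, using String.replace or Pattern.quote; the class and method names are hypothetical, and the token is copied from the tweet above:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LiteralReplaceDemo {
    // String.replace treats its first argument as plain text, not as a regex.
    static String removeWithReplace(String text, String token) {
        return text.replace(token, "").trim();
    }

    // Pattern.quote escapes the token so that replaceAll matches it literally.
    static String removeWithQuote(String text, String token) {
        return text.replaceAll(Pattern.quote(token), "").trim();
    }

    // Returns true when the token cannot be compiled as a regular expression.
    static boolean failsAsRegex(String token) {
        try {
            Pattern.compile(token);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String token = "https://t.co /k9nYBu3QHu)";
        String tweet = "#TomHiddleston #MarkRuffalo (" + token;
        System.out.println(failsAsRegex(token));            // prints: true
        System.out.println(removeWithReplace(tweet, token)); // prints: #TomHiddleston #MarkRuffalo (
    }
}
```

Either variant avoids the exception. Whether the matched token should include the closing parenthesis at all depends on the URL pattern, which this sketch does not change.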

Answer 6 · answered Jul 23 '15 at 03:38 · tick_tack_techie

m.group(0) should be replaced with an empty string rather than m.group(i) where i is incremented with every call to m.find() as mentioned in one of the answers above.

private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    StringBuffer sb = new StringBuffer(commentstr.length());
    while (m.find()) {
        // copy the text before this match, then substitute an empty string for the URL
        m.appendReplacement(sb, "");
    }
    // copy whatever follows the last match; without this the end of the text is lost
    m.appendTail(sb);
    return sb.toString();
}

score 0 · Answer 7 · answered Oct 21 '22 at 08:20

"Hello https://www.google.com/hello - visit us here!".replaceAll("((https?|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)", "");

evaluates to:

Hello  - visit us here!

Optionally add a space before 'https' and 'http' in the regex to strip the space before URL as well.
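A rough sketch of that last suggestion. To keep it readable it uses a simplified pattern (`https?://\\S+` with an optional leading whitespace character) in place of the longer character class above; that simplification, the class name and the method name are assumptions:

```java
class LeadingSpaceDemo {
    // Hypothetical helper: removes the URL together with one whitespace character
    // in front of it, if there is one, so no double space is left behind.
    static String stripUrlAndLeadingSpace(String text) {
        return text.replaceAll("\\s?https?://\\S+", "");
    }

    public static void main(String[] args) {
        String input = "Hello https://www.google.com/hello - visit us here!";
        // prints: Hello - visit us here!
        System.out.println(stripUrlAndLeadingSpace(input));
    }
}
```

Making the whitespace optional means a URL at the very start of the text is still removed.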

score -3 · Answer 8 · answered Sep 14 '17 at 10:59

If you can switch to Python, there is a simpler solution using this code:

import re
text = "<hello how are you ?> then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
# remove every run that starts with "ftp" and continues up to the next whitespace
result = re.sub(r"ftp\S+", "", text)
print(result)
