The user may enter text, for example:

This is some text, visit www.mysite.com. Thanks & bye.

The URL should be found and turned into a link, for display in a website. All other characters should appear as-is.

I have been searching and googling for some time. I'm sure this sort of thing must already exist. My temptation is to program this myself but I'm sure this is more complex than it looks.

  • Dots can be part of a URL, or can be a sentence terminator as above. I think users have the expectation that this will be handled properly; Outlook handles this correctly, for example.
  • There are various different protocols such as http:, https: etc., plus links are often entered without a protocol specifier, as above.
  • The output has to be HTML (so that the <a ...> tag can be inserted), so characters such as & must first be replaced with &amp;. However, some URLs contain & themselves (e.g. xyz.cgi?a=b&c=d), and there we only want &amp; in the displayed part of the URL, not in the link itself (<a href="...&...">...&amp;...</a>).

I'm sure there are other issues that I will encounter as soon as I attempt to program this myself. I don't think that a simple reg-exp is the way forward.

Is there any library which already does this, ideally for Java? (If it's in another technology maybe I can take a look at it and convert it to Java)

Adrian Smith

2 Answers

While you are right that this is a common problem, it is also one that isn't satisfactorily solved anywhere, nor can it be. URIs written without markup in free text like this can be ambiguous: see http://en.wikisource.org/wiki/1911_Encyclop%C3%A6dia_Britannica/Aga_Khan_I. for example. How would you know that the final '.' is part of the URI and not an end-of-sentence full stop? For an introduction to the problem, and quite an informative discussion in the comments, have a look at "the problem with urls". At the end of the day you can provide a best effort, such as matching protocols and looking for valid top-level domains (of which there are a lot more than you might think at first), but there will always be things slipping through the net.

As pseudo-code, I'd start off with something along these lines:

process() {
    List<String> looksLikeUri = getMatches(1orMoreValidUriCharacters + "\\." + 1orMoreValidUriCharacters);
    removeUrisWithInvalidTopLevelDomains(looksLikeUri);
    trimCharactersUnlikelyToBeInUris(looksLikeUri);
    guessProtocolIfNotPresent(looksLikeUri);
}

removeUrisWithInvalidTopLevelDomains() // Use a list of valid ones or limit it to something like 1-6 characters.

trimCharactersUnlikelyToBeInUris() // ,.:;? (at the very end) '(' at start ')' at end unless a starting one was in URI.

guessProtocolIfNotPresent() // Usually http unless string starts with something obvious like "ftp" or already has a protocol.
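
The pseudo-code above could be turned into runnable Java roughly as follows. This is only a minimal sketch under stated assumptions: the class name UriGuesser, the tiny KNOWN_TLDS set, the "ftp." hostname heuristic and the exact regular expressions are all made up for illustration, and a real list of top-level domains would be far longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UriGuesser {
    // Assumption: a tiny illustrative list, not the real set of top-level domains.
    private static final Set<String> KNOWN_TLDS =
            new HashSet<>(Arrays.asList("com", "org", "net", "edu", "de", "uk"));

    // One or more non-space characters, a dot, one or more non-space characters.
    private static final Pattern CANDIDATE = Pattern.compile("\\S+\\.\\S+");

    // Optional protocol, then the host part (captured) up to the first / : ? or #.
    private static final Pattern HOST =
            Pattern.compile("^(?:[A-Za-z][A-Za-z0-9+.-]*://)?([^/:?#]+)");

    public static List<String> process(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String candidate = trimCharactersUnlikelyToBeInUris(m.group());
            if (hasKnownTopLevelDomain(candidate)) {
                result.add(guessProtocolIfNotPresent(candidate));
            }
        }
        return result;
    }

    static String trimCharactersUnlikelyToBeInUris(String s) {
        // A leading '(' most likely wraps the URI rather than belonging to it.
        if (s.startsWith("(")) {
            s = s.substring(1);
        }
        // Strip trailing , . : ; ? and a trailing ')' unless the URI itself contains '('.
        while (!s.isEmpty()) {
            char last = s.charAt(s.length() - 1);
            boolean unlikely = ",.:;?".indexOf(last) >= 0
                    || (last == ')' && s.indexOf('(') < 0);
            if (!unlikely) {
                break;
            }
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    static boolean hasKnownTopLevelDomain(String s) {
        Matcher m = HOST.matcher(s);
        if (!m.find()) {
            return false;
        }
        String host = m.group(1);
        int dot = host.lastIndexOf('.');
        return dot > 0
                && KNOWN_TLDS.contains(host.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    static String guessProtocolIfNotPresent(String s) {
        if (s.matches("^[A-Za-z][A-Za-z0-9+.-]*://.*")) {
            return s; // already has a protocol
        }
        if (s.toLowerCase(Locale.ROOT).startsWith("ftp.")) {
            return "ftp://" + s; // assumption: an "ftp." host name hints at FTP
        }
        return "http://" + s;
    }
}
```

Calling UriGuesser.process on the example sentence from the question should yield a single entry, http://www.mysite.com, with the sentence-ending dot trimmed. Candidates such as "e.g." are dropped only because "g" is not in the list, which shows how much the result depends on the quality of that list.
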
Vala
  • Thanks! Yes I imagined it was not possible to solve in all cases; thanks for the illustrative example. But I figured the world might have reached a consensus on a "best effort"; as opposed to each piece of software implementing its own "best effort" according to its programmers' beliefs as to what should/should not constitute a URL. Thanks also for the useful information that it hasn't been satisfactorily solved, that makes me feel a bit better about not having found anything yet... – Adrian Smith Nov 07 '11 at 11:34

It would probably be fully solvable if every URL in the text included its protocol (such as http://). Because this is not the case, any "word" that contains a . character can potentially be a URL (for example mysite.com), and moreover you cannot be sure of the actual protocol (you can only assume one).

If you assume that the user will always be online, you can write a method that takes all potential URLs, checks whether each URL exists, and produces an HTML link only if it does.

I wrote this code snippet:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.*;


public class JavaURLHighlighter
{
    // A whitespace-delimited token with a dot somewhere inside it. The lookarounds
    // cover start, middle and end of the text in one pattern and do not consume the
    // separating whitespace, so two candidates next to each other are both found.
    Pattern potentialURL = Pattern.compile("(?<!\\S)\\S+\\.\\S+(?!\\S)");
    private String urlString;
    ArrayList<String> matchesList = new ArrayList<String>();

    public String getUrlString() {
        return urlString;
    }

    public void setUrlString(String urlString) {
        this.urlString = urlString;
    }

    public void getConvertedMatches()
    {
        matchesList.clear();
        Matcher matcher = potentialURL.matcher(urlString);
        while (matcher.find())
        {
            String match = matcher.group();
            // Assume plain HTTP when the user typed no protocol.
            if (!match.startsWith("http://") && !match.startsWith("https://")) match = "http://" + match;
            // A trailing dot is most likely the end of the sentence.
            if (match.endsWith(".")) match = match.substring(0, match.length() - 1);
            // Keep only candidates that actually answer an HTTP request.
            if (urlExists(match)) matchesList.add(match);
        }

        for (String match : matchesList) System.out.println(match);
    }

    public static boolean urlExists(String urlAddress)
    {
        try
        {
          HttpURLConnection.setFollowRedirects(false);
          HttpURLConnection connection = (HttpURLConnection) new URL(urlAddress).openConnection();
          connection.setRequestMethod("HEAD");
          return (connection.getResponseCode() == HttpURLConnection.HTTP_OK);
        }
        catch (Exception e)  {return false;  }
    }

    public static void main(String[] args)
    {
        JavaURLHighlighter hg = new JavaURLHighlighter();

        hg.setUrlString("This is some text, visit www.mysite.com. Thanks & bye.");
        hg.getConvertedMatches();

        hg.setUrlString("This is some text, visit www.nonexistingmysite.com. Thanks & bye.");
        hg.getConvertedMatches();
    }

}

It's not an actual solution to your problem and I wrote it quickly, so it might not be completely correct, but it should guide you a bit. Here I just print the matches. Have a look at "Java equivalent to PHP's preg_replace_callback" for a regexp replacing function with which you could wrap all modified matches in <a href> tags. With the information provided you should be able to write what you want, though possibly without 100% reliable detection.
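
A minimal sketch of that replacement step, assuming the same kind of whitespace-delimited pattern and leaving out the online existence check so that it stays self-contained. The class name Linkifier and the helper escapeHtml are hypothetical. Text between matches is HTML-escaped, and so is the href value; as far as I know HTML expects & to be written as &amp; inside attribute values as well, and the browser decodes it back before following the link.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Linkifier {
    // A whitespace-delimited token with a dot somewhere inside it.
    private static final Pattern CANDIDATE =
            Pattern.compile("(?<!\\S)\\S+\\.\\S+(?!\\S)");

    // Escapes the characters that are special in HTML text and attribute values.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static String linkify(String text) {
        StringBuilder html = new StringBuilder();
        int last = 0; // end of the previously emitted link in the input
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String shown = m.group();
            int end = m.end();
            // Treat one trailing dot as the end of the sentence, not part of the URL.
            if (shown.endsWith(".")) {
                shown = shown.substring(0, shown.length() - 1);
                end--;
            }
            // Assume plain HTTP when no protocol was typed.
            String href = (shown.startsWith("http://") || shown.startsWith("https://"))
                    ? shown
                    : "http://" + shown;
            // Plain text before the match, escaped.
            html.append(escapeHtml(text.substring(last, m.start())));
            html.append("<a href=\"").append(escapeHtml(href)).append("\">")
                .append(escapeHtml(shown)).append("</a>");
            last = end;
        }
        // Remaining plain text after the last match, escaped.
        html.append(escapeHtml(text.substring(last)));
        return html.toString();
    }
}
```

For the example sentence, Linkifier.linkify should leave the final dot and the escaped "Thanks &amp; bye." outside the link. In real use the urlExists check (or a top-level-domain check) would go in front of the append calls, so that tokens which merely contain a dot are emitted as plain text.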

MOleYArd
  • An interesting approach. But I feel I shouldn't re-invent the wheel here, this must be a common problem, e.g. all mail clients can highlight links in plain/text incoming emails. But yet, my research did not yield any libraries which solve this problem.. – Adrian Smith Nov 07 '11 at 11:11