0

I have text stored in a database and I want to extract the URLs that the text contains. How can I extract the URLs from the text using Java code? For example, I have the following text in my db: "The dress-a-likes! Try to look normal and this is what happens. @ Bar Louie http://t.co/sNVcoqT0Bc". How can I extract the link http://t.co/sNVcoqT0Bc?

    Pattern p = Pattern.compile("http://.*|www\\..*");
    DBCursor cursor = coll.find(query);
    while (cursor.hasNext()) {
        System.out.println(cursor.next().get("text"));

        Matcher m = p.matcher("http://...");
    }

How can I run the matcher against cursor.next().get("text")? That call returns an Object, while the matcher expects a String. How can I convert that object to a String?

Jose Ramon

2 Answers

4

I would locate where "http://" starts and then take the whole string from there until the end.

Use: int indexOf(String str)

If there is a possibility of something else following the URL, locate the next space with a second indexOf() call.

For that, use: int indexOf(String str, int fromIndex), where fromIndex is the index found in the first step.

Then take the substring between the two indexes.

Use: String substring(int beginIndex, int endIndex)
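The steps above could be sketched roughly like this. The class name IndexOfUrlExample and the method name firstUrl are made up for illustration, and the sketch assumes the URL starts with "http://" and ends at the next space or at the end of the text:

```java
final class IndexOfUrlExample {

    // Returns the first "http://" URL in text, or null if there is none.
    // Assumption: a URL ends at the next space or at the end of the text.
    static String firstUrl(String text) {
        int start = text.indexOf("http://");
        if (start < 0) {
            return null; // no URL in this text
        }
        // Look for the space that ends the URL, searching from the URL's start.
        int end = text.indexOf(" ", start);
        if (end < 0) {
            end = text.length(); // the URL runs to the end of the text
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "The dress-a-likes! Try to look normal and this is what happens. "
                + "@ Bar Louie http://t.co/sNVcoqT0Bc";
        System.out.println(firstUrl(text)); // prints http://t.co/sNVcoqT0Bc
    }
}
```

This only finds the first link and ignores "https://" and "www." prefixes, so it fits texts like the one in the question better than arbitrary input.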

MuGiK
3

Try using ANTLR to parse your file. Create a simple grammar that extracts only the links. A link ends at the first space character " ". This will parse the whole file and return all the URLs (if there is more than one).
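For short texts, the same rule (a link ends at the first whitespace, and there may be several links) can probably be tried without ANTLR. The sketch below is not an ANTLR grammar; it swaps in java.util.regex, along the lines of the pattern in the question. The names RegexUrlExample and allUrls are made up for illustration, and it assumes the value read from the database is either null or something whose toString() gives the text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexUrlExample {

    // A link starts with http://, https:// or www. and runs up to the next whitespace.
    private static final Pattern URL = Pattern.compile("(?:https?://|www\\.)\\S+");

    // Accepts an Object because a database field lookup may return Object;
    // the value is converted to a String before matching.
    static List<String> allUrls(Object value) {
        List<String> urls = new ArrayList<>();
        if (value == null) {
            return urls; // field missing: nothing to extract
        }
        Matcher m = URL.matcher(value.toString());
        while (m.find()) {
            urls.add(m.group()); // each match is one whole link
        }
        return urls;
    }
}
```

With the loop from the question, this would be called as allUrls(cursor.next().get("text")). Trailing punctuation such as a full stop directly after a link would be included in the match, so the pattern may need tightening for real data.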

Narayana