0

I'm building my first RSS reader in Java on Android, and I have a problem.

I can fetch the RSS data, but the content is full of HTML tags.

I need to extract the text inside these tags and put it into a list of strings, but I don't know how to do that.

Can you help me with this?

Thanks in advance.

mfrachet

3 Answers

1

Assuming you have HTML content in a string called htmlString, you can strip the tags with a regular expression.

String htmlString = "<tr><td>12345</td></tr>";
String noHTMLString = htmlString.replaceAll("\\<.*?>","");
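
To get a list of strings rather than a single one, the same replacement can be applied to each fragment in turn. The following is a minimal sketch, assuming the HTML arrives as several separate strings; the class name HtmlStripper and the methods stripTags and stripAll are made up for illustration, and the pattern is written as "<[^>]*>", which matches the same kind of tags as the one above.

```java
import java.util.ArrayList;
import java.util.List;

class HtmlStripper {

    // Removes anything that looks like a tag: '<', any run of non-'>' characters, then '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Applies stripTags to every fragment and returns the cleaned strings in the same order.
    static List<String> stripAll(List<String> fragments) {
        List<String> cleaned = new ArrayList<String>();
        for (String fragment : fragments) {
            cleaned.add(stripTags(fragment));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<String> fragments = new ArrayList<String>();
        fragments.add("<tr><td>12345</td></tr>");
        fragments.add("<p>Hello <b>world</b></p>");
        System.out.println(stripAll(fragments)); // prints [12345, Hello world]
    }
}
```

A regular expression like this is fine for simple feed snippets, but it does not understand comments, scripts or a '>' inside an attribute value.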
Mark Lee
  • This works great, but what I need is a list of strings :-). Thanks for your help – mfrachet Apr 02 '15 at 09:18
  • Are you receiving your HTML data as one single string or as multiple strings? If you can get your input HTML in the form of multiple strings, then you can apply @Mark Lee's answer in a loop. Something like `for (String s : htmlSourceList) { // call a method based on @Mark Lee's solution here }` – alainlompo Apr 02 '15 at 09:28
1

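This should extract the contents between HTML tags into the list called matches. You should modify the character class in square brackets to match your content: the current version only matches text made of word characters (letters, digits, underscores), whitespace, dots, commas, hyphens and parentheses. Note that the pattern only recognises tags without attributes, such as <td>, because the opening tag is matched with <\\w+>.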

Pattern pattern = Pattern.compile("<\\w+>([\\w\\s\\.,\\-\\(\\)]+)</\\w+>");
Matcher matcher = pattern.matcher(content);

List<String> matches = new ArrayList<String>();
while(matcher.find()){
    matches.add(matcher.group(1));
}
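
If the text can contain arbitrary characters, or the tags carry attributes, a different approach is to split the input on the tags and keep the non-empty pieces. This is a sketch of that variation, not the pattern from this answer; the class name TagTextExtractor and the method extractTexts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TagTextExtractor {

    // Splits the input wherever a tag occurs and collects the trimmed, non-empty text pieces.
    static List<String> extractTexts(String html) {
        List<String> texts = new ArrayList<String>();
        for (String piece : html.split("<[^>]*>")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                texts.add(trimmed);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String content = "<p class=\"intro\">Hello <b>world</b></p> tail";
        System.out.println(extractTexts(content)); // prints [Hello, world, tail]
    }
}
```

Unlike the Matcher version, this also keeps text that sits outside any pair of tags, which may or may not be what you want.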
annkitkat
1

If your RSS feed is in XML format, you can parse it with dom4j (add dom4j.jar to your project).

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class test {

    public static void main(String[] args) throws Exception {
        String rssUrl = ""; // paste url here
        List<RssDocument> docList = new ArrayList<RssDocument>();
        try
        {
            SAXReader saxReader = new SAXReader();
            Document document = saxReader.read(rssUrl);
            Element channel = (Element) document.getRootElement().element("channel");
            for (Iterator i = channel.elementIterator("item"); i.hasNext();)
            {
                Element element = (Element) i.next();
                String title = element.elementText("title");
                String pubDate = element.elementText("pubDate");
                String description = element.elementText("description");
                RssDocument doc = new RssDocument(title, pubDate, description);
                docList.add(doc);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        // do something with docList
    }

    public static class RssDocument {
        String title;
        String pubDate;
        String description;

        RssDocument(String title, String pubDate, String description) {
            this.title = title;
            this.pubDate = pubDate;
            this.description = description;
        }
    }
}

Paste your RSS URL into the variable "rssUrl" and run main. You will get a list of RssDocument objects, each containing a title, publication date and description.


If all you need is the title and description of every RSS item, use the following code.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class test {

    public static void main(String[] args) throws Exception {
        String rssUrl = ""; // paste url here
        List<String> strList = new ArrayList<String>();
        try
        {
            SAXReader saxReader = new SAXReader();
            Document document = saxReader.read(rssUrl);
            Element channel = (Element) document.getRootElement().element("channel");
            for (Iterator i = channel.elementIterator("item"); i.hasNext();)
            {
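                Element element = (Element) i.next();
                // elementText returns null when the child element is missing
                String title = element.elementText("title");
                String description = element.elementText("description");
                title = (title == null) ? "" : title.replaceAll("\\<.*?>","");
                description = (description == null) ? "" : description.replaceAll("\\<.*?>","");
                strList.add(title + " " + description);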
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

}

strList will then be a list of strings, each containing the title and description of one item.

For example:

{
 "title1 description1"
 "title2 description2"
 "title3 description3"
}
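
If adding dom4j.jar is not an option, the same idea can be sketched with the XML parser that ships with the JDK (javax.xml.parsers and org.w3c.dom). This is an alternative to the dom4j code above, not a drop-in replacement; the class name RssItemTexts, its methods and the sample feed are made up for illustration, and the feed is passed in as a string rather than downloaded.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssItemTexts {

    // Returns the text of the first descendant with the given tag name, or "" if there is none.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    // Builds one "title description" string per <item>, with HTML tags stripped.
    static List<String> titlesAndDescriptions(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(rssXml)));
        NodeList items = document.getElementsByTagName("item");
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = childText(item, "title").replaceAll("<[^>]*>", "");
            String description = childText(item, "description").replaceAll("<[^>]*>", "");
            result.add(title + " " + description);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical two-item feed: one escaped description, one in a CDATA section.
        String sample = "<rss><channel>"
                + "<item><title>title1</title><description>&lt;p&gt;description1&lt;/p&gt;</description></item>"
                + "<item><title>title2</title><description><![CDATA[<b>description2</b>]]></description></item>"
                + "</channel></rss>";
        System.out.println(titlesAndDescriptions(sample)); // prints [title1 description1, title2 description2]
    }
}
```

On Android the download has to happen off the main thread, so fetching the feed first and handing the resulting string to a method like this keeps the parsing separate from the networking.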
Mark Lee