
Hi, I'm having trouble parsing some data from a web page's source between two "tags".

Here's a sample of the page source and the code I'm using to try to parse it.

 <div class="ProfileTweet-contents">

      <p class="ProfileTweet-text js-tweet-text u-dir"

        dir="ltr">Come join us now! <a href="http://t.co/Kbhh2ed" rel="nofollow" dir="ltr" data-expanded-url="http://forum.epicurus-pk.com/" class="twitter-timeline-link" target="_blank" title="http://www.google.com" ><span class="tco-ellipsis"></span><span class="invisible">http://</span><span class="js-display-url">www.google.com</span><span class="invisible">/</span><span class="tco-ellipsis"><span class="invisible">&nbsp;</span></span></a> <a href="http://t.co/jIw2344dDZz" class="twitter-timeline-link u-isHiddenVisually" data-pre-embedded="true" dir="ltr" >pic.twitter.com/jIwtc23juZz</a></p>

Code

    while ((line = in.readLine()) != null) {
        Pattern pattern = Pattern.compile("dir=.?!<a href=");
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            tweets[0] = matcher.group();
            System.out.println(matcher.group());
        }
    }

The piece of data I'm trying to fetch is the following:

dir="ltr">Come join us now! <a href=

For some reason it's not fetching the data between dir= and <a href=.

Here is another example, which parses the web source just fine:

    URL addr = new URL(url);
    URLConnection con = addr.openConnection();
    ArrayList<String> data = new ArrayList<String>();
    BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        Pattern p = Pattern.compile("<span itemprop=.*?</span>");
        Pattern p2 = Pattern.compile(">.*?<");
        Matcher m = p.matcher(inputLine);
        Matcher m2;
        while (m.find()) {
            m2 = p2.matcher(m.group());
            while (m2.find()) {
                data.add(m2.group().replaceAll("<", "").replaceAll(">", "").replaceAll("&", "").replaceAll("#", "").replaceAll(";", "").replaceAll("3",""));
            }
        }
    }
    in.close();
    addr = null;
    con = null;

Edit: Sorry, I've just realised I was using a different regex from the one in my other code example. This pattern:

(dir=).*?(<a href=)

works fine.
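For anyone landing here, a minimal sketch of the lazy pattern in use, assuming dir="ltr"> and the tweet text arrive on the same line, as in the sample above. The class name TweetTextExtractor, the extract method and the capture group around the text are illustrative additions, not part of my original code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from the original code.
class TweetTextExtractor {
    // Compiled once instead of on every loop iteration.
    // ".*?" is lazy, so the match stops at the first "<a href=".
    // The group captures only the text between dir="ltr"> and <a href=.
    private static final Pattern TWEET_TEXT =
            Pattern.compile("dir=\"ltr\">(.*?)<a href=");

    // Returns the trimmed text of the first match, or empty if the line has none.
    static Optional<String> extract(String line) {
        Matcher m = TWEET_TEXT.matcher(line);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "dir=\"ltr\">Come join us now! "
                + "<a href=\"http://t.co/Kbhh2ed\" rel=\"nofollow\" dir=\"ltr\">link</a>";
        System.out.println(extract(line).orElse("(no match)"));
    }
}
```

Using group(1) rather than group() avoids having to strip the dir= and <a href= delimiters afterwards.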

kbz

2 Answers


The short answer: use an XML parser. If the HTML is mangled, use an HTML parser that will try to make sense of the madness. As a bonus, read this post:

RegEx match open tags except XHTML self-contained tags
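
As a rough sketch of the parser route, here is the JDK's built-in DOM parser applied to a simplified version of the fragment. It assumes the input is well-formed XML, which real pages often are not (an entity such as &nbsp; would already make this parser fail), so treat it as an illustration and reach for a tolerant HTML parser in practice. The names TweetXmlExample and leadingText are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; assumes well-formed XML input.
class TweetXmlExample {
    // Returns the text that precedes the first child element of the first <p>,
    // or an empty string if there is no <p> or it does not start with text.
    static String leadingText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Node p = doc.getElementsByTagName("p").item(0);
            if (p == null) {
                return "";
            }
            Node first = p.getFirstChild();
            return first != null && first.getNodeType() == Node.TEXT_NODE
                    ? first.getNodeValue().trim()
                    : "";
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed markup ends up here; a real page may well trigger this.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div class=\"ProfileTweet-contents\">"
                + "<p class=\"ProfileTweet-text\" dir=\"ltr\">Come join us now! "
                + "<a href=\"http://t.co/Kbhh2ed\">www.google.com</a></p></div>";
        System.out.println(leadingText(xml));
    }
}
```

The parser hands you the text node directly, so there are no delimiters to strip and attribute order no longer matters.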

omu_negru

You're probably looking for a pattern such as:

(dir=\".+\">.+<a\\shref=).+rel

Your original pattern doesn't work for two reasons. It doesn't account for the characters that actually appear in the input, such as ", and it misuses .? in front of the !. The .? matches at most one character and the ! is a literal exclamation mark, so the pattern can never capture the text between dir= and <a href=.

Here's a working example of the pattern above:

http://ideone.com/wbH9O6
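
To see the difference locally, here is a small sketch that runs three patterns against a shortened version of the sample line: the original one, a lazy .*? variant and a greedy .* variant. The class name PatternComparison, the firstMatch helper and the shortened input are assumptions made for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the input is a shortened stand-in for the real line.
class PatternComparison {
    // Returns the first match of regex in input, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "dir=\"ltr\">Come join us now! "
                + "<a href=\"http://t.co/Kbhh2ed\" rel=\"nofollow\">a</a> "
                + "<a href=\"http://t.co/jIw2344dDZz\">b</a>";

        // Original: ".?" allows at most one character, then a literal "!" must
        // follow immediately, which never happens right after "dir=".
        System.out.println(firstMatch("dir=.?!<a href=", line));

        // Lazy: stops at the first "<a href=".
        System.out.println(firstMatch("dir=.*?<a href=", line));

        // Greedy: runs on to the last "<a href=" on the line.
        System.out.println(firstMatch("dir=.*<a href=", line));
    }
}
```

The first call finds nothing, while the lazy and greedy variants both match but end at different anchors, which matters on lines with more than one link.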

l'L'l