Hi, I'm having trouble parsing some data from a web source between two "tags".
Here's a sample of the web source and the code I'm using to try to parse it.
<div class="ProfileTweet-contents">
<p class="ProfileTweet-text js-tweet-text u-dir"
dir="ltr">Come join us now! <a href="http://t.co/Kbhh2ed" rel="nofollow" dir="ltr" data-expanded-url="http://forum.epicurus-pk.com/" class="twitter-timeline-link" target="_blank" title="http://www.google.com" ><span class="tco-ellipsis"></span><span class="invisible">http://</span><span class="js-display-url">www.google.com</span><span class="invisible">/</span><span class="tco-ellipsis"><span class="invisible"> </span></span></a> <a href="http://t.co/jIw2344dDZz" class="twitter-timeline-link u-isHiddenVisually" data-pre-embedded="true" dir="ltr" >pic.twitter.com/jIwtc23juZz</a></p>
Code
while ((line = in.readLine()) != null) {
    Pattern pattern = Pattern.compile("dir=.?!<a href=");
    Matcher matcher = pattern.matcher(line);
    while (matcher.find()) {
        tweets[0] = matcher.group();
        System.out.println(matcher.group());
    }
}
The piece of data I'm trying to fetch is the following:
dir="ltr">Come join us now! <a href=
For some reason it's not matching the text between dir= and <a href=.
Here is another example that works and parses the web source just fine:
URL addr = new URL(url);
URLConnection con = addr.openConnection();
ArrayList<String> data = new ArrayList<String>();
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    Pattern p = Pattern.compile("<span itemprop=.*?</span>");
    Pattern p2 = Pattern.compile(">.*?<");
    Matcher m = p.matcher(inputLine);
    Matcher m2;
    while (m.find()) {
        m2 = p2.matcher(m.group());
        while (m2.find()) {
            data.add(m2.group().replaceAll("<", "").replaceAll(">", "").replaceAll("&", "").replaceAll("#", "").replaceAll(";", "").replaceAll("3",""));
        }
    }
}
in.close();
addr = null;
con = null;
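Edit: Sorry, I've just realised I was using a different regex from the one in my other code example. The failing pattern had .?! (at most one character followed by a literal !) where the working one has the lazy .*? quantifier.
(dir=).*?(<a href=)
works fine.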
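For anyone who lands here wanting only the text itself rather than the whole match, here is a minimal sketch of the same idea with a capture group around the lazy part. The class name TweetTextExtractor and the extract helper are hypothetical names made up for illustration, the sample line is shortened from the source above, and it assumes the tweet text and the following <a href= sit on the same line (as they do when reading with readLine()).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TweetTextExtractor {
    // Compiled once and reused. The lazy (.*?) stops at the first
    // "<a href=" that follows dir="ltr">, and the parentheses capture
    // only the text in between.
    private static final Pattern TWEET_TEXT =
            Pattern.compile("dir=\"ltr\">(.*?)<a href=");

    // Hypothetical helper: returns the captured text, or null if the
    // line does not contain the pattern.
    static String extract(String line) {
        Matcher m = TWEET_TEXT.matcher(line);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Shortened version of the sample line from the question.
        String line = "dir=\"ltr\">Come join us now! "
                + "<a href=\"http://t.co/Kbhh2ed\" rel=\"nofollow\">link</a></p>";

        // The original pattern: ".?" is at most one character and "!" is
        // a literal exclamation mark, so it cannot span the tweet text.
        boolean oldMatches = Pattern.compile("dir=.?!<a href=").matcher(line).find();
        System.out.println("original pattern matches: " + oldMatches);

        System.out.println("extracted: " + extract(line));
    }
}
```

Using group(1) avoids having to strip the dir= and <a href= delimiters afterwards. For anything beyond a quick scrape, an HTML parser is usually more robust than a regex, since attributes like dir= can appear more than once on a line.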