
I have an arraylist:

List<String> lines = new ArrayList<String>();

which contains the html of a webpage.

I built a second ArrayList, 'resList', which collects every line containing the search string "abcde" (case-insensitive), and printed the six matching lines of HTML to the console:

ArrayList<String> resList = new ArrayList<String>();
String searchString = "(?i).*abcde.*";
for (String curVal : lines) {
    if (curVal.matches(searchString)) {
        resList.add(curVal);
        System.out.println(curVal); // print each matching line
    }
}

OUTPUT

<span class="bl-title">   <a href="abcdefPHOBIA_00">ACRO -  abcdefPHOBIA_00</a>
<span class="bl-title">   <a href="abcdefPHOBIA_11">ACRO -  abcdefPHOBIA_11</a>
<span class="bl-title">   <a href="abcdefPHOBIA_22">ACRO -  abcdefPHOBIA_22</a>
<span class="bl-title">   <a href="abcdefPHOBIA_33">ACRO -  abcdefPHOBIA_33</a>
<span class="bl-title">   <a href="abcdefPHOBIA_44">ACRO -  abcdefPHOBIA_44</a>
<span class="bl-title">   <a href="abcdefPHOBIA_55">ACRO -  abcdefPHOBIA_55</a>

I would like to read all the strings:

abcdefPHOBIA_00, abcdefPHOBIA_11, abcdefPHOBIA_22, abcdefPHOBIA_33, abcdefPHOBIA_44, abcdefPHOBIA_55

into an arrayList.

I tried split(" - ") followed by startsWith(), but that did not give exactly what I want. I also tried a Pattern with a regex but could not make much progress.

It would help to know which approach is the most beneficial for improving in the long term, and which will also get this thing done!

Apologies in advance if the question isn't detailed enough.

afsantos
neoslov
  • 9
    Use a proper HTML parser like [jsoup](http://jsoup.org) instead. – Luiggi Mendoza Jan 16 '14 at 16:23
  • What about `String.contains(text)` method? – Edwin Dalorzo Jan 16 '14 at 16:24
  • 2
    What makes you think the HTML will be nicely formatted with newlines where you expect them to be? If the HTML is generated (as it appears) it could be all on one "line". Any line-based attempt to parse HTML will fail at some point. As @LuiggiMendoza says, use a real parser. – Jim Garrison Jan 16 '14 at 16:24
  • 1
    @LuiggiMendoza I was using jsoup, perhaps i should use it again but i had some issues with https:// – neoslov Jan 16 '14 at 16:31
  • 1
    @neoslov [This question](https://stackoverflow.com/questions/7744075/how-to-connect-via-https-using-jsoup) can probably help you with jsoup and HTTPS connections. – ajp15243 Jan 16 '14 at 16:33
  • @EdwinDalorzo I was using contains but i wanted it to look better for viewing here. – neoslov Jan 16 '14 at 16:34
  • @ajp15243 Cheers, will have a look at it. i found a work around which has security issues but security is not an issue for me – neoslov Jan 16 '14 at 16:35
  • 1
    You mentioned regex and html in one post. Now I must show you this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Tim B Jan 16 '14 at 16:40
  • 2
    @neoslov Well, either using your workaround or (I would recommend) the accepted answer from that linked question to get the SSL cert available to Java, I think the answer here is to implement the solution with jsoup rather than regexes or string manipulation. [Don't let Tony the Pony get you](http://stackoverflow.com/a/1732454/1883647). – ajp15243 Jan 16 '14 at 16:40
  • stupid question but how does a person mark an answer as an accepted answer – neoslov Jan 16 '14 at 16:43
  • 1
    Click on the greyed out tick and if it is your question it becomes a green tick – Tim B Jan 16 '14 at 16:44
  • 2
    @neoslov None of these are answers, they are comments. You can post an answer yourself and then mark it as accepted, which will mark your question as solved. This behavior is encouraged if you have the answer. – ajp15243 Jan 16 '14 at 16:45
  • 1
    None of these comments go into enough detail for us to consider them a full answer worthy of upvote/acceptance - so we posted them as comments not answers. :) – Tim B Jan 16 '14 at 16:46
  • 1
    @ajp15243 cheers, will have a look at those links and have discovered regex and html is not a marriage made in heaven! – neoslov Jan 16 '14 at 16:49
  • @neoslov More like one made in the 9th circle of Hell ;) – ajp15243 Jan 16 '14 at 16:52
  • 1
    Well, read the second answer too. It's not quite as black and white as the first answer made it seem - but you do need to be very aware of the limitations. – Tim B Jan 16 '14 at 16:57
  • @TimB thanks, will give an upvote one of these days then! Though I think some of the answers are acceptable, as I was looking more for what to do and what not to do in a general sense – neoslov Jan 16 '14 at 17:01

2 Answers


Try:

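import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Capture any double-quoted value that starts with "abcde"
Pattern pattern = Pattern.compile("\"(abcde[^\"]*)\"");
for (String curVal : lines)
{
    Matcher matcher = pattern.matcher(curVal);
    // find() picks up every occurrence on the line, not just a whole-line match
    while (matcher.find())
    {
        resList.add(matcher.group(1)); // group 1 is the text between the quotes
    }
}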

This will find all strings of the form abcde.* that are wrapped in double quotes. In the sample output those are the href attribute values.
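
To try the pattern in isolation, here is a minimal self-contained sketch that runs it over two of the sample lines from the question. The class name QuotedValueExtractor and the method extractQuoted are made up for illustration; only the Pattern and Matcher usage comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here for illustration only.
class QuotedValueExtractor {

    // Same pattern as in the answer: a double-quoted value starting with "abcde".
    private static final Pattern QUOTED = Pattern.compile("\"(abcde[^\"]*)\"");

    // Returns every captured value, in the order it appears in the input lines.
    static List<String> extractQuoted(List<String> lines) {
        List<String> result = new ArrayList<String>();
        for (String line : lines) {
            Matcher matcher = QUOTED.matcher(line);
            while (matcher.find()) {
                result.add(matcher.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<span class=\"bl-title\">   <a href=\"abcdefPHOBIA_00\">ACRO -  abcdefPHOBIA_00</a>",
            "<span class=\"bl-title\">   <a href=\"abcdefPHOBIA_11\">ACRO -  abcdefPHOBIA_11</a>");
        // prints [abcdefPHOBIA_00, abcdefPHOBIA_11]
        System.out.println(extractQuoted(lines));
    }
}
```

Note that this pattern is case-sensitive, unlike the (?i) search string in the question. If case-insensitive matching is wanted, Pattern.CASE_INSENSITIVE can be passed as a second argument to Pattern.compile.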

Glenn Lane
  • Thanks, that was exactly the type of answer i was originally looking for. I reckon it's better to go down the jsoup route now in terms of making things less complicated going forwards. – neoslov Jan 17 '14 at 10:11

I used the jsoup API. I reckon it is an easier way to manipulate the data and not too much code!

This selects all the <a> tags that have an href and, among them, looks for link text matching "ACRO". It then takes the full text of the first match and splits it on " - " into a String array. After that, one can do whatever one likes with the array.

So, with the link text ACRO - abcdefPHOBIA_00:

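import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// get() throws IOException, so call this from a method that handles it
Document doc = Jsoup.connect("http://webpage.com").get();
Elements links = doc.select("a[href]");

// text of the first link whose text matches "ACRO"
String s = links.select("a:matches(ACRO)").first().text();
String[] str_arr = s.split(" - ");

// for example
System.out.println("before the - " + str_arr[0]);
System.out.println("after the - " + str_arr[1]);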

before the - ACRO

after the - abcdefPHOBIA_00
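
The snippet above reads only the first matching link, while the question asks for all six names in a list. The same split can be applied in a loop. The sketch below uses only the standard library and assumes the link texts have already been gathered into a List<String> (for example by calling text() on each matching jsoup element). The class name LinkTextSplitter and the method namesAfterDash are hypothetical. It splits with a limit of 2 and trims the result, in case extra spaces follow the dash as they do in the raw HTML shown in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, named here for illustration only.
class LinkTextSplitter {

    // Returns the part after the first " - " of each text; texts without a separator are skipped.
    static List<String> namesAfterDash(List<String> linkTexts) {
        List<String> names = new ArrayList<String>();
        for (String text : linkTexts) {
            String[] parts = text.split(" - ", 2); // limit 2: split at the first separator only
            if (parts.length == 2) {
                names.add(parts[1].trim()); // trim any extra space left after the dash
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> linkTexts = Arrays.asList(
            "ACRO -  abcdefPHOBIA_00",
            "ACRO -  abcdefPHOBIA_11");
        // prints [abcdefPHOBIA_00, abcdefPHOBIA_11]
        System.out.println(namesAfterDash(linkTexts));
    }
}
```

Skipping texts that lack the separator avoids the ArrayIndexOutOfBoundsException that indexing str_arr[1] directly would throw on such input.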

neoslov