This is only for a small Android program I am messing with, so I only need to match one or two tags.

I have one HTML tag whose contents, "FC Cologne", I can extract with this code:

Pattern pattern = Pattern.compile("report\">(.*?)</a>",Pattern.MULTILINE);

Here is the HTML tag that I can get to work:

<a href="/match-menu/3405570/first-team/fc-cologne=report"> FC Cologne</a>

But I can't match this next tag. I don't know whether it is because of the space after the word "opposition", the quotes inside the HTML tag, or both, since neither is in the first tag.

This is the one I can't get to work:

<td class="bold opposition "> "Olympiacos" </td>

This is the code I am trying:

Pattern pattern = Pattern.compile("opposition \">(.*?)</td>",Pattern.MULTILINE);

I have tried replacing the space " " with an empty string "", and I have tried \s where the space is, but I get no match.

I would appreciate it if anyone could help me.

Farray
M_K
  • Could you clarify what the requirements of the regex are, please? – Tyler Crompton Sep 06 '11 at 21:33
  • Related: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Mat Sep 06 '11 at 21:34
  • @Tyler The requirement is to retrieve everything between the HTML tag < td class="bold opposition "> "Olympiacos" < /td> – M_K Sep 06 '11 at 21:41
  • Update your post with a small complete application we can C&P and test with. Might help solve the problem more precisely. – QuinnG Sep 06 '11 at 23:34
  • OK, I know how to get the pattern to work with just a normal string: by adding an extra `\"` before the quotes in `sString = "Olympiacos");` to get this `sString = "Olympiacos");`. I use `sString = sString.replaceAll("\"", "\\\"");` and that works for just the String. – M_K Sep 07 '11 at 01:29
  • But I need to do it with an InputStream and a ByteArrayBuffer (I'm doing it in Android), so now I have to figure that out, because using replaceAll() with the stream does not work. – M_K Sep 07 '11 at 01:33

2 Answers

Unless you have a typo in one of the two, the tag < /td> has a space after the < and the </td> in your regex doesn't.

Adding a space after the < in the regex made the match succeed in RegexBuddy.

Update: Seems the space is not in the tag the OP is working with.

In RegexBuddy I have the pattern (copied as a Java String)

"opposition \">(.*?)</td>"

which matches the html

< td class="bold opposition "> "Olympiacos"       </td>

giving a match of

opposition "> "Olympiacos"       </td>

and Group 1 of

 "Olympiacos"       <--Line ends there.
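
The same check can be reproduced in plain Java instead of RegexBuddy. This is only a minimal sketch: the class name `OppositionMatch` and the method `extract` are made up for illustration, and the input is assumed to be the tag exactly as posted, without the extra space after the <. The pattern itself is the one from the question; the only addition is a `trim()` to drop the padding around Group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OppositionMatch {
    // Pattern copied from the question (no flags needed for a single-line tag).
    static final Pattern OPPOSITION = Pattern.compile("opposition \">(.*?)</td>");

    // Hypothetical helper: returns the trimmed contents of the cell,
    // or null when the pattern does not match at all.
    static String extract(String html) {
        Matcher m = OPPOSITION.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Assumed input: the tag from the question, with trailing padding.
        String html = "<td class=\"bold opposition \"> \"Olympiacos\"       </td>";
        System.out.println(extract(html));
    }
}
```

Note that `Pattern.MULTILINE` only changes how `^` and `$` behave, so it has no effect on this pattern. If the cell contents can span several lines, `Pattern.DOTALL` is the flag that lets `.` match line terminators.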
QuinnG
  • That space was only there to get the tag to display properly in the Stack Overflow text editor. The only space in the tag is after opposition and before the quotes here `compile("opposition \">` – M_K Sep 06 '11 at 21:43
  • I agree with Nija, the regex works like this: `"opposition \">(.*?)<\s/td>"` See here: http://rubular.com/r/kZK5NR080L – morja Sep 06 '11 at 21:48
  • `Pattern pattern = Pattern.compile("opposition\\s\">(.*?)<\\s/td>",Pattern.MULTILINE);` I have this in my Java program and it still does not work. Should I be escaping the quotes around Olympiacos? – M_K Sep 06 '11 at 21:57
  • @Steven_M: Updated my answer with more details. – QuinnG Sep 06 '11 at 22:10
  • I think it's the whitespace inside the tag. I have got it to work at this link: http://rubular.com/r/aGjjOMAfHX, but it doesn't seem to work in Java. I have tried `"opposition\\s\">(\\s+.*?)"` – M_K Sep 06 '11 at 23:20
  • @Steven_M: I often shortcut the whitespace issue when I know the character count by using the `.` in the regex. – QuinnG Sep 06 '11 at 23:30
  • Yes, I have tried this. I have also just tried matching the string without any spaces, like this: `sString = "Olympiacos");` with `"opposition\">(.*?)<\\/td>`. This pattern should work, yes? – M_K Sep 06 '11 at 23:47
  • @Nija I think it might be the website blocking me from using the source, because the tags I can get don't have quotes around them and the ones I can't get do have quotes around them, which disappear when I copy and paste them from the source! So I have changed to another website and it works. Thanks for your help. `@morja` thanks for the link, very handy. – M_K Sep 07 '11 at 18:54

I believe this is what you're looking for:

<(\w+)\s*(?:\w+(?:=(?:'(?:[^']|(?<=\\)')*'|"(?:[^"]|(?<=\\)")*"))?\s*)*>(.*?)</\1\s*>

You will want to use the second group to get the contents of the tag (the first group is the tag name). Note that this does not work recursively: nested elements are captured whole in the second group, so you will need to reapply the regex to the second group of each match until there are no more matches.
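
The "reapply until nothing matches" step might look like the following in Java. Treat it as a rough sketch rather than a tested solution for real pages: the class name `TagContents`, the helper `innermost`, and the sample row are all assumptions made for illustration, and the only thing taken from this answer is the regex, escaped as a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagContents {
    // The regex from this answer, escaped for a Java string literal.
    // Group 1 is the tag name, group 2 is the contents of the tag.
    static final Pattern TAG = Pattern.compile(
            "<(\\w+)\\s*(?:\\w+(?:=(?:'(?:[^']|(?<=\\\\)')*'|\"(?:[^\"]|(?<=\\\\)\")*\"))?\\s*)*>(.*?)</\\1\\s*>");

    // Hypothetical helper: follows the first matching element inward,
    // re-matching on group 2 until no element is left, then trims the text.
    static String innermost(String html) {
        String current = html;
        Matcher m = TAG.matcher(current);
        while (m.find()) {
            current = m.group(2);
            m = TAG.matcher(current);
        }
        return current.trim();
    }

    public static void main(String[] args) {
        // Assumed sample: the cell from the question wrapped in a row.
        String html = "<tr><td class=\"bold opposition \"> \"Olympiacos\" </td></tr>";
        System.out.println(innermost(html));
    }
}
```

This only follows the first element found at each level, and `.` does not match line breaks unless `Pattern.DOTALL` is added, so markup spread over several lines would need that flag.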

Tyler Crompton