I need to extract the author from the text using a regex. I also need the index of every tag and author. I tried a few parsers, but none of them preserves the indexes correctly, so a regex seems to be the only solution. I have the following regex, which has a problem with the "[^]" part. How can I fix it?

<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>

so that it extracts the author from the following text:

<post author="luckylindyslocale" datetime="2012-03-03T04:52:00" id="p7">
<img src="http://img.photobucket.com/albums/v303/lucky196/siggies/ls1.png"/>

Grams thank you, for this wonderful tag and starting this thread. I needed something to encourage me to start making some new tags.

<img src="http://img.photobucket.com/albums/v303/lucky196/holidays/stpatlucky.jpg"/>
Cruelty is one fashion statement we can all do without. ~Rue McClanahan
</post>
user2372074

2 Answers

Why couldn't the regex

<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>

extract the author from the following text?

Because

[^</post>]*

is a negated character class, not a way of saying "anything except the string </post>". It matches any single character other than <, /, p, o, s, t, and >, repeated zero or more times.
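To see this behavior in isolation, here is a minimal sketch that applies only the character class to short inputs. The class name NegatedClassDemo and the helper consumedPrefix are made up for illustration and are not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // Hypothetical helper: returns the prefix of `input` that [^</post>]*
    // consumes when the match is anchored at index 0.
    static String consumedPrefix(String input) {
        Matcher m = Pattern.compile("[^</post>]*").matcher(input);
        // lookingAt() anchors at the start; with '*' it always succeeds,
        // possibly with an empty match.
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Stops at the 's' in "Grams", because 's' is one of the excluded characters.
        System.out.println("[" + consumedPrefix("Grams thank you") + "]");
        // Stops immediately at '<', so nothing is consumed.
        System.out.println("[" + consumedPrefix("<img src=\"x.png\"/>") + "]");
    }
}
```

With the first input the class consumes only "Gram", which is why it can never reach the closing </post> in a real post body.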

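That is not what your text looks like: the post body contains those characters (the < of <img, and the letters p, o, s and t in ordinary words), so the class stops long before </post> and the overall match fails. As for how to fix it, consider using the following regex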

<post\s*author=\"([^\"]+?)\"[^>]+>(.|\s)*?<\/post>
// obviously, escape appropriate characters in Java String literals

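No extra flag is needed for this version, because (.|\s) already matches line breaks. If you simplify the body to .*?, compile the pattern with Pattern.DOTALL instead; the MULTILINE flag only changes how ^ and $ behave.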
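Since the question also asks for indexes, here is a rough sketch of how the match offsets could be read from a Matcher. It uses a slight variant of the regex above (.*? with Pattern.DOTALL rather than (.|\s)*?, and \s+ before author), which is my assumption and not the pattern from the question. The class name AuthorOffsets, the method describeAuthors and the "author@start-end" output format are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuthorOffsets {
    // Assumed variant of the answer's pattern: [^>]* skips the remaining
    // attributes, and .*? with DOTALL lazily spans the body up to </post>.
    private static final Pattern POST = Pattern.compile(
            "<post\\s+author=\"([^\"]+)\"[^>]*>(.*?)</post>", Pattern.DOTALL);

    // Hypothetical helper: one "author@start-end" entry per post, where start
    // and end are offsets of the author value from the beginning of the text.
    static List<String> describeAuthors(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = POST.matcher(text);
        while (m.find()) {
            // start(1) is inclusive and end(1) is exclusive for capture group 1;
            // m.start() and m.end() give the span of the whole <post> element.
            result.add(m.group(1) + "@" + m.start(1) + "-" + m.end(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<post author=\"luckylindyslocale\" id=\"p7\">\n"
                + "<img src=\"a.png\"/>\nsome post text\n</post>";
        System.out.println(describeAuthors(sample));
    }
}
```

The offsets come straight from Matcher.start(int) and Matcher.end(int), so they refer to positions in the original, unmodified string.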

Sotirios Delimanolis

You can simply do it like this:

/<post author="(.*?)"/

The comments are right that regex is not the best tool for parsing HTML, but this should do what you are looking for.

Halfwarr
  • Do you understand my question? Do you know of any parser that can preserve the offsets? Most people know regex is not the best tool for parsing XML or HTML, so don't repeat that. BTW, your solution is very unsafe by any measure. – user2372074 Aug 05 '14 at 15:51
  • @user2372074 How exactly is it unsafe? What are you trying to do? Also, no, I have no idea what you mean by offset. – Halfwarr Aug 05 '14 at 15:53
  • Offset means I need the index of the author, counted from the beginning of the text. – user2372074 Aug 05 '14 at 15:56
  • @user2372074 Write a parser that preserves offsets. Then you'll have a parser that preserves offsets. – David Conrad Aug 05 '14 at 16:42
  • @David I would if I had time. Writing a parser doesn't seem trivial, although I've never tried. – user2372074 Aug 05 '14 at 16:47