0

I'm trying to capture the username from these lines:

title="user1 is online now"><b><font color="#2568BA"><b>user1</b></font></b></a>
title="user2 is online now"><b>user2</b></a>

With this as the pattern:

title=".{1,16} is \w{5,8}? now"><b>(?:<font color="#\w{6}">)<b>(?<text>.+?)</b>(?:</font>)</b></a>?

But it only captures user1. The `<font color>` tag needs to be ignored: sometimes it's there, sometimes it's not.

I've been struggling with this for hours now. What am I missing?

Soner Gönül
Quoter
  • Why are you bothering with anything after the `title=".{1,16} is \w{5,8}? now"`? – Rawling Oct 29 '15 at 12:05
  • [Using regular expressions to parse HTML: why not?](http://stackoverflow.com/q/590747/1324033) – Sayse Oct 29 '15 at 12:07
  • @Rawling, because it will match too many of the same username. What I pasted was just a fraction of a html source page. – Quoter Oct 29 '15 at 12:10
  • How are you currently getting/parsing the HTML? I guess you are doing it with some sort of HTML parser, aren't you? – Wiktor Stribiżew Oct 29 '15 at 12:10
  • @Sayse, not parsing anything, just messing around with regex with html – Quoter Oct 29 '15 at 12:13
  • Why not just `(?<=title=")\S+` ? Or, if you want a capturing group: `title="(?\S+)` – Ron Rosenfeld Oct 29 '15 at 12:27
  • @RonRosenfeld, because the group I want to catch is between the `<b>` tags, where you see the group `(?<text>.+?)`. – Quoter Oct 29 '15 at 12:31
  • Will it always be the same as the username that follows `title=`? – Ron Rosenfeld Oct 29 '15 at 12:34
  • @RonRosenfeld, yes, but because it matches the same user 3 times, I had to look for something unique and make the match only occur once. Therefore I had to make the matching string a bit longer. – Quoter Oct 29 '15 at 12:36
  • I don't understand. In the examples you posted, my suggestion matches only the user name that follows `Title=`. – Ron Rosenfeld Oct 29 '15 at 12:38
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/93683/discussion-between-quoter-and-ron-rosenfeld). – Quoter Oct 29 '15 at 12:45

3 Answers

1

The following might work.

  • Assume that the username follows `title="` and is followed by ` is online` or ` is offline`
  • Capture that instance into capturing group 1
  • Use a backreference to find the last instance of the username in the line
  • Capture that into the named capturing group `UserName`

title="(\S+)(?= is (?:on|off)line).*(?<UserName>\k<1>)

If you wanted to, you could also capture the on or off line status.
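
As a rough sketch of that idea, the pattern can be extended with a second named group for the status. The snippet uses Python purely for illustration, so named groups are written as `(?P<name>...)` and the backreference as `\1`. The group name `status` and the helper `extract` are made up for this example and are not part of the pattern above.

```python
import re

# Python translation of the pattern above, with the lookahead replaced by a
# consuming match so the on/off status can be captured as well.
# Group 1 is the username after title=", and \1 re-matches its last
# occurrence in the line (the greedy .* pushes the match as far right as possible).
PATTERN = re.compile(
    r'title="(\S+) is (?P<status>on|off)line.*(?P<UserName>\1)'
)

SAMPLES = [
    'title="user1 is online now"><b><font color="#2568BA"><b>user1</b></font></b></a>',
    'title="user2 is online now"><b>user2</b></a>',
]


def extract(line):
    """Return (username, status) for one line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group("UserName"), m.group("status") + "line"


for sample in SAMPLES:
    print(extract(sample))
```

Because the username is taken from the backreference, this only works as long as the name inside the tags is spelled exactly as it is in the `title` attribute.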

Ron Rosenfeld
0

For those examples this should work:

title="\S+\sis\s(?:on|off)line\snow">(?:<b><font[^>]+>)?<b>(.*?)</b>
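
The key difference from the pattern in the question is the trailing `?` on the non-capturing group, which makes the font markup optional. The group also wraps the outer `<b>`, because the second sample line has only one `<b>` before the name. Below is a minimal check against the two sample lines, written in Python for illustration; the names `PATTERN`, `SAMPLES` and `usernames` are made up for this example.

```python
import re

# Same pattern as above; this syntax is also valid for Python's re module.
# (?:<b><font[^>]+>)? optionally consumes the outer <b> plus the <font ...> tag.
PATTERN = re.compile(
    r'title="\S+\sis\s(?:on|off)line\snow">(?:<b><font[^>]+>)?<b>(.*?)</b>'
)

SAMPLES = [
    'title="user1 is online now"><b><font color="#2568BA"><b>user1</b></font></b></a>',
    'title="user2 is online now"><b>user2</b></a>',
]


def usernames(lines):
    """Collect the first capture group from every line that matches."""
    found = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            found.append(m.group(1))
    return found


print(usernames(SAMPLES))  # ['user1', 'user2']
```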
Neal
-1

You can use the following regex:

<[^>]*>(user\d+)<[^>]*>
Mayur Koshti
  • That matches everything between `<>`. That's not what I want. – Quoter Oct 29 '15 at 12:15
  • The word 'user' was just an example, none of the usernames actually are 'user' or start with 'user'. In fact 'user' is just random, it could be even Mayur. – Quoter Oct 29 '15 at 12:33