2

I have the text below and I want to match all the text in bold. Without depending on the prefix (i.e. the serial numbers), can I match just the bold characters using regular expressions?

  1. Spalding, K.L., Buchholz, B.A., Bergman, L.E., Druid, H., Frisén, J.: Forensics: e age written in teeth by nuclear tests. Nature 437(7057) (2005) 333–334
  2. Lovecraft, H.P.: HP Lovecraft: Tales: Tales. Library of America (2005)
  3. Duncan, R.: A survey of parallel computer architectures. Computer 23(2) (1990) 5–16
  4. Santos, N., Hoshino, Y.: Global distribution of rotavirus serotypes/genotypes and its implication for the development and implementation of an effective rotavirus vaccine. Reviews in medical virology 15(1) (2005) 29–56
  5. DIARRHOEA, R.: Rotavirus and other viral diarrhoeas. Bulletin of the World Health Organization 58(2) (1980) 183–198
  6. Barton, T.: Power and knowledge: astrology, physiognomics, and medicine under the Roman Empire. University of Michigan Press (2002)
  7. Gauquelin, M.: The cosmic clocks: From astrology to a modern science. H. Regnery Company (1967)
Cœur
Kishore Kumar Korada

3 Answers

2

You can create a regex that groups the authors into the first group:

^(?:\d+\. )([^:]*)

Explanation:

  • (?:...) is a non-capturing group
  • ^ is line start
  • \d+\. followed by a space matches one or more digits, a dot and a space
  • (...) is a capturing group
  • [^:]* matches everything that's not a colon

If you want to make sure to match only the right lines, you can add a lookahead to the end of the regex: (?=:). So the regex would be ^(?:\d+\. )([^:]*)(?=:)
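As a rough illustration, here is how the capturing version could be applied in Python. The choice of Python and its `re` module is an assumption (the question does not name an engine), and the name `all_authors` is made up for this sketch. The sample lines are taken from the question with the list indentation dropped.

```python
import re

# Three entries copied from the question (list indentation removed).
SAMPLE = (
    "1. Spalding, K.L., Buchholz, B.A., Bergman, L.E., Druid, H., Frisén, J.: "
    "Forensics: e age written in teeth by nuclear tests. Nature 437(7057) (2005) 333–334\n"
    "2. Lovecraft, H.P.: HP Lovecraft: Tales: Tales. Library of America (2005)\n"
    "3. Duncan, R.: A survey of parallel computer architectures. Computer 23(2) (1990) 5–16\n"
)

# re.MULTILINE makes ^ match at the start of every line, not just the text.
AUTHORS = re.compile(r"^(?:\d+\. )([^:]*)(?=:)", re.MULTILINE)


def all_authors(text: str) -> list[str]:
    """Return capture group 1 (everything between the serial number and the first colon)."""
    # With exactly one capturing group, findall returns that group's text.
    return AUTHORS.findall(text)


for authors in all_authors(SAMPLE):
    print(authors)
```

Note that the full match still includes the serial number; only group 1 holds the author list.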

Demo here.

This approach works with any number of digits. On the other hand, that is exactly why we can't use a lookbehind here: \d+ has a variable length, and many regex engines only accept fixed-length lookbehinds.
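The lookbehind restriction depends on the engine. As one example, Python's `re` module (an assumption, since the question does not say which engine is used) rejects a lookbehind containing \d+ at compile time but accepts one with a fixed digit count. The helper name `compiles` is made up for this sketch.

```python
import re


def compiles(pattern: str) -> bool:
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern, re.MULTILINE)
        return True
    except re.error:
        # Raised for, among other things, a variable-width lookbehind.
        return False


VARIABLE_WIDTH = r"(?<=^\d+\. )[^:]*(?=:)"   # \d+ can be any length
FIXED_WIDTH = r"(?<=^\d{3}\. )[^:]*(?=:)"    # always exactly 3 digits

print(compiles(VARIABLE_WIDTH))
print(compiles(FIXED_WIDTH))
```

Other engines may behave differently, so it is worth checking the documentation of the one you target.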

If you're willing to make assumptions, i.e. there can be 1..4 digits in the beginning, then you can use this:

((?<=^\d{1}\. )|(?<=^\d{2}\. )|(?<=^\d{3}\. )|(?<=^\d{4}\. ))([^:]*)(?=:)

Explanation:

  • (?<=^\d{3}\. ) is a fixed-length lookbehind for 3 digits, a dot and a space at the beginning of the line
  • (...|...|...) lists the alternative fixed-length lookbehinds. A bit verbose, I know. The lookbehinds, however, are not part of the match.
  • ([^:]*) matches and captures the non-colon characters
  • (?=:) is a lookahead for a colon. So we match the right lines only, but do not capture the colon

Demo here.
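A minimal sketch of the lookbehind variant in Python (again an assumed engine). Here the dot after the digits is escaped so that it only matches a literal dot, and the lookbehinds sit in a non-capturing group so the full match is the author list itself. The serial numbers in the sample were changed to 3, 12 and 144 purely to exercise different widths; `authors_full_match` is a made-up name.

```python
import re

# Entries from the question, renumbered (hypothetical serials) to vary the prefix width.
SAMPLE = (
    "3. Duncan, R.: A survey of parallel computer architectures. Computer 23(2) (1990) 5–16\n"
    "12. Lovecraft, H.P.: HP Lovecraft: Tales: Tales. Library of America (2005)\n"
    "144. Santos, N., Hoshino, Y.: Global distribution of rotavirus serotypes/genotypes\n"
)

# One fixed-length lookbehind per allowed digit count (1 to 4 digits).
LOOKBEHIND = re.compile(
    r"(?:(?<=^\d{1}\. )|(?<=^\d{2}\. )|(?<=^\d{3}\. )|(?<=^\d{4}\. ))[^:]*(?=:)",
    re.MULTILINE,
)


def authors_full_match(text: str) -> list[str]:
    """Return the full match of each hit; the serial number is not part of it."""
    return [m.group(0) for m in LOOKBEHIND.finditer(text)]


for authors in authors_full_match(SAMPLE):
    print(authors)
```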

Update

To match only the first author, we need to make a slight change: the capturing group should be ([^:,]*,[^:,]*), and the lookahead that ends the match should be (?=[:,]). So this is what the capturing regex looks like:

^(?:\d+\. )([^:,]*,[^:,]*)(?=[:,])

Demo here.

And this is what it looks like with lookbehinds:

((?<=^\d{1}\. )|(?<=^\d{2}\. )|(?<=^\d{3}\. )|(?<=^\d{4}\. ))([^:,]*,[^:,]*)(?=[:,])

Demo here.

Explanation: [^:,]*,[^:,]* is the trick to match an author. Each author has exactly one comma in their name, so we use a negated character class zero or more times: [^:,]*, then match one comma, and then the same negated character class zero or more times.
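To see the first-author pattern in action, here is a short Python sketch (Python is an assumption; `first_authors` is a made-up helper name). It runs the capturing regex over three entries copied from the question.

```python
import re

# Three entries copied from the question (list indentation removed).
SAMPLE = (
    "1. Spalding, K.L., Buchholz, B.A., Bergman, L.E., Druid, H., Frisén, J.: "
    "Forensics: e age written in teeth by nuclear tests. Nature 437(7057) (2005) 333–334\n"
    "2. Lovecraft, H.P.: HP Lovecraft: Tales: Tales. Library of America (2005)\n"
    "3. Duncan, R.: A survey of parallel computer architectures. Computer 23(2) (1990) 5–16\n"
)

# "surname, initials" = non-comma/colon run, one comma, another non-comma/colon run,
# which must be followed by a comma (more authors) or a colon (title starts).
FIRST_AUTHOR = re.compile(r"^(?:\d+\. )([^:,]*,[^:,]*)(?=[:,])", re.MULTILINE)


def first_authors(text: str) -> list[str]:
    """Return the first author of every numbered entry."""
    return FIRST_AUTHOR.findall(text)


for author in first_authors(SAMPLE):
    print(author)
```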

You will see that there are still some exceptions, e.g. at

Tamas Rev
  • I appreciate your answer and your effort in writing this. But in the demo you've mentioned, specified expression matches one or more authors in a row where I'm expecting it to be matched only with the bold one. How can I match what I only want? – Kishore Kumar Korada Mar 13 '18 at 09:18
  • Oh, so only the first author. So the group must end at the first comma or colon. Then the group should be `([^:.]*)` and the positive lookahead should be `(?=[:,])`. Updating the answer accordingly. – Tamas Rev Mar 13 '18 at 09:21
  • I had to do one more change to match the `,` from the first authors name. I think it should be okay now. – Tamas Rev Mar 13 '18 at 09:29
  • Awesome. This is what I was trying to get and failed. I thought I could match with strong/em tags. Thank you – Kishore Kumar Korada Mar 13 '18 at 09:33
1

I can identify this common pattern on each line in your example:

  • digits + a dot + a space
  • (text + comma + text) in bold
  • a comma or colon + anything

solution 1

With a non-capture operator, this translates to:

^(?:\d+\. )([^,]*,[^,:]*)

demo

solution 2

Alternative by replacing the non-capture operator with the look-behind operator:

(?<=\d\. )([^,]*,[^,:]*)

demo
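A small sketch of solution 2 in Python (an assumed engine; `first_author` is a made-up name). The lookbehind \d\. plus a space is exactly three characters wide, so fixed-width engines accept it, and the serial number stays out of the full match. The pattern is applied line by line because it has no ^ anchor.

```python
import re
from typing import Optional

# Fixed-width lookbehind: one digit, a dot, a space (the last digit of any serial).
FIRST = re.compile(r"(?<=\d\. )([^,]*,[^,:]*)")


def first_author(line: str) -> Optional[str]:
    """Return the first author of one entry, or None if the line has no serial prefix."""
    m = FIRST.search(line)
    return m.group(0) if m else None


print(first_author("144. Spalding, K.L., Buchholz, B.A.: Forensics"))
print(first_author("3. Duncan, R.: A survey of parallel computer architectures."))
```

Since only the last digit is checked, this pattern could also match after a "digit, dot, space" sequence in the middle of a line; anchoring, as in the other solutions, avoids that.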

solution 3

To explicitly solve http://play.inginf.units.it/#/level/12, you need the OR operator:

(?<=^.. |^... |^.... )([^,]*,[^,:]*)

demo

Cœur
  • Thank you. It's working. What if I want to just match the same without serial number and space after it. i.e instead of this "144. Spalding, K.L." , this "Spalding, K.L." – Kishore Kumar Korada Mar 15 '18 at 05:02
  • @KishoreKumarKorada `(?:...)` is a non-capturing operator, so I'm not group matching your serial number (see https://stackoverflow.com/questions/3512471/…). But you can use the look-behind operator `(?<=...)` as an alternative for full matching, and the OR operator `(...|...)` to solve your challenge. – Cœur Mar 15 '18 at 06:46
-1

my solution

(?<=^\d+\.\s)(\w+,[\s\w\.]*)
Zoe
MonStar