
Could anybody help me write a proper regular expression to extract a value from a block of text in Ruby? I tried a lot, but I don't know how to handle variable-length titles.

The string will be of format <sometext>title:"<actual_title>"<sometext>. I want to extract actual_title from this string.

I tried /title:"."/ but it doesn't find any matches, because it expects the closing quotation mark right after a single character following the opening one. I couldn't figure out how to make it match a string of variable length. Any help is appreciated. Thanks.

Sainath Mallidi

3 Answers

/title:"([^"]*)"/

The parentheses create a capturing group. Inside is first a character class. The ^ means it's negated, so it matches any character that's not a ". The * means 0 or more. You can change it to one or more by using + instead of *.
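As a minimal sketch of how this pattern behaves in Ruby (the sample strings and variable names below are made up for illustration), the captured group can be read from the match, and the difference between * and + shows up on an empty title:

```ruby
# Hypothetical sample input in the format described in the question.
text = 'id:42 title:"Variable length title" author:"someone"'

# [^"]* matches zero or more characters that are not a double quote,
# so the capture stops right before the first closing quote.
match = text.match(/title:"([^"]*)"/)
title = match ? match[1] : nil
puts title

# String#[] with a regex and a group index returns the capture directly.
# With * an empty title still matches; with + it does not.
empty_with_star = 'title:""'[/title:"([^"]*)"/, 1]
empty_with_plus = 'title:""'[/title:"([^"]+)"/, 1]
p empty_with_star
p empty_with_plus
```

Checking the match for nil before indexing avoids a NoMethodError when the input contains no title at all.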

Matthew Flaschen

. matches any single character (in Ruby, any character except a newline unless the /m flag is used). Putting + after an element matches one or more of it, so .+ matches one or more characters of any sort. You should also put a question mark after the +, which makes the match lazy, so that it stops at the first closing quotation mark it comes across. So:

/title:"(.+?)"/

The parentheses are necessary if you want to extract the title text that it matched out of there.
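To see what the question mark changes, here is a small sketch comparing the greedy and lazy forms. The input string is invented for illustration and deliberately contains a second quoted value after the title:

```ruby
# Hypothetical input with two quoted values on the same line.
text = 'title:"First" other:"Second"'

greedy = text.match(/title:"(.+)"/)   # .+ runs to the LAST quote on the line
lazy   = text.match(/title:"(.+?)"/)  # .+? stops at the FIRST closing quote

puts greedy[1]  # First" other:"Second
puts lazy[1]    # First
puts lazy[0]    # title:"First"  (index 0 is the whole match)
```

Index 0 of the MatchData is the entire matched text, while index 1 is the content of the first pair of parentheses.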

Paige Ruten
  • Thanks for the explanation. Works nicely. – Sainath Mallidi Jun 03 '10 at 01:07
  • One more question: when I do string.match(/title:"(.+?)"/), it returns the entire match, including the `title:` part. Is there any neat way to get just the title, apart from the ugly chomping of characters that I am doing now? – Sainath Mallidi Jun 03 '10 at 01:17
  • Yeah, that's what the parentheses are for. Your string.match expression will return a MatchData object, which you can index into to get at the matched text inside the parentheses. In your case: `string.match(/title:"(.+?)"/)[1]` should do it. – Paige Ruten Jun 03 '10 at 01:20

I like /title:"(.+?)"/ because of its use of lazy matching, which stops the .+ from consuming all text up to the last " on the line.

It won't work if the string wraps lines or includes escaped quotes.
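Both limitations can be reproduced with a short sketch. The two input strings are invented for illustration; the second uses a backslash as the escape character, which is an assumption about the data:

```ruby
pattern = /title:"(.+?)"/

# A title that wraps onto a second line (double-quoted so \n is a newline).
multiline = "title:\"Line one\nLine two\""

# A title containing backslash-escaped quotes (single-quoted, so the
# backslashes are kept literally in the string).
escaped = 'title:"She said \"hi\" to me"'

# . does not match a newline by default, so there is no match at all.
multiline_result = multiline[pattern, 1]
p multiline_result

# The lazy match stops at the first quote, even though it is escaped.
escaped_result = escaped[pattern, 1]
puts escaped_result

# Adding the /m flag lets . match newlines, which handles the wrapped title.
multiline_with_m = multiline[/title:"(.+?)"/m, 1]
p multiline_with_m
```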

In programming languages where you want to be able to include the string delimiter inside a string, you usually provide an 'escape' character or sequence.

If your escape character was \ then you could write something like this...

/title:"((?:\\"|[^"])+)"/

railroad_diagram

This is a railroad diagram. Railroad diagrams show you the order in which things are parsed. Imagine you are a train starting at the left. You consume title:" and then \" if you can; if you can't, you consume any character that is not a ". The > means this path is preferred, so you try to loop again; if you can't, you have to consume a " to finish.

I made this with https://regexper.com/#%2Ftitle%3A%22((%3F%3A%5C%5C%22%7C%5B%5E%22%5D)%2B)%22%2F

There is now also a plugin for the Atom text editor that does this.
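To try the escape-aware pattern from this answer in Ruby, here is a rough sketch. It assumes the escape character is a backslash, and the input string and variable names are invented for illustration:

```ruby
# Escape-aware pattern: each step consumes either \" or one non-quote character.
pattern = /title:"((?:\\"|[^"])+)"/

# Hypothetical input; single quotes keep the backslashes literally.
text = 'title:"She said \"hi\" to me" rest:"x"'

# The capture still contains the escape characters.
raw = text[pattern, 1]
puts raw

# Turn each \" back into a plain " to get the actual title.
title = raw.gsub('\\"', '"')
puts title
```

Note that [^"] also accepts a lone backslash, so this sketch does not treat \\ as an escaped backslash; handling that case would need a more careful pattern.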

Nigel Thorne