Regular Expressions - Greedy but stop before a string match

Question

I have some data that I'd like to convert into a table format.

Here's the input data

1- This is the 1st line with a 
newline character
2- This is the 2nd line

Each line may contain multiple newline characters.

Output

<td>1- This is the 1st line with a 
newline character</td>
<td>2- This is the 2nd line</td>

I've tried the following

^(\d{1,3}-)[^\d]*

but it seems to match only up to the digit 1 in "1st".

I'd like to stop matching when I find another \d{1,3}\- in my string. Any suggestions?

EDIT: I'm using EditPad Lite.

qwertyboy · Answer 1 · 2012-05-29T06:45:42.690

You did not specify a language (there are many regexp implementations), but in general, what you are looking for is called "positive lookahead", which lets you add patterns that will influence the match, but will not become part of it.

Search for lookahead in the documentation of whatever language you are using.

Edit: the following sample seems to work in vim.

:%s#\v(^\d+-\_.{-})\ze(\n\d+-|%$)#<td>\1</td>

Annotation below:

%      - for all lines
s#     - substitute the following (you can use any delimiter, and slash is most
         common, but as that will require that we escape slashes in the command
         I chose to use the number sign)
\v     - very magic mode, lets us use fewer backslashes
(      - start group for back referencing
^      - start of line
\d+    - one or more digits (as many as possible)
-      - a literal dash!
\_.    - any character, including a newline
{-}    - zero or more of these (as few as possible)
)      - end group
\ze    - end match (anything beyond this point will not be included in the match)
(      - start a new group
\n     - newline ([\n\r] would accept any line-ending format - thanks Alan)
\d+    - one or more digits
-      - a dash
|      - or
%$     - end of file
)      - end group
#      - start substitute string
<td>\1</td> - a TD tag around the first matched group
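
For readers outside vim, the same idea (a lazy match that stops at a lookahead) can be sketched in Python. This is only an illustration, not part of the original answer: it assumes "\n" line endings, and the names `record` and `wrap_records` are made up for the example.

```python
import re

# Sample input from the question, assuming "\n" line endings.
text = "1- This is the 1st line with a \nnewline character\n2- This is the 2nd line"

# ^\d+-           a line number and a dash at the start of a line (re.M)
# .*?             as few characters as possible, newlines included (re.S)
# (?=\n\d+-|\Z)   stop just before the next numbered line, or at end of input
record = re.compile(r"^\d+-.*?(?=\n\d+-|\Z)", re.M | re.S)

def wrap_records(s):
    # Hypothetical helper: wrap every matched record in <td>...</td>.
    return record.sub(lambda m: "<td>" + m.group(0) + "</td>", s)

result = wrap_records(text)
print(result)
```

The lookahead plays the role of `\ze` in the vim command: the text it inspects is checked but never becomes part of the match, so the next record's number is left for the next match.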

@Nerrve, I have edited my answer to include a sample that will work with vim, which you can easily download. I hope this helps. — qwertyboy, May 29 '12 at 06:35

guido · Answer 2 · 2012-05-27T17:16:02.390

This is for vim, and uses a zero-width positive lookahead:

/^\d\{1,3\}-\_.*[\r\n]\(\d\{1,3\}-\)\@=

Steps:

/^\d\{1,3\}-              1 to 3 digits followed by -
\_.*                      any number of characters including newlines/linefeeds
[\r\n]\(\d\{1,3\}-\)\@=   followed by a newline/linefeed ONLY if it is followed 
                          by 1 to 3 digits followed by - (the first condition)

EDIT: This is how it would be in pcre/ruby:

/(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)/m

Note you need a string ending with a newline to match the last entry.
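
The trailing-newline caveat can be checked with a small Python sketch. It assumes that Python's `re.S` flag can stand in for the `/m` in the Ruby form (making the dot match newlines); the sample strings and the helper `count_entries` are invented for the illustration.

```python
import re

# The pattern from above, with re.S so that "." also matches newlines.
entry = re.compile(r"(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)", re.S)

# Made-up sample data: two entries, the first spanning two lines.
without_newline = "1- first\ncontinued\n2- second"
with_newline = without_newline + "\n"

def count_entries(s):
    # Hypothetical helper: number of entries the pattern finds in s.
    return len(entry.findall(s))

print(count_entries(without_newline))
print(count_entries(with_newline))
```

Every entry has to end in `[\r\n]`, so the last entry is only found when the input itself ends with a line break.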

edited May 27 '12 at 17:16

answered May 27 '12 at 12:53

guido

I tried this, but it seems to match everything except the last line. So if there are 10 lines {1- bla bla, 2- bla bla,..., 9- bla bla, 10- bla bla}, it will match all the lines except the 10th one. – Abbas Gadhia May 27 '12 at 13:11
Fixed the PCRE regex (with a note); by the way, unless the goal is to learn regexes, I suggest using something similar to Eugene's approach here. – guido May 27 '12 at 17:17

score 2 · Answer 3 · answered May 29 '12 at 04:07

SEARCH:   ^\d+-.*(?:[\r\n]++(?!\d+-).*)*

REPLACE:  <td>$0</td>

[\r\n]++ matches one or more carriage returns or linefeeds, so you don't have to worry about whether the file uses Unix (\n), DOS (\r\n), or older Mac (\r) line separators.

(?!\d+-) asserts that the first thing after the line separator is not another line number.

I used the possessive + in [\r\n]++ to make sure it matches the whole separator. Otherwise, if the separator is \r\n, [\r\n]+ could backtrack to match only the \r, and (?!\d+-) would then succeed at the \n.

Tested in EditPad Pro, but it should work in Lite as well.
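
As a rough cross-check outside EditPad, here is the same search-and-replace sketched in Python. It is a simplification rather than a port: it assumes plain "\n" line endings, so the possessive `[\r\n]++` is replaced by a single `\n`, and `\g<0>` takes the place of `$0`. The name `numbered` is illustrative.

```python
import re

# Sample input from the question, assuming "\n" line endings.
text = "1- This is the 1st line with a \nnewline character\n2- This is the 2nd line"

# ^\d+-.*               a numbered line (re.M lets ^ match at every line start)
# (?:\n(?!\d+-).*)*     any following lines that do NOT begin with a line number
numbered = re.compile(r"^\d+-.*(?:\n(?!\d+-).*)*", re.M)

# \g<0> inserts the whole match, like $0 in the REPLACE field above.
result = numbered.sub(r"<td>\g<0></td>", text)
print(result)
```

Because the dot never crosses a line break here, the pattern advances one line at a time and only continues while the negative lookahead confirms the next line is a continuation.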

score 1 · Answer 4 · answered May 27 '12 at 12:46

(\d+-.+(\r|$)((?!^\d-).+(\r|$))?)

answered May 27 '12 at 12:46

dda

This matches the whole text! :) I'm using EditPad Lite if that is of any information. – Abbas Gadhia May 27 '12 at 13:19
I tried it before posting. On your own sample. It's working if you are using a PCRE engine. – dda May 27 '12 at 13:53
I tried it on http://regexpal.com/ with my example and your expression, and it did the same thing. Is regexpal PCRE? – Abbas Gadhia May 27 '12 at 16:51

score 1 · Answer 5 · answered May 27 '12 at 12:53

You can match only the separators and split on them. In C#, for example, it could be done like this:

string s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line";
string ss = "<td>" + string.Join("</td>\r\n<td>", Regex.Split(s.Substring(3), "\r\n\\d{1,3}- ")) + "</td>";
MessageBox.Show(ss);
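
A comparable split-based sketch in Python follows. It deliberately differs from the C# above in one respect: the separator pattern uses a lookahead, so the "1- " and "2- " prefixes stay in the output, as in the question's expected result. The helper name `to_cells` is made up for the example.

```python
import re

# Same sample string as the C# snippet (DOS line endings).
s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line"

def to_cells(text):
    # Split at line breaks that are followed by a line number; the lookahead
    # keeps the number itself out of the separator, so it is not discarded.
    parts = re.split(r"\r?\n(?=\d{1,3}-)", text)
    return "\n".join("<td>" + part + "</td>" for part in parts)

cells = to_cells(s)
print(cells)
```

Splitting avoids having to describe what a record contains; only the boundary between two records needs a pattern.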

answered May 27 '12 at 12:53

Eugene Ryabtsev

I'll try this out in Java and let you know. – Abbas Gadhia May 27 '12 at 13:20

score 1 · Answer 6 · answered May 27 '12 at 17:55

Would it be good for you to do it in 3 steps?

(These are Perl regexes.)

Replace the first:

$input =~ s/^(\d{1,3}-)/<td>$1/;

Replace the rest:

$input =~ s/\n(\d{1,3}-)/<\/td>\n<td>$1/gm;

Add the last:

$input .= '</td>';

answered May 27 '12 at 17:55

ilomambo