13

I have a sample input file as follows, with columns Id, Name, start date, end date, Age, Description, and Location:

220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad
221;Paul;23/11/2008;22/12/2008;30;He is a software engineer at MNC;Bangalore
222;Emma;23/11/2008;22/12/2008;25;Working as a mechanical engineer;Chennai

It contains 30 lines of data. My requirement is to extract only the descriptions from the above text file.

My output should contain

Working as a professor in University

He is a software engineer at MNC

Working as a mechanical engineer

I need to find a regular expression to extract the Description, and have tried many kinds, but I haven't been able to find the solution. How can I do it?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
mahodaya
  • 167
  • 1
  • 1
  • 6

4 Answers

24

You can use this regex:

[^;]+(?=;[^;]*$)

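[^;] matches any character except ;

+ is a quantifier that matches the preceding character or group one or more times

* is a quantifier that matches the preceding character or group zero or more times

$ is the end of the string

(?=pattern) is a lookahead, which checks whether a particular pattern occurs ahead without consuming it

Taken together, the regex matches the field that is followed by exactly one more semicolon-separated field before the end of the line, i.e. the second-to-last column, which is the Description here.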
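As a rough illustration, the lookahead pattern can be tried out in Python's `re` module. This is only a sketch: the rows are modelled on the question's sample in the stated seven-column order, and the helper name `extract_description` is made up for the example.

```python
import re

# Pattern from the answer: a run of non-semicolon characters that is followed
# by exactly one more field before the end of the string.
PATTERN = re.compile(r"[^;]+(?=;[^;]*$)")

# Rows modelled on the question's sample (assumed seven-column layout).
rows = [
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad",
    "221;Paul;23/11/2008;22/12/2008;30;He is a software engineer at MNC;Bangalore",
    "222;Emma;23/11/2008;22/12/2008;25;Working as a mechanical engineer;Chennai",
]


def extract_description(line):
    """Return the second-to-last field of a semicolon-delimited line, or None."""
    match = PATTERN.search(line.rstrip("\n"))
    return match.group(0) if match else None


descriptions = [extract_description(row) for row in rows]
for description in descriptions:
    print(description)
```

Because the pattern is anchored to the end of the line, it does not depend on how many columns precede the Description, only on there being exactly one column after it.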

Anirudha
  • 32,393
  • 7
  • 68
  • 89
5

/^(?:[^;]+;){5}([^;]+)/ will grab the sixth field between semicolons, which is the Description in the seven-column layout shown in the question.

Although, as stated in my comment, you should just split the string by semicolon and grab the sixth element of the split. That's the whole point of a delimited file: you don't need complex pattern matching.

Example implementation in Perl using your input example:

open(my $IN, '<', 'input.txt') or die $!;

while (my $line = <$IN>) {
    # Skip five semicolon-terminated fields, then capture the sixth (Description).
    my ($desc) = $line =~ /^(?:[^;]+;){5}([^;]+)/;
    print "'$desc'\n" if defined $desc;
}
close $IN;

yields:

'Working as a professor in University'
'He is a software engineer at MNC'
'Working as a mechanical engineer'
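
The split-based approach recommended above might look roughly like this in Python. It is a sketch that assumes the seven-column layout listed in the question, where Description is the sixth field (zero-based index 5); the function name `description_by_split` and its `index` parameter are invented for the example.

```python
def description_by_split(line, index=5):
    """Split a semicolon-delimited line and return the field at `index`.

    index=5 is an assumption: Description is the sixth column in the
    layout Id;Name;start date;end date;Age;Description;Location.
    Returns None when the line has too few fields.
    """
    fields = line.rstrip("\n").split(";")
    return fields[index] if len(fields) > index else None


# Rows modelled on the question's sample data.
rows = [
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad\n",
    "221;Paul;23/11/2008;22/12/2008;30;He is a software engineer at MNC;Bangalore\n",
    "222;Emma;23/11/2008;22/12/2008;25;Working as a mechanical engineer;Chennai\n",
]

split_descriptions = [description_by_split(row) for row in rows]
for description in split_descriptions:
    print(f"'{description}'")
```

If the number of leading columns varies between rows, indexing from the end (`index=-2`) may be the safer choice, since the Description is always followed by exactly one column.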
Lone Shepherd
  • 965
  • 1
  • 7
  • 25
0

This should work:

/^[^\s]+\s+[^\s]+\s+[^\s]+\s+(.+)\s+[^\s]+$/m

Or, as Lone Shepherd pointed out:

/^\S+\s+\S+\s+\S+\s+(.+)\s+\S+$/m

Or with semicolons:

/^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$/m
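
To see how the semicolon pattern behaves once the two date columns are present, here is a small Python sketch. It compares the pattern above with the end-anchored variant that appears in the comments below; the five-column row is the one quoted in the comments, and the helper name `capture` is made up for the example.

```python
import re

# Semicolon pattern from the answer: three leading fields, then a greedy capture.
FIXED_PREFIX = re.compile(r"^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$", re.MULTILINE)
# Follow-up pattern from the comments: anchor on the last two fields instead.
FROM_END = re.compile(r"^.*;([^;]+);+[^;]+$", re.MULTILINE)

five_cols = "220;John;28;Working as a Professor in University;Hyderabad"
seven_cols = "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad"


def capture(pattern, line):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(line)
    return match.group(1) if match else None


prefix_on_five = capture(FIXED_PREFIX, five_cols)
prefix_on_seven = capture(FIXED_PREFIX, seven_cols)
from_end_on_seven = capture(FROM_END, seven_cols)

print(prefix_on_five)     # Description only
print(prefix_on_seven)    # end date, age and Description together
print(from_end_on_seven)  # Description only
```

The greedy `(.+)` swallows every column between the third field and the last one, so counting fields from the start only works when the column count is fixed; anchoring on the end of the line avoids that dependency.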
Eric
  • 2,056
  • 13
  • 11
  • 1
  • `\S` is the same as `[^\s]` – Lone Shepherd Feb 19 '13 at 05:03
  • No, it's not working for 220;John;28;Working as a Professor in University;Hyderabad – mahodaya Feb 19 '13 at 05:05
  • This almost works if you can use a line modifier (m in PHP), so that ^ represents the beginning of the line while $ represents the end. In the previous example, though, I was just missing one column. `/^[^\s]+\s+[^\s]+\s+[^\s]+\s+(.+)\s+[^\s]+$/m` – Eric Feb 19 '13 at 05:14
  • And now I see you reverted back to semicolons. `/^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$/m` – Eric Feb 19 '13 at 05:19
  • No, it works great in PHP using preg_replace. Of course, you never even specified whether it's a Perl regular expression that you needed, let alone what language this is for. – Eric Feb 19 '13 at 05:34
  • I am using Annotation Query Language (AQL), a language for IBM BigInsights text analytics, to extract data from text files. – mahodaya Feb 19 '13 at 05:39
  • According to the documentation I'm reading on that language, it should work. Of course, that was without the date added in there. This one should work as long as there is only one column after the text you want: `/^.*;([^;]+);+[^;]+$/m` (you don't need the m) – Eric Feb 19 '13 at 05:46
  • `/^.*;([^;]+);+[^;]+$/` is also not extracting my output; it is extracting the whole data in a single line – mahodaya Feb 19 '13 at 06:27
0

It seems relatively straightforward:

https://regex101.com/r/W9nfsd/2

.*;(.*);.*$

It is similar to Anirudha's answer, but a little simpler.
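
A quick way to check this pattern outside regex101 is a few lines of Python. This is a sketch only: the first row comes from the question, the second row (with an empty Description) is invented to show one difference from the `[^;]+` version, and the helper name `second_to_last` is hypothetical.

```python
import re

# Pattern from this answer: the greedy leading .* backtracks just far enough
# to leave two semicolons, so group 1 is the second-to-last field.
PATTERN = re.compile(r".*;(.*);.*$")


def second_to_last(line):
    """Return the second-to-last semicolon-separated field, or None."""
    match = PATTERN.match(line.rstrip("\n"))
    return match.group(1) if match else None


sample = "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad"
# Invented row with an empty Description column.
empty_description = "223;Ann;01/01/2009;31/01/2009;41;;Pune"

result_sample = second_to_last(sample)
result_empty = second_to_last(empty_description)

print(repr(result_sample))
print(repr(result_empty))
```

Since `(.*)` accepts zero characters, an empty Description yields an empty string here, whereas a pattern built on `[^;]+` requires at least one character in the field.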

Mark
  • 143,421
  • 24
  • 428
  • 436