Extract some part of text separated by a delimiter using a regex

Question

I have a sample input file as follows, with columns Id, Name, start date, end date, Age, Description, and Location:

220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad
221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore
222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai

It contains 30 lines of data. My requirement is to only extract descriptions from the above text file.

My output should contain

Working as a professor in University

He is a software engineer at MNC

working as a mechanical engineer

I need to find a regular expression to extract the Description, and have tried many kinds, but I haven't been able to find the solution. How can I do it?

I may have messed up on my edit, did you mean to have the semicolons and commas in there? — Lance Roberts, Feb 19 '13 at 05:04
OK, please re-edit with them. Sorry, thinking about databases too much. — Lance Roberts, Feb 19 '13 at 05:04
Why do you want a regex? Just split by semicolon and grab the 4th column and you're done. Also, you should tag with what language you are using. — Lone Shepherd, Feb 19 '13 at 05:08
The data is a mess. John has two dates then a number (age); Paul has a number and two dates; Emma has a date and a date scrunched up with the number. The columns listed don't include either of the date columns. (Someone can't spell 'engineer', or 'Bangalore'). How will the regex know to convert `Working` to `working`? That's tremendously fiddly! — Jonathan Leffler, Feb 20 '13 at 05:07

score 24 · Accepted Answer · edited Oct 22 '21 at 00:38


You can use this regex:

[^;]+(?=;[^;]*$)

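[^;] matches any single character except ;

+ is a quantifier that matches the preceding character or group one or more times

* is a quantifier that matches the preceding character or group zero or more times

$ is the end of the string

(?=pattern) is a lookahead, which checks that the pattern occurs immediately ahead without consuming it

Put together, the regex matches a run of non-semicolon characters that is followed by one semicolon and a final field running to the end of the string, that is, the second-to-last field.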
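As a rough illustration of how this pattern behaves, here is a minimal Python sketch that applies it to each row separately, so that $ always means the end of that row. The names ROWS, SECOND_TO_LAST and extract_description are made up for this example, and Python is only an assumption here, since the question does not use it.

```python
import re
from typing import Optional

# Sample rows copied from the question. The description is always the
# second-to-last field, even though the earlier columns are inconsistent.
ROWS = [
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad",
    "221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore",
    "222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai",
]

# A run of non-semicolons, followed by ';' and a final field ending the string.
SECOND_TO_LAST = re.compile(r"[^;]+(?=;[^;]*$)")


def extract_description(row: str) -> Optional[str]:
    """Return the second-to-last field of a semicolon-delimited row, or None."""
    match = SECOND_TO_LAST.search(row)
    return match.group(0) if match else None


descriptions = [extract_description(row) for row in ROWS]
for description in descriptions:
    print(description)
```

Matching one row at a time sidesteps the question of whether the engine treats $ as end of line or end of input, which differs between regex flavours.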

answered Feb 19 '13 at 05:27 by Anirudha · edited Oct 22 '21 at 00:38 by Peter Mortensen

([^;]+(?=;[^;]*(\r?\n|$))) – AMit SiNgh Mar 19 '18 at 12:23

score 5 · Answer 2 · edited Oct 22 '21 at 00:37


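/^(?:[^;]+;){3}([^;]+)/ will grab the fourth field between semicolons. That matches the description only when it is the fourth column, as in the five-field row quoted in the comments (220;John;28;Working as a Professor in University;Hyderabad). In the sample now shown in the question, the description's position counted from the start varies from row to row, so a fixed count does not fit it.

That said, as stated in my comment, you should just split the string on semicolons and take the fourth element. That's the whole point of a delimited file: you don't need complex pattern matching.

Example implementation in Perl, assuming the description is the fourth field of each line in input.txt:

open(my $IN, '<', 'input.txt') or die "Cannot open input.txt: $!";

while (my $line = <$IN>) {
    # Skip three semicolon-terminated fields, then capture the fourth.
    my ($desc) = $line =~ /^(?:[^;]+;){3}([^;]+)/;
    print "'$desc'\n" if defined $desc;
}
close $IN;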

yields:

'Working as a professor in University'
'He is a software engineer at MNC'
'Working as a mechanical engineer'
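The split-on-the-delimiter approach recommended in this answer could look roughly like the following Python sketch. It is not part of the original answer: the function name second_to_last_field is invented for illustration, and it counts from the end of the row rather than taking the fourth element, because in the sample shown in the question the description is always the second-to-last field while its position from the start is not stable.

```python
# Sample rows copied from the question.
ROWS = [
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad",
    "221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore",
    "222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai",
]


def second_to_last_field(row: str) -> str:
    """Split a row on ';' and return the field just before the last one."""
    fields = row.rstrip("\r\n").split(";")
    if len(fields) < 2:
        raise ValueError(f"expected at least two fields, got: {row!r}")
    return fields[-2]


descriptions = [second_to_last_field(row) for row in ROWS]
for description in descriptions:
    print(f"'{description}'")
```

For a layout where the description really is the fourth column, `fields[3]` would do the same job.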

answered Feb 19 '13 at 05:13 by Lone Shepherd · edited Oct 22 '21 at 00:37 by Peter Mortensen

I can only use a regex (//) in my code; I cannot use the code above. – mahodaya Feb 19 '13 at 05:27
What I provided *is* a regex. And since you didn't indicate what language you were using, I provided a sample implementation making use of the regex. – Lone Shepherd Feb 19 '13 at 05:28
I am using the AQL language for BigInsights text analytics. – mahodaya Feb 19 '13 at 05:33

score 0 · Answer 3 · edited Oct 22 '21 at 00:26


This should work:

/^[^\s]+\s+[^\s]+\s+[^\s]+\s+(.+)\s+[^\s]+$/m

Or, as Lone Shepherd pointed out:

/^\S+\s+\S+\s+\S+\s+(.+)\s+\S+$/m

Or with semicolons (this assumes the description is the fourth of five fields):

/^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$/m

answered Feb 19 '13 at 05:01 by Eric · edited Oct 22 '21 at 00:26 by Peter Mortensen

`\S` is the same as `[^\s]` – Lone Shepherd Feb 19 '13 at 05:03
No, it's not working: 220;John;28;Working as a Professor in University;Hyderabad – mahodaya Feb 19 '13 at 05:05
This almost works if you can use a line modifier (m in PHP), so that ^ represents the beginning of the line while $ represents the end. In the previous example, though, I was just missing one column. `/^[^\s]+\s+[^\s]+\s+[^\s]+\s+(.+)\s+[^\s]+$/m` – Eric Feb 19 '13 at 05:14
And now I see you reverted back to semicolons. `/^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$/m` – Eric Feb 19 '13 at 05:19
No, it works great in PHP using preg_replace. Of course, you never even specified whether it's a Perl regular expression that you needed, let alone what language this is for. – Eric Feb 19 '13 at 05:34
I am using Annotation Query Language (AQL), which is used to extract data from text files; it is a language for IBM BigInsights text analytics. – mahodaya Feb 19 '13 at 05:39
According to the documentation I'm reading on that language, it should work. Of course, that was without the dates added in there. This one should work as long as there is only one column after the text you want: `/^.*;([^;]+);+[^;]+$/m` (you don't need the m) – Eric Feb 19 '13 at 05:46
/^.*;([^;]+);+[^;]+$/ is also not extracting my output; it is extracting the whole data in a single line – mahodaya Feb 19 '13 at 06:27

score 0 · Answer 4 · edited Oct 22 '21 at 00:39


It seems relatively straightforward:

https://regex101.com/r/W9nfsd/2

.*;(.*);.*$

It is similar to Anirudha's answer, but a little simpler.
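To make the behaviour of this pattern concrete, here is a small Python sketch that applies it row by row and reads the capture group. The helper name description_greedy is hypothetical, and Python is only an assumed host language for the regex.

```python
import re
from typing import Optional

# Sample rows copied from the question.
ROWS = [
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad",
    "221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore",
    "222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai",
]

# The leading greedy .* swallows as much as it can, so the capture group
# is left with whatever sits between the last two semicolons.
GREEDY = re.compile(r".*;(.*);.*$")


def description_greedy(row: str) -> Optional[str]:
    """Return the text between the last two semicolons, or None if no match."""
    match = GREEDY.search(row)
    return match.group(1) if match else None


descriptions = [description_greedy(row) for row in ROWS]
for description in descriptions:
    print(description)
```

One difference from the lookahead version: because the group is (.*), an empty second-to-last field still matches and comes back as an empty string, whereas [^;]+ requires at least one character.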

answered Jan 04 '19 at 04:30 by Mark · edited Oct 22 '21 at 00:39 by Peter Mortensen
