1

I want to extract the first sentence of a file in bash. I used the following command:

sed 's/(\?|\.|!).*//' filename

However, it does not work. What is wrong with my regex?

If I have the following sentence in the file: Stack overflow is the best? I am also the best., the output needs to be Stack overflow is the best?

Note: the question mark needs to be there at the end. Also the sentence may end with full stop / question mark / exclamation mark.

Edit: The sentence might contain words like Mr. etc

humble

4 Answers

1
sed -r 's/([.*\?]|[.*\.]|[.*!]).*/\1/' file
              ^       ^     ^   ^  ^
              |_______|_____|___|__|_> Any symbols before first `?` **OR**
                      |_____|___|__|_> Any symbols before first `.` **OR**
                            |___|__|_> Any symbols before first `!`
                                |__|_> Any symbols
                                   |_> Print all found symbols in first pair of brackets

My solution will find:

"(Any symbols before the first ? found, or any symbols before the first . found, or any symbols before the first ! found), then any symbols after that --> print the symbols matched inside the parentheses".

alseether
Viktor Khilin
  • Can you please tell me what is wrong in my solution? – humble Jan 10 '18 at 13:16
  • @rjmessibarca in your solution sed uses "basic" regex, not extended (which is what the `-r` key enables in my case). Your sed just can't find a match, so it does nothing. Also, if you add the `-r` key to your solution, it will delete the first `?`, `!`, or `.` together with all the symbols after it, so the terminator itself is not printed. – Viktor Khilin Jan 10 '18 at 13:20
  • @rjmessibarca nope – Viktor Khilin Jan 10 '18 at 13:30
  • An explanation would help a lot. – humble Jan 10 '18 at 13:31
  • My solution will find: `(`"Any symbols before first `?` found or Any symbols before first `.` found or Any symbols before first `!` found`)`, any symbols after that. --> print the symbols found in brackets". Your solution will: "Find **all** symbols with `?` at the **end** of the string, or with `.` at the end of the string, or `!` at the end of the string, and print the match found in brackets." – Viktor Khilin Jan 10 '18 at 13:36
  • @ViktorKhilin it is preferable to edit this explanation into your post rather than explain it in comments. It's also more readable. – alseether Jan 10 '18 at 13:43
  • @ViktorKhilin that's the hard way ;) – alseether Jan 10 '18 at 13:56
  • I don't think the bracket expressions do what you think they do. What is the idea of `[.*\?]`? This matches any one character of `.`, `*`, ``\`` or `?`. And alternation between character classes doesn't make sense either. The whole expression `([.*\?]|[.*\.]|[.*!])` is equivalent to `[.*\?!]`. – Benjamin W. Jan 10 '18 at 15:02
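
To make the basic vs. extended regex point from the comments concrete, here is a small sketch that runs the question's pattern three ways on the sample sentence. It assumes a sed that accepts `-E` for extended regex (GNU sed also accepts `-r`); the variable name `line` is only for illustration.

```shell
line='Stack overflow is the best? I am also the best.'

# Basic regex (no -E): ( and | are literal characters here,
# so the pattern finds nothing and the line is printed unchanged.
printf '%s\n' "$line" | sed 's/(\?|\.|!).*//'

# Extended regex: the pattern now matches, but the terminator is part of
# the match and is deleted together with the rest of the line.
printf '%s\n' "$line" | sed -E 's/(\?|\.|!).*//'

# Capturing the terminator and writing it back with \1 keeps it.
printf '%s\n' "$line" | sed -E 's/(\?|\.|!).*/\1/'
```

Only the last command should leave the question mark in place, which is the output the question asks for.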
1

I think you're not matching the beginning of the line. My solution is:

^.*?[.?!]

Which means:

  • ^ : The match must be at the beginning of the line
  • .*? : any number of characters (non-greedy, i.e. as few as possible)
  • [.?!] : match one of the characters inside []

Working example here

Note that this solution works in Python. I think sed has no non-greedy (lazy) quantifiers.
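
Since sed has no lazy quantifier, one common way to get the same "stop at the first terminator" effect is a negated bracket expression. The following is a minimal sketch, assuming a sed that supports `-E`; the function name `first_sentence` is hypothetical.

```shell
# Hypothetical helper: keep everything up to and including the first
# ., ? or ! on each line, and drop the rest of that line.
first_sentence() {
  # [^.?!]* cannot run past a terminator, so it plays the role of the lazy .*?
  sed -E 's/^([^.?!]*[.?!]).*/\1/' "$@"
}

printf '%s\n' 'Stack overflow is the best? I am also the best.' | first_sentence
```

Lines that contain no terminator at all are passed through unchanged, because the substitution simply does not match.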

alseether
  • It's `sed`, `([.?!]).*` – revo Jan 10 '18 at 13:18
  • ```$ cat filename Stack overflow is the best? I am also the best. [sahaquiel@sahaquiel-PC Stackoverflow]$ sed '^.*?[.?!]' filename sed: -e expression #1, char 1: unknown command: `^'``` You sure it works? (off.: how can I place line break in comments? :( ) – Viktor Khilin Jan 10 '18 at 13:25
  • @ViktorKhilin You can't. PS: Thanks for pointing that out. I tried the solution and it didn't work with sed. Non-greedy quantifiers are not available. – alseether Jan 10 '18 at 13:37
1

If your input file consists of just one line, you can use

$ grep -o '^[^.!?]*[.!?]' <<< 'Stack overflow is the best? I am also the best.'
Stack overflow is the best?

If there are multiple lines and your first sentence might be across multiple lines, you can use -z with GNU grep to treat the file as a single line:

$ grep -zo '^[^.!?]*[.!?]' <<< $'Stack overflow\nis the best? I am also the best.'
Stack overflow
is the best?

The regex consists of these components:

  • ^ anchor to start of line
  • [^.!?]* zero or more characters other than ., ! or ?
  • [.!?] one of ., ! or ?
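
The pattern above stops at the first `.`, so a sentence starting with "Mr. Smith" would be cut after "Mr." (the case mentioned in the question's edit). One possible workaround, sketched here with sed, is to hide the dots of known abbreviations behind a placeholder before cutting. The function name, the abbreviation list (Mr, Mrs, Ms, Dr) and the choice of the control character \001 as placeholder are all assumptions for illustration, and the sketch assumes \001 never occurs in the input.

```shell
# Hypothetical helper: first sentence of each line, ignoring the dot
# after a small, hand-picked list of abbreviations.
first_sentence_abbrev() {
  ph=$(printf '\001')   # placeholder character, assumed absent from the input
  sed -E \
    -e "s/(^|[^[:alnum:]])(Mr|Mrs|Ms|Dr)\./\1\2${ph}/g" \
    -e 's/^([^.!?]*[.!?]).*/\1/' \
    -e "s/${ph}/./g" "$@"
}

printf '%s\n' 'Mr. Smith is the best? I am also the best.' | first_sentence_abbrev
```

This only covers abbreviations that are listed explicitly; anything else ending in a dot is still treated as the end of the sentence.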
Benjamin W.
0

If your actual Input_file is the same as the example shown, the following sed may help.

sed 's/\([?.!]\).*/\1/'   Input_file

Output will be as follows.

Stack overflow is the best?
RavinderSingh13