I'm trying to get the first sentence inside a <p> tag. I consider a sentence to end at the first "final" dot, i.e. a dot followed by a space and an uppercase letter ("dot space uppercase"), so that abbreviations are skipped.

echo "<p>this will def. fail. So. Sad.</p>" | sed -r -e "s/<p>(([^\.]*\. [^A-Z])*[^\.]*\.) [A-Z]/\1/g"

The expected result is `this will def. fail.`, which I try to capture with `\1`.

It works on regex101 but returns `this will def. fail.o. Sad.</p>` when used with sed in my terminal.

LogicalKip
  • Check `echo "<p>this will def. fail. So. Sad.</p>" | sed -r -e "s/^<p>([^.]*[.]).*$/\\1/g"`. Or `sed -r -e "s/^<p>([^.]*[.])( .*$|$)/\\1/g"` if there must be a space after the first dot. – Wiktor Stribiżew Dec 11 '15 at 13:16
  • Both only return `this will def.` – LogicalKip Dec 11 '15 at 13:22
  • And what is the output you expect? – Wiktor Stribiżew Dec 11 '15 at 13:28
  • `... So. Sad.` Yes, well it has been said only about 10000x now that `sed` is not the appropriate tool to process XML like data. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for detail ;-) Good luck to all. – shellter Dec 11 '15 at 13:36
  • @stribizhev Do you understand what `g` does, if so why are you using it? – 123 Dec 11 '15 at 13:46
  • @123: Because I just pasted what OP had. It does not matter here anyway. I am not going to answer since I voted to close the question as unclear. – Wiktor Stribiżew Dec 11 '15 at 13:47
  • @stribizhev i know it doesn't that was my point – 123 Dec 11 '15 at 13:48
  • `this will def. fail.` is the expected output. And this is really not about XML, the p tag is only here because it's in my real issue too. I guess I should have removed it, sorry. – LogicalKip Dec 11 '15 at 13:50
  • If you're working with (X)HTML elsewhere, the tool/language that you're using hopefully has better support for solving this part of your problem too. Is it absolutely necessary that you use sed here? – Tom Fenech Dec 11 '15 at 13:56
  • By the way, if you're adding more detail to your question, you should do so by editing, rather than just in the comments. – Tom Fenech Dec 11 '15 at 13:57
  • @tom I don't have to use any special tools in particular. It comes from a bash script. I simply grepped a line containing the p tag, which I don't want to appear in the final output, that's all. And if I was to use any XML parser or other tool, even though they are certainly great in processing nested tags, attributes, etc, I'm not quite sure they could help with the main problem here, which is about sentences, dots and uppercase. Tell me if I'm wrong. – LogicalKip Dec 11 '15 at 14:16
  • On what platform are you? and what's your sed version? I managed to find the essence of the problem, at least, on my platform: http://stackoverflow.com/questions/34225675/weird-character-range-behaviour-with-locales-in-sed-regex – Karoly Horvath Dec 11 '15 at 14:34
  • @karoly gnu sed 4.2.2, on linux mint – LogicalKip Dec 11 '15 at 15:02

1 Answer

You need `.*` at the end of the pattern to grab the rest of the line; otherwise sed leaves the unmatched remainder in the output:

echo "<p>this will def. fail. So. Sad.</p>" |
   LANG=C sed -r -e "s/<p>(([^\.]*\. [^A-Z])*[^\.]*\.) [A-Z].*/\1/g"
#  ^ huh?                                                   ^^
this will def. fail.

That was the small issue.
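To make the effect of the trailing `.*` concrete, here is a minimal sketch that runs the question's input and regex with and without it. It assumes a sed that accepts `-E` (the portable spelling of `-r`) and forces the C locale so that `[A-Z]` means ASCII uppercase only; the variable names are just illustrative.

```shell
# The question's input and regex, unchanged.
input='<p>this will def. fail. So. Sad.</p>'
re='<p>(([^\.]*\. [^A-Z])*[^\.]*\.) [A-Z]'

# Without a trailing .*, the match stops after " S"; the unmatched "o. Sad.</p>" is kept.
without_tail=$(printf '%s\n' "$input" | LC_ALL=C sed -E "s/$re/\1/")

# With .*, the match runs to the end of the line, so only group 1 survives.
with_tail=$(printf '%s\n' "$input" | LC_ALL=C sed -E "s/$re.*/\1/")

printf '%s\n' "$without_tail"   # this will def. fail.o. Sad.</p>
printf '%s\n' "$with_tail"      # this will def. fail.
```

The first result is exactly the output reported in the question: the substitution replaces only the matched text, and everything after it passes through untouched.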

What gave me headaches was that, even after that fix, it still didn't work. It took some investigation to figure out that on my platform I need to set the locale (hence the `LANG=C`). I guess you have the same problem.
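One way to sidestep the locale question, offered here as an alternative sketch rather than what this answer uses, is to replace the `A-Z` range with the POSIX class `[[:upper:]]`, which is defined by character type rather than by collation order. The function name `first_sentence` is hypothetical, and `-E` is assumed to be accepted as the portable spelling of `-r`.

```shell
# Hypothetical helper: print the first sentence inside a <p> tag.
# A sentence is assumed to end at the first "dot, space, uppercase letter".
first_sentence() {
    printf '%s\n' "$1" |
        sed -E 's/<p>(([^.]*\. [^[:upper:]])*[^.]*\.) [[:upper:]].*/\1/'
}

first_sentence '<p>this will def. fail. So. Sad.</p>'   # this will def. fail.
```

If the line contains no "dot, space, uppercase" boundary at all, the pattern does not match and sed prints the line unchanged, tag included.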

Karoly Horvath
  • This works, although even without the `LANG=C`. When you get more information on your problem, care to share a bit? (Such as, what do the locales have to do with anything?) – LogicalKip Dec 11 '15 at 17:47