How to delete all text on a line appearing after a particular symbol?

Question

I have a file, file1.txt, like this:

This is some text.
This is some more text. ② This is a note.
This is yet some more text.

I need to delete any text appearing after "②", including the "②" and any single space appearing immediately before, if such a space is present. E.g., the above file would become file2.txt:

This is some text.
This is some more text.
This is yet some more text.

How can I delete the "②", anything coming after, and any preceding single space?

The solutions at "How can I remove all text after a character in bash?" do not seem to work, perhaps because "②" is not an ordinary character.
The file is saved in UTF-8.

What OS? What encoding of your text? – yazu Apr 18 '12 at 12:08

score 3 · Answer 1 · answered Apr 19 '12 at 09:12

A Perl solution:

$ perl -CS -i~ -p -E's/ ?②.*//' file1.txt

You'll end up with the correct data in file1.txt and a backup of the original file in file1.txt~.

Dave Cross

pizza · Answer 2 · 2012-04-18T13:23:48.337

I hope you realize that most Unix utilities do not work well with Unicode. I assume your input is in UTF-8; if not, you will have to adjust accordingly.

#!/bin/bash
function px {
 local a="$@"
 local i=0
 while [ $i -lt ${#a}  ]
  do
   printf \\x${a:$i:2}
   i=$(($i+2))
  done
}
(iconv -f UTF8 -t UTF16 | od -x |  cut -b 9- | xargs -n 1) |
if read utf16header
then
 echo -e $utf16header
 out=''
 while read line
  do
   if [ "$line" == "000a" ]
    then
     out="$out $line"
     echo -e $out
     out=''
   else
    out="$out $line"
   fi
  done
 if [ "$out" != '' ] ; then
   echo -e $out
 fi
fi |
 (perl -pe 's/( 0020)* 2461 .*$/ 000a/;s/ *//g') |
 while read line
  do
    px $line
  done | (iconv -f UTF16 -t UTF8 )
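
For comparison, here is a much shorter sketch that stays in plain shell. It assumes the input is UTF-8 and that the script itself is saved as UTF-8. It does not go through UTF-16 as the script above does; it relies on parameter expansion instead. strip_note is a hypothetical helper name, not something from the answer above.

```shell
# Sketch: strip "②" and everything after it, plus one preceding space.
strip_note() {
  # The "|| [ -n "$line" ]" part also handles a last line with no newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      *②*)
        line=${line%%②*}   # drop from the first ② to the end of the line
        line=${line% }     # drop a single trailing space, if present
        ;;
    esac
    printf '%s\n' "$line"
  done
}

# Usage: strip_note < file1.txt > file2.txt
```

Reading line by line in the shell is slow on large files, so this is probably only worth it when no other tool is available.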

score 1 · Answer 3 · answered Apr 18 '12 at 09:54

sed -e "s/[[:space:]]②[^\.]*\.//"

However, I am not sure that the ② symbol is parsed correctly. Maybe you have to use UTF-8 codes or something like that.
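
One possible reading of "use UTF-8 codes" is to build the symbol from its bytes and have sed work on bytes. This is only a sketch under the assumption that the file is UTF-8, where ② (U+2461) is encoded as E2 91 A1. The names marker and strip_after_marker are made up for illustration, and the pattern deletes up to the end of the line rather than up to the next period.

```shell
# ② is U+2461; encoded as UTF-8 it is the three bytes E2 91 A1.
# Octal escapes are used because POSIX printf does not guarantee \x.
marker=$(printf '\342\221\241')

# strip_after_marker is a hypothetical helper name.
strip_after_marker() {
  # LC_ALL=C makes sed treat the input as bytes, so "." matches any byte.
  # " \{0,1\}" matches zero or one space before the marker.
  LC_ALL=C sed "s/ \{0,1\}${marker}.*//" "$@"
}

# Usage: strip_after_marker file1.txt > file2.txt
```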

Matthias

It does not seem to remove any text, even when I tried with a simpler symbol, such as a letter. – Village Apr 18 '12 at 11:58
It worked for me. Which platform do you use? Maybe your sed needs an option for a specific syntax. – Matthias Apr 18 '12 at 12:03
I have `GNU sed version 4.2.1`. – Village Apr 18 '12 at 12:06
Try adding -r (on some platforms: -E) to switch on extended regex syntax. – Matthias Apr 18 '12 at 12:22
According to the documentation, `-r` should switch on extended regex, but it still does not work. – Village Apr 18 '12 at 12:33
But you apply the command to your file, or pipe the content to sed, don't you? – Matthias Apr 18 '12 at 13:08
I used `sed -r -e "s/[[:space:]]②[^\.]*\.//" file.txt`. The output shows no change. – Village Apr 18 '12 at 13:28
I tested with `sed -e "s/[[:space:]]#[^\.]*\.//" file.txt` and an accordingly adapted file `file.txt`. Worked. Regarding UTF8 (I guess you mean it, not UST-8), see the answer by pizza. – Matthias Apr 18 '12 at 13:33

score 1 · Answer 4 · answered Apr 20 '12 at 01:55

Try this:

sed -e '/②/ s/[ ]*②.*$//'

/②/ restricts the substitution to lines containing the magic symbol;
[ ]* matches any number of spaces (including none) before the magic symbol;
.*$ matches everything else up to the end of the line.
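
The command can be tried on the sample from the question without touching a real file. The sketch below assumes UTF-8 input. The scratch directory and the second pattern are additions for illustration: ` \{0,1\}` removes at most one space, which is how the question is worded, whereas `[ ]*` removes all of them.

```shell
# Work in a scratch directory so no real file is touched.
dir=$(mktemp -d)
printf '%s\n' \
  'This is some text.' \
  'This is some more text. ② This is a note.' \
  'This is yet some more text.' > "$dir/file1.txt"

# The command from this answer: removes every space before the symbol.
sed -e '/②/ s/[ ]*②.*$//' "$dir/file1.txt" > "$dir/file2.txt"

# Variant (an assumption about the asker's intent): remove at most one space.
sed -e '/②/ s/ \{0,1\}②.*$//' "$dir/file1.txt" > "$dir/file3.txt"

# Show the result of the first command.
cat "$dir/file2.txt"
```

On this sample both commands give the same three lines; they differ only when more than one space precedes the symbol.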

vyegorov
