removing prepositions from a text file in linux

Question

I want to remove all prepositions from a text file on CentOS, words like 'on of to the in at ....'. Here is my script:

!/bin/bash
list='i me my myself we our ours ourselves you your yours yourself ..... '
cat Hamlet.txt | for item in $list
do
sed 's/$item//g' 
done > newHam.txt

But when I open newHam.txt, nothing has changed; it is the same as Hamlet.txt. I don't know whether this is a good approach or not. Any suggestions?

Pretty sure this is a duplicate, but could not quickly find a good one. Cross-site duplicate: https://unix.stackexchange.com/questions/322310/how-to-delete-all-occurrences-of-a-list-of-words-from-a-text-file — tripleee, Jan 06 '19 at 07:34
The immediate problem is your use of single quotes instead of double; but you can't pipe a single file into a loop and expect each iteration of the loop to receive the entire file as input. — tripleee, Jan 06 '19 at 08:28
[Replace multiple strings with different set of mapped strings](https://unix.stackexchange.com/q/404313/56041), [How can I use variables in the LHS and RHS of a sed substitution?](https://unix.stackexchange.com/q/69112/56041), etc. — jww, Jan 07 '19 at 06:17

tripleee · Accepted Answer · 2019-01-06T11:23:16.673

1

Assuming your sed understands \< and \> for word boundaries,

sed 's/\<\(i\|me\|my\|myself\|we\|our\|ours\|ourselves\|you\|your\|yours\|yourself\)\> \?//g' Hamlet.txt >newHam.txt

You want to make sure you include word boundaries; your original attempt would replace e.g. i everywhere n the nput.
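
To see the difference the word boundaries make, here is a small sketch, assuming GNU sed (the `\<`, `\>` and `\?` extensions are not guaranteed elsewhere). The function names `strip_naive` and `strip_word` are made up for this illustration.

```shell
# Hypothetical helpers for illustration; both read text on standard input.
strip_naive() { sed 's/i//g'; }          # no word boundaries: removes every "i"
strip_word()  { sed 's/\<i\> \?//g'; }   # whole word "i" plus one optional trailing space

printf '%s\n' 'i think this is it' | strip_naive   # prints " thnk ths s t"
printf '%s\n' 'i think this is it' | strip_word    # prints "think this is it"
```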

If you already have the words in a string, you can interpolate it in Bash with

sed "s/\\<\\(${list// /\\|}\\)\\> \\?//g" Hamlet.txt >newHam.txt

but the ${variable//pattern/substitution} parameter expansion is not portable to e.g. /bin/sh. Notice also how double quotes instead of single are necessary for the shell to be allowed to perform variable substitutions within the script, and how all literal backslashes need to be escaped with another backslash within double quotes.
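
Where the Bash-only expansion is not available, one possible workaround is to let sed build the alternation itself. This is only a sketch: `build_script` is a hypothetical helper name, the word list is assumed to contain plain words separated by single spaces, and the resulting regex still relies on GNU sed extensions.

```shell
#!/bin/sh
# Hypothetical helper: turn "a b c" into the sed script s/\<\(a\|b\|c\)\> \?//g
build_script() {
    # Replace each space with \| without using ${var//pattern/replacement}.
    alternation=$(printf '%s\n' "$1" | sed 's/ /\\|/g')
    # Inside double quotes, every literal backslash is written as \\.
    printf '%s\n' "s/\\<\\(${alternation}\\)\\> \\?//g"
}

build_script 'i me my myself'
# prints: s/\<\(i\|me\|my\|myself\)\> \?//g

printf '%s\n' 'give it to me and my dog' | sed "$(build_script 'i me my myself')"
# prints: give it to and dog
```

The generated script could then be applied to the real input with something like `sed "$(build_script "$list")" Hamlet.txt >newHam.txt`.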

Unfortunately, many details of sed are poorly standardized. Ironically, switching to a tool which isn't standard at all might be the most portable solution.

perl -pe 'BEGIN {
    @list = qw(i me my myself we our ours ourselves you your yours yourself .....);
    $re = join("|", @list); }
    s/\b($re)\b ?//go' Hamlet.txt >newHam.txt

If you want this as a standalone script,

#!/usr/bin/perl

BEGIN {
    @list = qw(i me my myself we our ours ourselves you your yours yourself .....);
    $re = join("|", @list);
}
while (<>) {
    s/\b($re)\b ?//go;
    print
}

These words are pronouns, not prepositions.

Finally, take care to fix the shebang of your script; the first line of the script needs to start with exactly the two characters #! because that's what makes it a shebang. You'll also want to avoid the useless cat in the future.
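
For comparison, here is one way the original loop might be repaired. It is a sketch rather than the recommended solution: the loop collects one substitution per word into a single sed script instead of trying to read the file on every iteration, and sed then reads the input once. The name `remove_words` is hypothetical, the words are assumed to contain no `/` or regex metacharacters, and GNU sed is assumed for `\<`, `\>` and `\?`.

```shell
#!/bin/sh
# Hypothetical helper: $1 is a space-separated word list; text arrives on standard input.
remove_words() {
    script=
    # $1 is deliberately unquoted so the shell splits it into words.
    for item in $1; do
        # Double quotes let $item expand; literal backslashes are doubled.
        script="${script}s/\\<${item}\\> \\?//g
"
    done
    # One sed process handles the whole input with all substitutions.
    sed "$script"
}

printf '%s\n' 'you and i saw my dog' | remove_words 'i me my you'
# prints: and saw dog
```

With the original file names, the call would be `remove_words "$list" <Hamlet.txt >newHam.txt`, which also avoids the extra `cat`.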

edited Jan 06 '19 at 11:23

answered Jan 06 '19 at 07:31

tripleee


This will work, but you hardcode the replacements into the operation itself, which is not considered good practice in programming. – Romeo Ninov Jan 06 '19 at 07:37
It's not hard to generate this script from a file of keywords or whatever, but you didn't want to see any complexity. – tripleee Jan 06 '19 at 07:40
I want to make the things readable. Currently computing power is so cheap so several milliseconds more do not count – Romeo Ninov Jan 06 '19 at 07:44
Right, but now you also have to understand how those cycles are used. – tripleee Jan 06 '19 at 08:08
Perfect! It worked! Could you please explain a bit about this: `sed "s/\\<\\(${list// /\\|}\\)\\> \\?//g" Hamlet.txt >newHam.txt`? I don't know exactly what the pattern means. And about the Perl script: how can I use it, and what should the shebang line be? – Reza Jan 06 '19 at 11:17
The parameter substitution is already explained; it produces the value of `$list` with every occurrence of a space replaced with `\|` (with the backslash doubled for the shell). – tripleee Jan 06 '19 at 11:19
The regex ends up being `\<\(list\|of\|words\)\>` just like in the first, hardcoded version. – tripleee Jan 06 '19 at 11:20
If you have trouble understanding the other parts, please ask a more specific question. – tripleee Jan 06 '19 at 11:20
The Perl command is a shell command just like the `sed` command. If you put it in a script file, the shebang should be `#!/bin/bash`, though of course if this is all you need, you can use `#!/usr/bin/perl -p` with some trivial modifications. See update now. – tripleee Jan 06 '19 at 11:21
Thanks for the reply. All I know about the sed format is `sed 's/a/b/g'`, so the format above is a bit complex for me. What I understand is that it substitutes all elements in the list with a space. Am I right? So why do you use \\ after s/ and then write two backslashes again after the list? More specifically, what do all these slashes and backslashes mean exactly? What is that question mark before g, and what is the | mark? (I know why you used two backslashes instead of one.) – Reza Jan 06 '19 at 11:52
`\|` is "or", `\<` is a left word boundary, `\>` is a right word boundary, `\?` after an expression marks it as optional, and `\(` and `\)` are used to group expressions. The backslashes are a bit of a wart; as you can see, the regex dialect of Perl doesn't use backslashes for these constructs (but it has a lot of other backslash sequences and other extensions which are not supported in `sed` at all). – tripleee Jan 06 '19 at 12:35
Comments are not a good place for a full-length regex tutorial; there are many online resources like https://regular-expressions.info/ and https://regex101.com/ where you can learn more (though the latter doesn't support any common `sed` regex dialect). – tripleee Jan 06 '19 at 12:37
