1

I am trying to find a regular expression which would allow me to delete the entire content of a file if a particular string matches.

As an example, my file contents are:

This is the first line
Here is password=SECRET second line
Here is third line

I am searching for the pattern password= and, when it matches, ALL lines should be removed from the file above.

The command below removes the entire line matching the pattern, but I can't figure out a regular expression for removing the entire content:

cat test.txt | sed 's|^.*password=.*||' 

I understand sed works line by line and unless I use additional options in sed, I probably do not have a way to delete the entire content.

The reason I am only interested in a regular expression is that I am using another tool which takes regular expressions as input to perform transformations. I use sed here only as an example to illustrate what I understand so far.

Wiktor Stribiżew
joshm
  • What's with second last `|` ? – Rahul Jun 08 '17 at 13:17
  • @Rahul empty replacement string (match pattern, replace with empty string) – Aaron Jun 08 '17 at 13:18
  • @Aaron is right – joshm Jun 08 '17 at 13:20
  • We need more info on your regex engine, because your regex is basically correct. What you need is to either specify through a flag that `.` should match linefeeds, or to instead use a character class that will match everything `.` matches plus linefeeds. Edit : the anchor is fine contrary to what I previously said – Aaron Jun 08 '17 at 13:27
  • @Aaron, the regex engine presumably runs globally and scans for the string pattern in all lines until the EOF. The tool name is BFG (used to remove sensitive data from Git repository). https://stackoverflow.com/questions/4110652/how-to-substitute-text-from-files-in-git-history – joshm Jun 08 '17 at 13:39
  • @joshm Java regex then, looking at their requirement ; you should see how you can provide a pattern parameter, or use the shorthand flag notation. I will add an answer detailing the solution. Rahul's solution would also work, but isn't best practice – Aaron Jun 08 '17 at 13:42
  • Please check [my solution](https://stackoverflow.com/a/59049106/3832970) that you may use in `sed`. – Wiktor Stribiżew Jun 04 '21 at 12:05

4 Answers

1

This is tagged as 'sed', but on the surface, sed is not the right tool for this task. grep and bash will make the task simpler. As per the OP, the requirement is to express the condition with a regexp, which grep can do.

With grep, there is no need to read the complete file: -q stops at the first match. For a single file:

grep -q 'password=' "$file" && true > "$file"

For multiple files:

for file in $(grep -l 'password=' *.txt) ; do
    true > "$file"
done

The construct 'true > file' truncates 'file' to 0 bytes, the same as cp /dev/null file, but it is usually handled inside the shell, with no additional process to fork.
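
As a rough sketch of the same idea that also copes with file names containing spaces, the check and the truncation can be done per file inside a loop over the glob. The function name truncate_matching and the demo file names are made up for illustration; the pattern is whatever regexp you would pass to grep:

```shell
# Hypothetical helper: empty every listed file whose content matches a grep pattern.
truncate_matching() {
    pattern=$1
    shift
    for file in "$@"; do
        # -q: report only via exit status; -e: the next word is the pattern;
        # --: end of options, so odd file names are not parsed as flags.
        if [ -f "$file" ] && grep -q -e "$pattern" -- "$file"; then
            : > "$file"    # redirecting the no-op builtin truncates to 0 bytes
        fi
    done
}

# Demo in a throwaway directory (assumes mktemp is available).
demo_dir=$(mktemp -d)
printf 'first line\npassword=SECRET\nthird line\n' > "$demo_dir/has secret.txt"
printf 'nothing sensitive here\n' > "$demo_dir/clean.txt"

truncate_matching 'password=' "$demo_dir"/*.txt

for f in "$demo_dir"/*.txt; do
    printf '%s: %s bytes\n' "$f" "$(wc -c < "$f")"
done
rm -rf "$demo_dir"
```

Because the glob is expanded by the shell and each name is quoted, nothing is word-split on the way; the cost is one grep process per file instead of a single grep -l.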

dash-o
  • AFAIK you can remove `true` and have your scripts behave in the same way. `truncate -s 0 "$file"` is another alternative that might be more explicit in its meaning – Aaron Nov 26 '19 at 15:43
  • Good Point. It will be more explicit to use truncate. I personally like the 'true' construct :-) – dash-o Nov 26 '19 at 16:38
  • Very nice solution, but the second could be problematic when funny filenames are involved (spaces, newlines) – kvantour Apr 07 '20 at 08:20
  • The following is a safer solution: `grep -lZ 'password=' *.txt | xargs -0 -I{} sh -c 'true > {}'` – kvantour Apr 07 '20 at 08:31
0

You may read all text from a file into memory with the well-known 1h;2,$H;$!d;g construct (be cautious with very large files!) and then run a simple .*<YOUR_PATTERN>.* pattern within a substitution command:

sed -e '1h;2,$H;$!d;g' -e 's/.*password=.*//' file > tmp && mv tmp file

Or, you may append lines to the pattern space one by one until the pattern matches, then read in the rest of the input and delete the whole pattern space so that nothing is printed (a file without a match is output unchanged; GNU sed syntax):

sed ':a;/password=/{:b;$!{N;bb};d};$!{N;ba}' file > tmp && mv tmp file

See sed online demo:

s=$'This is the first line\nHere is password=SECRET second line\nHere is third line'
res="Result: '$(sed -e '1h;2,$H;$!d;g' -e 's/.*password=.*//' <<< "$s")'"
echo "$res"
# => Result: ''
res3="Result: '$(sed ':a;/password=/{:b;$!{N;bb};d};$!{N;ba}' <<< "$s")'"
echo "$res3"
# => Result: ''
Wiktor Stribiżew
-1

You said it should delete the entire content, but does .* match the entire content? In most engines, . does not match line breaks by default.

I think you should use [\s\S] instead of .

Regex: ^[\s\S]*?password=[\s\S]*

Regex101 Demo

Rahul
  • Good answer if the regex is run with JS, not so good otherwise. I would have waited for precision on the regex engine used. – Aaron Jun 08 '17 at 13:29
  • @Rahul, your regex does seem to match entire content, so that is exactly what I want. But when I try to test it out using sed (as below), the replacement doesn't seem to be happening (below command outputs the entire file content): cat test.txt | sed 's|^[\s\S]*?password=[\s\S]*||' – joshm Jun 08 '17 at 13:37
  • @joshm: Shouldn't you be using `/` instead of `|` ? – Rahul Jun 08 '17 at 13:43
  • That's just a delimiter of my choice to make the command more legible. I did try it with / as well, same result. – joshm Jun 08 '17 at 13:48
  • @joshm by default `sed` uses BRE, which does not define the `\s` and `\S` shorthands. Switching to `ERE` with the `-r` (GNU) or `-E` (BSD, recent GNU) flag might make it kinda work (it would replace the whole line containing `password=`). To make it really work, you'd need to use `N` to load additional lines in the pattern space. For a dirty test, you can prefix your `s` command with as much `N;` as there are lines in your input file, for a cleaner test just don't use `sed` (although I guess you could technically make a loop to `N` until the whole file has been read). – Aaron Jun 08 '17 at 14:48
-1

Note that this answer is based on the OP's comments, where he discloses that he is only using sed to test his regex and that his final solution uses BFG. This tool uses Java regexes, so testing the solution with sed makes little sense, which is why my solution doesn't match the tags of the question.


The documentation of the tool you use is lacking; I couldn't find whether there is a way to specify a regex flag separately from the regex itself.

If you find such a way, you should aim to specify the use of Pattern.DOTALL, which will make . match linefeeds.

If you don't, you can enable DOTALL mode from inside the regex pattern by using its shorthand (?s), which applies to the rest of the pattern:

(?s)^.*password=.*

I've tested it on ideone, feel free to adapt the code to make sure it works for you.

You won't be able to test this with sed; the line-by-line problem could be avoided by loading the whole file into the pattern space (which would be a bad idea in itself), but (GNU?) sed only accepts BRE and ERE regexes, which do not implement a DOTALL flag.

To test it on individual files, regex101 will do; to test it on a whole git repo, I'd just clone it and run the target tool rather than a substitute command.

Aaron