
I'm currently working on a bash script that applies a list of regexes to a file of links in order to clean it up. Currently I do all of this manually in Kate with find/replace, but having it as a script would be more comfortable. Since I'm fairly new to bash scripting, I'm asking for help.

Example list of urls:

0: "/suburl0"
​
1: "/suburl1"
​
2: "/suburl2"
​
3: "/suburl3"
​
4: "/suburl4"

The script I currently have:

#!/bin/bash
awk '[^\x00-\x7F]+' $1 #there are non-ascii chars in the file, so clean it out
awk 'NF' $1 # remove non-character lines
awk '^[0-900]{0,3}: ' $1 #delete all those numbers in front of the link
awk '"' $1 # remove those quotation marks
awk '!seen[$0]++' $1 #remove duplicate lines
awk '{print "http://example.com/" $0}' $1 #prepend the full url to the suburl

The goal is to apply all those regexes to the file, so that the file ends up cleaned.

My guess is that I'm not redirecting the output of awk correctly; when I tried to redirect it back into the file, the file ended up containing only empty lines.
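For reference (not part of the original post): the empty file is the classic self-redirection trap — `awk ... "$1" > "$1"` truncates the file before awk gets to read it. One way around it, assuming the individual stages themselves are fixed, is to pipe the stages into each other and only write back to the input file at the very end, via a temporary file. A minimal sketch using just the stages that are already valid awk:

```shell
#!/bin/bash
# Hypothetical sketch: chain the stages with pipes so each one reads the
# previous one's output; only the first stage reads the file, and the
# input file is replaced only after the whole pipeline has finished.
tmp=$(mktemp)
awk 'NF' "$1" \
    | awk '!seen[$0]++' \
    | awk '{print "http://example.com" $0}' > "$tmp"
mv "$tmp" "$1"
```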

Lukas S
  • Each `awk` invocation produces a modified output, but leaves the input file untouched. You have multiple solutions: 1) redirect the output of each `awk` invocation to a file, and have the next invocation work on that file; 2) pipe the output of each `awk` into the following `awk` invocation and do not provide them a file input: they'll work on their standard input, populated by the previous one's output. Of course the first must still take the file as input, and the last's output can be redirected to a file; 3) use a single `awk` invocation that will do all the actions. – Aaron Nov 27 '19 at 13:35
  • Note that most of your `awk` commands aren't correct either. You might want to test your commands one at a time on your input file and check whether they produce the expected result. – Aaron Nov 27 '19 at 13:38
  • Could you please post a sample of your input and expected output in your question and let us know; please make sure you wrap your samples/code in code tags. – RavinderSingh13 Nov 27 '19 at 14:04
  • Your awk scripts don't do what the comments next to them suggest you think they do. – Arkku Nov 27 '19 at 14:22
  • @Aaron when I run them separately, an error occurs: `awk '{print [^\x00-\x7F]+/}' testfile` gives "backslash not last character on line". The syntax of the regex should be correct since it works in Kate without a problem. @RavinderSingh13 as I mentioned, the input is the lines in the file above, for example `0: "/suburl0"` and `1: "/suburl1"`; the output should be `http://example.com/suburl0` and `http://example.com/suburl1`. @Arkku as I mentioned, I'm fairly new to shell scripting. Doing those regexes manually in Kate works. – Lukas S Nov 27 '19 at 14:55
  • @LukasS, if you could simply add a sample of the input and the expected output to your question — trust me, I am pretty sure this could be done in a single `awk` itself. Kindly update your question and let us know, as it is still not clear. – RavinderSingh13 Nov 27 '19 at 15:00
  • `awk` is complaining because it doesn't understand the syntax you're using. `[^\x00-\x7F]+` is a valid regex and will work as well in awk as in Kate, but the rest is little more than gibberish. For starters, `print` doesn't take regexes as arguments. Maybe you want to use `gsub`; you could check [this question](https://stackoverflow.com/questions/14432209/substitute-a-regex-pattern-using-awk), but I feel like you ought to check an `awk` tutorial before trying to use it. – Aaron Nov 27 '19 at 15:00
  • And if you're simply trying to do regex search/replace, you might find `sed` easier to use. For instance `sed -E 's/regex/replacement/g' file` will replace all the occurrences matching the regex with "replacement" in a file. – Aaron Nov 27 '19 at 15:07
  • Can you extend the sample input with a few additional cases? The code implies there is a lot of cleanup, but the input does not look like that. When there is bad non-ASCII input, do you want to drop the line, or drop the non-ASCII characters? – dash-o Nov 27 '19 at 17:41
  • As per my comments - please post sample output. It is unclear what the goal is, and the posted script is not helping much. – dash-o Nov 29 '19 at 09:08
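Aaron's third suggestion above — a single `awk` invocation doing all the steps — could look something like this. This is a sketch based on the sample input in the question, assuming the base URL is `http://example.com` and the index prefix always has the form `N: `:

```shell
# Hypothetical single-awk version of all the cleanup steps.
awk '{
    gsub(/[^ -~]/, "")          # drop anything outside printable ASCII
    gsub(/"/, "")               # remove the quotation marks
    sub(/^[0-9]+: /, "")        # strip the leading "N: " index
    if (NF && !seen[$0]++)      # skip now-empty lines and duplicates
        print "http://example.com" $0
}' "$1"
```

Unlike the `sort -u` approaches below, this also preserves the original line order.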

1 Answer


A more-or-less translation of what you wanted, without restricting to awk:

cat "$1" \
        | tr -cd '[:print:][:space:]' \
        | grep . \
        | sed -r 's/^[0-9]{1,3}: //' \
        | tr -d '"' \
        | sort -u \
        | awk '{print "http://example.com" $0}'

Note that sort will change the order, I am assuming the order doesn't matter.
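If the order does matter, a common order-preserving alternative (a standard awk idiom, not part of the answer above) is to swap `sort -u` for `awk '!seen[$0]++'`, which keeps only the first occurrence of each line:

```shell
printf 'b\na\nb\na\n' | awk '!seen[$0]++'
# prints:
# b
# a
```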

Also note that sed -r is GNU.

A slightly simplified and more portable version:

cat "$1" \
        | tr -cd '[:graph:]\n' \
        | grep . \
        | tr -d '"' \
        | sort -u \
        | sed 's,^[0-9]*:,http://example.com,'

Output:

http://example.com/suburl0
http://example.com/suburl1
http://example.com/suburl2
http://example.com/suburl3
http://example.com/suburl4
root
  • "sed -r is GNU" I suggest using `sed -E` as a replacement, it works both with modern GNU sed and BSD sed, plus it's consistent with `grep`'s flags. It won't work with older GNU sed versions where you want `-r` instead and it's not POSIX-defined either, but on somewhat modern systems you have better chance it works without having to know which `sed` you're coding for – Aaron Dec 12 '19 at 16:58