awk - remove character in regex

Question

I want to remove the leading 1 with awk from any field that matches this regex: ^1[0-9]{10}$. I've been trying to make it work with sub or substr for a few hours now, but I am unable to find the correct logic. I already have the solution for sed: s/^1\([0-9]\{10\}\)$/\1/, and I need to make this work with awk.

Edit for input and output example. Input:

10987654321
2310987654321
1098765432123

(awk twisted and overcomplicated syntax)

Output:

0987654321
2310987654321
1098765432123

Basically the leading 1 needs to be removed only when it's followed by ten digits. The 2nd and 3rd example lines are correct, 2nd has 23 in front of 1, 3rd has a leading 1 but it's followed by 12 digits instead of ten. That's what the regex specifies.

It's much better if you post some data and what you would like to do with it. Otherwise this will only be guessing. — Jotne, Aug 26 '14 at 12:45
but you can use the `match` function and then use the values set in RSTART and RLENGTH. See http://www.grymoire.com/Unix/Awk.html#uh-47 Good luck! — shellter, Aug 26 '14 at 12:46
I edited my question in order to include specific examples of what I want awk to do, even if the regex is self explanatory and I also provided the sed alternative. — one-liner, Aug 26 '14 at 13:46

Kent · Answer 1 · 2014-08-26T14:09:56.253

1

If GNU awk is available to you, you could use the gensub function:

echo '10987654321'|awk '{s=gensub(/^1([0-9]{10})$/,"\\1","g");print s}'
0987654321

edit:

do it for every field:

awk '{for(i=1;i<=NF;i++)$i=gensub(/^1([0-9]{10})$/,"\\1","g", $i)}7' file

test:

kent$  echo '10987654321 10987654321'|awk '{for(i=1;i<=NF;i++)$i=gensub(/^1([0-9]{10})$/,"\\1","g", $i)}7'
0987654321 0987654321

edited Aug 26 '14 at 14:09

answered Aug 26 '14 at 13:02


It works but not on multiple fields, I tried `echo '10987654321 10987654321'`. Is there no way of doing this with `sub`/`gsub`? `Substr` also did not work at all. – one-liner Aug 26 '14 at 13:37
This is the reason I wanted to use awk in the first place, to perform the substitution on each field. By default awk's field separator is one or more spaces. – one-liner Aug 26 '14 at 14:07
@linux_newbie I see what you meant, you need loop the fields: `awk '{for(i=1;i<=NF;i++)$i=gensub(/^1([0-9]{10})$/,"\\1","g", $i)}7' file` – Kent Aug 26 '14 at 14:08
Thank you for the help. I chose Steve's solution since I found it more straightforward but I am upping your solution as well. – one-liner Aug 27 '14 at 11:51

Steve · Accepted Answer · 2014-08-26T14:04:24.453

1

With sub(), you could try:

awk '/^1[0-9]{10}$/ { sub(/^1/, "") }1' file

Or with substr():

awk '/^1[0-9]{10}$/ { $0 = substr($0, 2) }1' file

If you need to test each field, try looping over them:

awk '{ for(i=1; i<=NF; i++) if ($i ~ /^1[0-9]{10}$/) sub(/^1/, "", $i) }1' file

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
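As a rough sketch of the per-field loop on the sample input from the question: the helper name `strip_leading_one` is hypothetical, and the `{10}` interval is swapped for a `length()` check plus `[0-9]+`, on the assumption that some awk builds do not support interval expressions. It is a variant of the loop above, not a replacement for it.

```shell
# Hypothetical helper: remove a leading 1 from every field that is
# exactly 11 digits long and starts with 1.
strip_leading_one() {
    awk '{
        for (i = 1; i <= NF; i++)
            # length() == 11 plus [0-9]+ stands in for ^1[0-9]{10}$
            if (length($i) == 11 && $i ~ /^1[0-9]+$/)
                $i = substr($i, 2)
    } 1'
}

printf '%s\n' '10987654321 2310987654321' '1098765432123' | strip_leading_one
# prints:
# 0987654321 2310987654321
# 1098765432123
```

Only the first field of the first line changes; the other two values fail either the length test or the leading-1 test.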

edited Aug 26 '14 at 14:04

answered Aug 26 '14 at 13:56


Thanks, this works: `echo '10987654321,10987654321' | awk -F ',' '{for(i=1; i<=NF; i++) if ($i ~ /^1[0-9]{10}$/) sub(/^1/, "", $i) }1'`...but I want to add another sub before the initial one: `sub(/ |-|+|\(|\)/,"")` which basically preformats the strings by removing spaces,+,-,(,) in order to be matched by the regex and also remove the leading one. I tried but I keep getting a syntax error. – one-liner Aug 26 '14 at 14:45
Haven't tried it yet but I don't think it will work for the intended input. Consider this string being piped to awk: `+1 0(987)654-321`. This needs to be preformatted to 10987654321 so that the `^1[0-9]{10}$` regex will match and then awk will proceed with removing the leading `1` with the second sub. The `if` condition will not be met if I don't do the preliminary formatting sub first, correct? – one-liner Aug 26 '14 at 16:02
@linux_newbie: Is there any reason why you cannot throw a `gsub(/[ () +-]*/, "")` in front of the loop? That would be the simplest solution IMO. If you want to apply that to a subset of fields, just move it inside the loop and set a target. For example: `awk -F, '{ for(i=1;i<=NF;i++) { gsub(/[ () +-]*/, "", $i); if ($i ~ /^1[0-9]{10}$/) { sub(/^1/, "", $i) } } }1' OFS=, file` – Steve Aug 26 '14 at 23:08
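The strip-then-test idea from this comment might look like the sketch below on the `+1 0(987)654-321` example. The function name `normalize_numbers` and the second field `555-0100` are made up for illustration, and the `{10}` interval is again replaced by a `length()` check, assuming an awk without interval support.

```shell
# Hypothetical helper: clean each comma-separated field, then drop a
# leading 1 from fields that end up as 1 followed by ten digits.
normalize_numbers() {
    awk -F, -v OFS=, '{
        for (i = 1; i <= NF; i++) {
            gsub(/[ ()+-]/, "", $i)   # remove spaces, parentheses, + and -
            if (length($i) == 11 && $i ~ /^1[0-9]+$/)
                sub(/^1/, "", $i)
        }
    } 1'
}

printf '%s\n' '+1 0(987)654-321,555-0100' | normalize_numbers
# prints: 0987654321,5550100
```

Keeping the hyphen last inside the brackets makes it a literal character instead of a range operator.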
This does exactly what I want: `awk -F ',' '{gsub(/[ () +-]*/, ""); for(i=1; i<=NF; i++) if ($i ~ /^1[0-9]{10}$/) sub(/^1/, "", $i) }1'`. Can you please explain the sub syntax from the first example you gave: `{ sub(/^1/, "") }1`. What exactly does the 1 outside the curly braces do? And can you explain the final solution loop syntax? I always had trouble understanding loop syntaxes in awk. In any case, thanks a lot for the help! – one-liner Aug 27 '14 at 11:49

@linux_newbie: No worries. The `1` on the end forces the command to return true. By default, AWK will print the record (which, by default, is a single line) when the expression evaluates to true. Of course, you don't necessarily need to use `1` (you could use any non-zero integer), but the use of `1` to return true is best practice. The long equivalent would be: `awk 'BEGIN { FS="," } { gsub(/[() +-]*/, ""); for (i=1; i<=NF; i++) { if ($i ~ /^1[0-9]{10}$/) { sub(/^1/, "", $i) } } print }' file`. The placement of the braces is critical. – Steve Aug 27 '14 at 12:52
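A small sketch of the trailing `1` idiom described here, using a made-up two-line input: a pattern with no action prints the record whenever the pattern is true, so `1` behaves like an unconditional `{ print }`.

```shell
# These three commands print every input line; only the spelling differs.
printf 'a\nb\n' | awk '1'
printf 'a\nb\n' | awk '{ print }'
printf 'a\nb\n' | awk '1 { print $0 }'

# A pattern that evaluates to zero selects nothing, so this prints no lines.
printf 'a\nb\n' | awk '0'
```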
@linux_newbie: The `for` loop used is your typical C-style loop, which in this case will loop from one to the number of fields in the row, `NF`. `$i` is therefore the actual field value, and `i` is its field position. Another common type of loop you will see regularly in `AWK` code is one that loops over the indices of an array. For example, `for (i in a) { print i, a[i] }` will print the key (`i`) followed by the key's value (`a[i]`). HTH. – Steve Aug 27 '14 at 12:57
Thanks for taking the time to answer my questions. The one thing that still bothers me because I don't understand the logic: why the gsub with `() +-` is being applied to all fields while the regex sub needs a loop to achieve this? – one-liner Aug 27 '14 at 17:16
Guess I spoke too soon... OFS is not printed out if the input is '453452,34545' (fewer than 11 digits). This is the syntax used: `awk -F ',' -v OFS='|' '{gsub(/[ ()+-]/,""); for(i=1; i<=NF; i++) if ($i~/^1[0-9]{10}$/) sub(/^1/,"",$i)}1'`. The OFS is still `,`. Furthermore, if I expand the initial gsub character class, there is no more separator in the output: `awk -F ',' -v OFS='|' '{gsub(/[ ()+-\/\\\[\]\|]/,""); for(i=1; i<=NF; i++) if ($i~/^1[0-9]{10}$/) sub(/^1/,"",$i)}1'`. This is increasingly frustrating, I am wasting hours for a single syntax that's supposed to do a very simple thing. – one-liner Aug 27 '14 at 21:21
@linux_newbie: WRT#1: I was under the impression that you wanted to strip these characters from each line. Doing so makes it easy to then test to see if the number starts with `1` and is followed by ten digits. WRT#2: If no changes are made to a line, AWK will print the line without setting the new `OFS`. This is a good thing, because it makes AWK run fast. If you want AWK to force a change to the line's field separator, the AWKish way is to say let `$1=$1`. Try: `awk -F, -v OFS='|' '{ gsub(/[() +-]*/, ""); for (i=1;i<=NF;i++) { if ($i ~ /^1[0-9]{10}$/) { sub(/^1/, "", $i) } } $1=$1 }1' file` – Steve Aug 27 '14 at 23:04
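The `$1=$1` behaviour can be seen in isolation with the `453452,34545` input from the earlier comment. This is only a minimal illustration with the substitutions left out:

```shell
# No field is assigned, so $0 is printed exactly as it was read
# and OFS never appears.
printf '453452,34545\n' | awk -F, -v OFS='|' '1'
# prints: 453452,34545

# Assigning a field (even to itself) makes awk rebuild $0 with OFS.
printf '453452,34545\n' | awk -F, -v OFS='|' '{ $1 = $1 } 1'
# prints: 453452|34545
```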
@linux_newbie: WRT#3: Remember, if you're really stuck with substitutions, you can often use multiple calls to `gsub()`. Yes, it's less efficient but it will get the job done and save some frustration. I believe the problem you're having with the regex is because you're trying to escape some characters. A better way to write that character class would be: `gsub(/[][() /|\+-]*/, "")`. – Steve Aug 27 '14 at 23:20
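One plausible reason the separator vanished with the expanded class, offered here as an assumption about the earlier command and not as a confirmed diagnosis: a hyphen between two characters inside brackets forms a range, and in ASCII the range from `+` to `/` also covers `,` and `.`. The sketch uses string regexes so the slash needs no escaping.

```shell
# "[+-/]" is a range from + to /, and the comma falls inside it.
printf '453452,34545\n' | awk '{ gsub("[+-/]", ""); print }'
# prints: 45345234545

# With the hyphen placed last it is a literal, so the comma survives.
printf '453452,34545\n' | awk '{ gsub("[+/-]", ""); print }'
# prints: 453452,34545
```

Once the commas are gone from `$0`, there is only one field left, which would explain why no `OFS` shows up in the output either.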
@linux_newbie: WRT#4: Only you know what your _actual_ input is and only you know what the expected output ought to be. From what I can tell, your actual input is a table of strangely formatted numbers, some of which look like phone numbers. There may be extra rows or columns in there, but I really don't know for sure. You've asked a question, but it wasn't the question you really wanted to ask. If you are still having difficulty, please [edit](https://stackoverflow.com/posts/25506106/edit) your question with some actual input and expected output. Include as many edge cases as possible. HTH. – Steve Aug 27 '14 at 23:33

Steve, thanks a lot for the feedback. I did try with `$1=$1` but like this: `; $1=$1; print`. This is why it didn't work probably. I did try to specify the characters in the class without escaping them but strange things happened, that is why I escaped them. I believe you fully answered my question given the information I provided. I actually found a fully working solution on my own using just sed and it only took me a few minutes. But your solution still provides valuable insight and I might come back to it should my needs demand it. I highly appreciate your help and feedback on this. – one-liner Aug 28 '14 at 04:03
