Substring of numbers from a Non Ascii string in bash

Question

I have a string which is read from a file and it contains all types of non-ascii characters like this

line=^AÀÀ^P^G^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå

Now I just need to extract '15552655' number from this.

What I tried :

line=$(sed -n '1p' < file)

number=$(echo "${line//[!0-9]/}")
              or
number=$(echo $line | sed 's/[^0-9]*//g')

But this returns '155526554', so I need a way to extract the substring of the line that consists of at least 4 consecutive digits [it is guaranteed that the number I want has at least 4 digits].

Any help is greatly appreciated.

Update-1 :

number=$(echo $line | sed 's/[^0-9]*\([0-9]\{1,\}\).*$/\1/')

This seems to work for the above case, but it will fail if the input is of this format

line=^AÀÀ^P^4G^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå

In this case it returns 4, i.e. the first run of digits. I need to add something that says "give me the longest run, or a run of at least 4 digits".

score 1 · Accepted Answer · answered May 23 '18 at 23:44

I'd use head and grep:

head -1 filename | grep -o '[0-9]\{4,\}'

Here [0-9]\{4,\} matches any run of four or more digits. The -o switch tells grep to print only those matches (on a line of their own).

If this still gives you false positives, you could process those further to find the largest number in the bunch by using sort and tail, as in

head -1 filename | grep -o '[0-9]\{4,\}' | sort -n | tail -1

This will in turn:

get the first line from the file,
isolate all instances of four or more consecutive numbers,
sort these numerically, and
print the last of the sorted list, i.e. the largest one.
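
As a rough, self-contained sketch of the pipeline above: the temporary file and its contents are made up for illustration (the sample line is the one from Update-1), and the -a option is an addition that is not part of the answer. It asks grep to treat binary-looking input as text, which may be needed if the real file contains NUL bytes.

```shell
#!/bin/bash
# Hypothetical sample file holding the line from Update-1.
# Single quotes keep the shell from expanding $4 inside the sample.
tmpfile=$(mktemp)
printf '%s\n' '^AÀÀ^P^4G^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå' > "$tmpfile"

# First line only, every run of 4+ digits on its own line,
# sorted numerically, largest one last.
# -a (assumption, see above) makes grep treat the input as text.
number=$(head -n 1 "$tmpfile" | grep -ao '[0-9]\{4,\}' | sort -n | tail -n 1)
rm -f "$tmpfile"

echo "$number"    # 15552655
```

Note that sort -n picks the numerically largest match, which is not necessarily the longest one if some runs have leading zeros.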

Perfect, this does the job. Thanks – Magic, May 24 '18 at 00:02

score 1 · Answer 2 · answered May 23 '18 at 23:59

How about this:

number=$(echo "$line" | tr -cs '0-9' '\n' | awk '{if (length>l) { n=$0; l=length }} END { print n }')

Explanation: The double quotes around $line prevent the shell from doing anything weird if the string contains certain shell metacharacters. tr -cs '0-9' '\n' replaces everything that isn't a digit with a newline, "squeezing" each run of replaced characters into a single newline; this essentially produces a list of the numbers in the string, one per line. Then in awk, {if (length>l) { n=$0; l=length }} says that for each input line, if its length is greater than the longest seen so far (l), set n to the current line and l to its length. The END { print n } part prints the longest line once the end of the input is reached.
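
To try this out in isolation, here is a small sketch that wraps the same tr and awk steps in a function. The name longest_digit_run is made up for this example, and printf is used in place of echo only so that the string is printed verbatim.

```shell
#!/bin/bash
# longest_digit_run is a hypothetical helper name, not part of the answer.
longest_digit_run() {
    printf '%s\n' "$1" |
        tr -cs '0-9' '\n' |    # one run of digits per line
        awk 'length($0) > max { longest = $0; max = length($0) }
             END { print longest }'
}

# Sample line from Update-1; single quotes stop the shell expanding $4.
line='^AÀÀ^P^4G^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå'
longest_digit_run "$line"    # 15552655
```

Because the comparison is a strict "greater than", the first of several equally long runs wins.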

Leonard · Answer 3 · 2018-05-24T00:22:10.697

My suggestion is that you break the line into comma-separated numbers and then examine those numbers to your heart's content:

line='^AÀÀ^P^G^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå'

number=$(echo "$line" | sed -E 's/[^0-9]+/,/g')
echo "$number"
==> ,15552655,4,84,8,
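
One possible way to examine the comma-separated string, sketched here under the assumption that the longest run is the one you want: split it into a bash array and keep the longest element. The variable names csv, runs and longest are made up for this example.

```shell
#!/bin/bash
# Single quotes keep the shell from expanding $4 inside the sample.
line='^AÀÀ^P^G^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå'

# Same sed step as above: every run of non-digits becomes a comma.
csv=$(printf '%s\n' "$line" | sed -E 's/[^0-9]+/,/g')

# Split on commas into an array (empty fields are harmless here).
IFS=',' read -r -a runs <<< "$csv"

longest=''
for run in "${runs[@]}"; do
    if [ "${#run}" -gt "${#longest}" ]; then
        longest=$run
    fi
done

echo "$longest"    # 15552655
```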

Finding the longest run takes more work. Here's one solution, though Gordon Davisson's answer does the same thing in a one-liner.

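#!/bin/bash

# Single quotes keep the shell from expanding $4 inside the sample.
line='^AÀÀ^P^G^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå'

max_length=0
this_index=0
saved_index=-1

# Put each run of digits on its own line. tr is used here instead of sed
# because "\n" in a sed replacement is not portable.
printf '%s\n' "$line" | tr -cs '0-9' '\n' |
{ while read -r num ; do
    pieces[this_index]=$num
    this_length=${#num}
    if [ "$this_length" -gt "$max_length" ] ; then
        max_length=$this_length
        saved_index=$this_index
    fi
    this_index=$((this_index + 1))
done

# Still inside the { } group, so the variables set in the loop are visible.
echo "maxnum is ${pieces[$saved_index]}"

}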

This would be great if I can fetch largest among them without splitting them and checking the length and all that. There could be numbers before '15552655' as well. — Magic, May 23 '18 at 23:36

s3n0 · Answer 4 · 2018-05-24T00:40:10.590

If the number is always preceded by an "H" character, you can use that prefix as an anchor to pick out the right number.

#!/bin/bash

# Single quotes keep the shell from expanding $4 inside the sample.
line='^AÀÀ^P1^G88^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå'
printf 'line=%s\n\n' "$line"

strA=$(   echo "$line" | sed -E 's/.*H([0-9]+).*/\1/g'   )
strB=$(   echo "$line" | sed -n 's/[^0-9]*\([0-9]\+\).*/\1/p'  )    # first run of digits only
strC=$(   echo "$line" | sed -E 's/.*\^H([0-9]+)\^.*/\1/g'   )
strD=$(   echo "$line" | sed -E 's/(.*)([0-9]{8})(.*)/\2/g'   )    # without 'H' prefix

echo $strA    # 15552655
echo $strB    # 1
echo $strC    # 15552655
echo $strD    # 15552655
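
The same prefix idea can also be sketched without sed, using bash's own regex matching. This assumes, as the sample above does, that the prefix is the two literal characters ^ and H (in a real file it may instead be a single backspace control character). The variable names re and number are made up for this example.

```shell
#!/bin/bash
# Single quotes keep the shell from expanding $4 inside the sample.
line='^AÀÀ^P1^G88^P^@^H15552655^@^@E$4c<84>%ÿ~^@^@^Ac<8f>/qu^Q»í&.WÈå'

# ERE: a literal "^H" followed by one or more digits (captured).
re='\^H([0-9]+)'

if [[ $line =~ $re ]]; then
    number=${BASH_REMATCH[1]}
    echo "$number"    # 15552655
else
    echo "no match" >&2
fi
```

Keeping the pattern in a variable avoids quoting surprises on the right-hand side of =~.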

Note, however, that your question looks like a duplicate of: sed extracting group of digits
