Linux - Get Substring from 1st occurrence of character

Question

FILE1.TXT

0020220101

or

01 20220101

I need to extract the date part from the file, where the date text starts at the first 2.

Options tried:

t_FILE_DT1='awk -F"2" '{PRINT $NF}' FILE1.TXT'
t_FILE_DT2='cut -d'2' -f2- FILE1.TXT'

echo "$t_FILE_DT1"
echo "$t_FILE_DT2"

1st output : 0101

2nd output : 0220101

Expected Output: 20220101

I'm new to Linux scripting. Could someone help me see where I'm going wrong?

Please take a look at [How do I format my posts using Markdown or HTML?](https://stackoverflow.com/help/formatting). — Cyrus, Jul 12 '22 at 17:06

score 1 · Accepted Answer · answered Jul 12 '22 at 17:06


Use grep like so:

printf '0020220101\n01 20220101\n' | grep -P -o '\d{8}\b'
20220101
20220101

Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.

SEE ALSO:
grep manual
perlre - Perl regular expressions

answered Jul 12 '22 at 17:06

Timur Shtatland


I think OP was more concerned about extracting the date after the first 2 characters -- Or first two characters + space .. – Zak Jul 12 '22 at 17:15

The `-P` is not portable, but can easily be avoided. This extracts all occurrences of eight digits if there are multiple on a line. – tripleee Jul 12 '22 at 20:09
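One way to avoid `-P`, as the comment above suggests, is an extended regex anchored at the end of the line. This is only a sketch: the function name `extract_date` is made up for illustration, it assumes the date is always the last eight characters of the line, and `-o` itself is a widely supported extension (GNU and BSD grep) rather than POSIX.

```shell
# Hypothetical helper: print the eight digits that end each line read from stdin.
# -E enables extended regexes so {8} works without backslashes.
# -o prints only the matched text instead of the whole line.
extract_date() {
  grep -Eo '[0-9]{8}$'
}

printf '0020220101\n01 20220101\n' | extract_date
```

Because of the `$` anchor, at most one match per line is printed, which sidesteps the multiple-occurrence issue mentioned above.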

Ed Morton · Answer 2 · 2022-07-12T19:27:32.207

Using any awk:

$ awk '{print substr($0,length()-7)}' file
20220101
20220101

The above was run on this input file:

$ cat file
0020220101
01 20220101

Regarding PRINT $NF in your question - PRINT != print. Get out of the habit of using all-caps unless you're writing Cobol. See correct-bash-and-shell-script-variable-capitalization for some reasons.

The 2 in your scripts is telling awk and cut to use the character 2 as the field separator, so each will carve up the input into substrings everywhere a 2 occurs.

The 's in your question are single quotes, which make strings literal. You were intending to use backticks, `cmd`, but those are deprecated in favor of $(cmd) anyway.
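Putting those points together, capturing the command's output in a variable might look like the sketch below. The lowercase variable name `t_file_dt` and the temporary sample file are assumptions made for illustration, and the input is a single line so the variable holds one date.

```shell
# Create a throwaway sample file (stand-in for FILE1.TXT).
tmp=$(mktemp)
printf '0020220101\n' > "$tmp"

# $(...) runs the command and captures its output; single quotes would
# only have stored the command text itself as a literal string.
t_file_dt=$(awk '{ print substr($0, length($0) - 7) }' "$tmp")

echo "$t_file_dt"
rm -f "$tmp"
```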

score 0 · Answer 3 · answered Jul 12 '22 at 17:14


Instead of looking for what comes "after" the 2, which also means worrying about whether a space is involved, I would take a different approach.

Think instead about extracting the last 8 characters, which you know for a fact is your date:

input="/path/to/txt/file/FILE1.TXT"
while IFS= read -r line
do
   # read in the last 8 characters of $line .. You KNOW this is the date .. 
   # No need to worry about exact matching at that point, or spaces .. 

   myDate=${line: -8}
   echo "$myDate"
done < "$input"

answered Jul 12 '22 at 17:14

Zak



Please read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice). – Ed Morton Jul 12 '22 at 19:19
@EdMorton -- I understand, and that's a valid point ... I was more or less demonstrating the IDEA behind what I was doing .. However you get there .. With `awk` or other methods (as `cat` demonstrates in your answer) .. My hope was to point out the logic error vs syntactically making something more complicated than needs be. – Zak Jul 12 '22 at 19:28
That's fine but IMHO 6 lines of shell is more complicated than 1 line of awk in addition to being orders of magnitude slower as well as less portable so there's just no point writing it and if you do write it, it should always come with that warning not to actually use it in production code. – Ed Morton Jul 12 '22 at 19:31

score 0 · Answer 4 · answered Jul 12 '22 at 19:39

About the cut and awk commands that you tried:

Using awk -F"2" '{PRINT $NF}' file sets the field separator to 2, and $NF is the last field, so the value of the last field is 0101.

Using cut -d'2' -f2- file also uses 2 as the delimiter, and then prints all fields from the second field onward, which gives 0220101.
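To see the splitting for yourself, you could print every field awk produces for the first sample line. This is just an illustrative sketch; the variable name `fields` is an assumption.

```shell
# Split the sample line on the character 2 and show each resulting field.
# The empty field comes from the two adjacent 2s in "...22...".
fields=$(printf '0020220101\n' |
  awk -F2 '{ for (i = 1; i <= NF; i++) printf "field %d: [%s]\n", i, $i }')

echo "$fields"
# field 1: [00]
# field 2: [0]
# field 3: []
# field 4: [0101]
```

Field 4 is what $NF returns, and fields 2 through 4 rejoined with the delimiter 2 give the 0220101 that cut printed.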

If you want to match the 2 followed by 7 digits until the end of the string:

awk '
match ($0, /2[0-9]{7}$/) {
  print substr($0, RSTART, RLENGTH)
}
' file

Output

20220101

score 0 · Answer 5 · answered Jul 12 '22 at 20:05

The accepted answer shows how to extract the first eight digits, but that's not what you asked.

grep -o '2.*' file

will extract from the first occurrence of 2, and

grep -o '2[0-9]*' file

will extract all the digits after every occurrence of 2. If you specifically want eight digits, try

grep -Eo '2[0-9]{7}'

maybe also with a -w option if you want to only accept a match between two word boundaries. If you specifically want only digits after the first occurrence of 2, maybe try

sed -n 's/[^2]*\(2[0-9]*\).*/\1/p' file
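The difference between the first grep and the sed command only shows up when something follows the digits. A small sketch, using a made-up input line with trailing text (the variable names are assumptions for illustration):

```shell
# Hypothetical input line with extra text after the date.
sample='01 20220101 extra 2'

# Everything from the first 2 to the end of the line.
from_first_2=$(printf '%s\n' "$sample" | grep -o '2.*')

# Only the run of digits starting at the first 2.
digits_only=$(printf '%s\n' "$sample" | sed -n 's/[^2]*\(2[0-9]*\).*/\1/p')

echo "$from_first_2"   # 20220101 extra 2
echo "$digits_only"    # 20220101
```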
