
In GCP: how to count the lines in a file that do not match a specific delimiter count, ignoring the header and trailer (Python/Bash operator)

Example:

Data

HDR|Filename
10|1000|CHN|TVL|TWD
10|1000|CHN|TVL|TWD
10|1000|CHN|TVL|TWD
10|1000|CHN|TVL|TWD
10|1000|CHN|TVL
TRL|Filename

Expected result

The HDR and TRL lines should be ignored.

Count: 1 (because the line 10|1000|CHN|TVL has only 3 delimiters)

I need to know an efficient way to achieve this with Airflow operators.

  • What have you tried, and where did it fail? – Jetchisel Jun 02 '23 at 08:53
  • I am new to GCP / Airflow operators. I need suggestions on which operators would be efficient. I would also like to know whether there is a way to do this without keeping the whole data in memory. – Mani Shankar.S Jun 02 '23 at 08:57
  • Being new to something is not a justification. When you ask, we expect that you have already tried something and done some investigation. Then you show us that attempt and why it hasn't worked. Also, asking for suggestions (you're asking for an efficient way to do it) is off-topic. Try something and come back with your attempt. – Puteri Jun 02 '23 at 16:50
  • Thanks for the comments. I was able to make an attempt and come up with a solution. Please refer to this link: https://stackoverflow.com/questions/76405431/bash-operator-to-access-gcs-bucket-file-in-gcp-astronomer?noredirect=1#comment134739502_76405431 – Mani Shankar.S Jun 06 '23 at 07:18

1 Answer


@Mani Shankar.S, based on the Stack Overflow link you mentioned in the comment: using the gsutil cat command in a BashOperator, we can check the delimiter count of the lines in a file while ignoring the header and trailer.

from airflow.operators.bash import BashOperator

# Prints 'rite' only when every data line (HDR/TRL lines skipped) has the same
# number of "|" delimiters and that number is 9; otherwise prints 'not rite'.
bash_operator = BashOperator(
   task_id='mani_bash',
   # sort -u (rather than uniq alone) collapses duplicate counts even when they are not adjacent.
   bash_command="""counts=$(gsutil cat gs://<bucketname>/<location>/filename.txt | awk -F'|' '!/^(HDR|TRL)/ { print NF-1 }' | sort -u)
if [ "$(echo "$counts" | wc -l)" -eq 1 ] && [ "$counts" -eq 9 ]; then
echo 'rite';
else
echo 'not rite';
fi""",
)
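
The question also asks for the number of offending lines and for a way to avoid holding the whole file in memory. A minimal Python sketch of that count, assuming the header and trailer always start with HDR and TRL and that the expected delimiter count is known in advance, might look like this. The function name `count_bad_lines` and its parameters are hypothetical and are not part of the original solution.

```python
from typing import Iterable, Tuple


def count_bad_lines(
    lines: Iterable[str],
    expected: int,
    delimiter: str = "|",
    skip_prefixes: Tuple[str, ...] = ("HDR", "TRL"),  # assumed header/trailer markers
) -> int:
    """Count data lines whose delimiter count differs from `expected`.

    Lines are consumed one at a time, so the input is never fully loaded
    into memory when `lines` is a file object or another lazy iterable.
    """
    bad = 0
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the header/trailer records.
        if not line or line.startswith(skip_prefixes):
            continue
        if line.count(delimiter) != expected:
            bad += 1
    return bad


# Sample data from the question.
sample = [
    "HDR|Filename",
    "10|1000|CHN|TVL|TWD",
    "10|1000|CHN|TVL|TWD",
    "10|1000|CHN|TVL|TWD",
    "10|1000|CHN|TVL|TWD",
    "10|1000|CHN|TVL",
    "TRL|Filename",
]
print("Count :", count_bad_lines(sample, expected=4))  # Count : 1
```

In Airflow, a function like this could be called from a PythonOperator and fed a file object opened in streaming mode, for example one returned by `blob.open("r")` in the google-cloud-storage client. That wiring is an assumption about the environment and is not shown in the linked solution.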

Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.

Feel free to edit this answer for additional information.

kiran mathew