awk with or condition

Question

I have a file containing the word Sweden in different variations.

I am trying to check whether the 34th column contains Sweden:

awk -F\" '$34 ~ /Sweden/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /sweden/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /SWEDEN/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^se$/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^Se$/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^SE$/  {print $0}' $ipp >> sweden.csv &

As far as I know this is going to be very slow, as I have 650 million rows.

Is there any way I can match all variations in one awk command?

Since you are processing a CSV-file, please have a look at [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/questions/45420535) — kvantour, Jul 13 '22 at 10:37

score 4 · Accepted Answer · answered Jul 13 '22 at 10:30

You can use this awk:

awk -F\" 'tolower($34) ~ /sweden|^se$/' "$ipp" >> sweden.csv
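
As a quick sanity check, here is a minimal sketch that runs the command on a tiny made-up file. The `make_row` helper, the 20-column layout and the sample values are assumptions for illustration only; with `"` as the separator, field 34 is the 17th quoted column.

```shell
# make_row VALUE: hypothetical helper that prints one fully quoted CSV row
# with 20 columns; the 17th column (awk field 34 when splitting on ") is VALUE.
make_row() {
    awk -v v="$1" 'BEGIN {
        for (i = 1; i <= 20; i++)
            printf "\"%s\"%s", (i == 17 ? v : "c" i), (i < 20 ? "," : "\n")
    }'
}

tmp=$(mktemp -d)
ipp="$tmp/input.csv"

# Seven sample rows: five should match, Senegal and Norway should not.
for v in Sweden SWEDEN "kingdom of sweden" se SE Senegal Norway; do
    make_row "$v"
done > "$ipp"

# Lowercase the field once, then test a single alternation.
awk -F\" 'tolower($34) ~ /sweden|^se$/' "$ipp" > "$tmp/sweden.csv"

# Count the selected rows.
grep -c '' "$tmp/sweden.csv"
```

Senegal and Norway are left out: the `^se$` branch only matches when the whole field is `se`, and `tolower` folds the input so one lowercase pattern covers every capitalisation.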

anubhava

3

Don't forget, if you are angry, you can use `awk -F\" 'toupper($34) ~ /SWEDEN|^SE$/' "$ipp" >> sweden.csv` – kvantour Jul 13 '22 at 12:06

score 4 · Answer 2 · answered Jul 13 '22 at 10:42

With your shown samples and attempts, please try the following awk code. It simply sets the field separator to " and, in the main block, checks whether the 34th field either contains sweden (in any combination of upper and lower case letters) or is exactly se (again with upper or lower case letters). If either condition holds, the line is printed.

awk -F\" '$34 ~ /[Ss][Ww][Ee][Dd][Ee][Nn]|^[Ss][Ee]$/' "$ipp" >> sweden.csv

RavinderSingh13

1

It would be interesting to measure the performance of the different approaches. I suspect converting to lowercase would probably be the slowest. – P.P Jul 13 '22 at 11:00
@P.P, could be, let's see what the OP says after testing all the codes with actual samples, cheers. – RavinderSingh13 Jul 13 '22 at 11:01

score 2 · Answer 3 · answered Jul 13 '22 at 10:36

If you're using GNU awk, you can set the IGNORECASE variable:

awk -F\" 'BEGIN{IGNORECASE=1} $34 ~ /sweden|^se$/' "$ipp" >> sweden.csv
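
Note that `IGNORECASE` is specific to gawk; in other awks the assignment just sets an ordinary variable and the match stays case-sensitive. A hedged sketch, assuming GNU awk is installed under the name `gawk` and using a made-up sample file (`make_row` is a hypothetical helper), that falls back to `tolower` elsewhere:

```shell
# make_row VALUE: hypothetical helper that prints one fully quoted CSV row
# with 20 columns; the 17th column (awk field 34 when splitting on ") is VALUE.
make_row() {
    awk -v v="$1" 'BEGIN {
        for (i = 1; i <= 20; i++)
            printf "\"%s\"%s", (i == 17 ? v : "c" i), (i < 20 ? "," : "\n")
    }'
}

tmp=$(mktemp -d)
ipp="$tmp/input.csv"

# Four sample rows: the first three should match, Norway should not.
for v in Sweden sWeDeN Se Norway; do
    make_row "$v"
done > "$ipp"

if command -v gawk >/dev/null 2>&1; then
    # GNU awk: IGNORECASE makes regex matching case-insensitive.
    gawk -F\" 'BEGIN { IGNORECASE = 1 } $34 ~ /sweden|^se$/' "$ipp"
else
    # Any other awk: fold the field to lowercase instead.
    awk -F\" 'tolower($34) ~ /sweden|^se$/' "$ipp"
fi > "$tmp/sweden.csv"

# Count the selected rows.
grep -c '' "$tmp/sweden.csv"
```

Either branch should select the same rows, so the script can be moved between machines without checking which awk is installed.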

edited Jul 13 '22 at 11:05

answered Jul 13 '22 at 10:36

P.P

score 0 · Answer 4 · answered Jul 13 '22 at 12:24

Your code might be ameliorated as already explained. More generally, you can put the 6 pattern-action pairs in a single awk call rather than 6 separate ones, that is

awk -F\" '$34 ~ /Sweden/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /sweden/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /SWEDEN/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^se$/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^Se$/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^SE$/  {print $0}' $ipp >> sweden.csv &

might be written more concisely as

awk -F\" '$34 ~ /Sweden/  {print $0}$34 ~ /sweden/  {print $0}$34 ~ /SWEDEN/  {print $0}$34 ~ /^se$/  {print $0}$34 ~ /^Se$/  {print $0}$34 ~ /^SE$/  {print $0}' $ipp >> sweden.csv &

Note that if a line contains both Sweden and SWEDEN it will appear twice (in both the 6 x awk and the 1 x awk solution), and the order of lines in the output might differ between these 2 approaches.
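
If the duplicates are not wanted, one option (a sketch, not the only way) is to join the conditions with `||`, so each line is tested once and printed at most once. The sample file and the `make_row` helper below are made up for illustration:

```shell
# make_row VALUE: hypothetical helper that prints one fully quoted CSV row
# with 20 columns; the 17th column (awk field 34 when splitting on ") is VALUE.
make_row() {
    awk -v v="$1" 'BEGIN {
        for (i = 1; i <= 20; i++)
            printf "\"%s\"%s", (i == 17 ? v : "c" i), (i < 20 ? "," : "\n")
    }'
}

tmp=$(mktemp -d)
ipp="$tmp/input.csv"

# The first row matches two of the six patterns, the second matches one.
{
    make_row "Sweden / SWEDEN"
    make_row "se"
    make_row "Norway"
} > "$ipp"

# Six pattern-action pairs: a row is printed once per matching pattern.
awk -F\" '$34 ~ /Sweden/  {print $0}$34 ~ /sweden/  {print $0}$34 ~ /SWEDEN/  {print $0}$34 ~ /^se$/  {print $0}$34 ~ /^Se$/  {print $0}$34 ~ /^SE$/  {print $0}' "$ipp" > "$tmp/pairs.csv"

# One pattern built with ||: a row is printed at most once.
awk -F\" '$34 ~ /Sweden/ || $34 ~ /sweden/ || $34 ~ /SWEDEN/ ||
          $34 ~ /^se$/   || $34 ~ /^Se$/   || $34 ~ /^SE$/' "$ipp" > "$tmp/once.csv"

# Compare the row counts of the two outputs.
grep -c '' "$tmp/pairs.csv"
grep -c '' "$tmp/once.csv"
```

With this sample the pattern-action version writes the first row twice, while the `||` version writes each matching row once and keeps the input order.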

score 0 · Answer 5 · answered Jul 13 '22 at 16:34

awk -F\" '$34 ~ /^([sS][eE]|[Ss]weden|SWEDEN)$/' "$ipp" >> sweden.csv

ufopilot

score 0 · Answer 6 · answered Jul 14 '22 at 01:45

mawk '$34 ~ __' FS='"' \
     __='^[Ss][Ee]$|[Ss][Ww][Ee][Dd][Ee][Nn]|([Kk][Oo][Nn][Uu][Nn][Gg][Aa][Rr][Ii][Kk][Ee][Tt] +)?[Ss][Vv][Ee][Rr][Ii][Gg][Ee]' \
     "$ipp" >> sweden.csv

Between the 2-letter country code and its name in 2 languages this should be at least somewhat comprehensive.
