2

I have a file that contains the word Sweden in different variations.

I am trying to select the rows whose 34th column contains Sweden:

awk -F\" '$34 ~ /Sweden/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /sweden/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /SWEDEN/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^se$/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^Se$/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^SE$/  {print $0}' $ipp >> sweden.csv &

As far as I know, this is going to be very slow, since I have 650 million rows and each command reads the whole file again.

Is there any way I can match all variations in one awk command?

Amit Singh
  • 2
    Since you are processing a CSV-file, please have a look at [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/questions/45420535) – kvantour Jul 13 '22 at 10:37

6 Answers

4

You can use this awk:

awk -F\" 'tolower($34) ~ /sweden|^se$/' "$ipp" >> sweden.csv 
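As a quick sanity check of this filter, here is a minimal sketch on made-up data. The `make_row` helper, the 20-column layout, and the temporary files are assumptions for illustration only; the real file's layout may differ. With `"` as the separator, the 17th quoted column lands in field 34.

```shell
# Hypothetical helper: emit one CSV row of 20 quoted columns with the given
# country in column 17, which is field 34 when the separator is ".
make_row() {
    awk -v country="$1" 'BEGIN {
        for (i = 1; i <= 20; i++)
            printf "%s\"%s\"", (i > 1 ? "," : ""), (i == 17 ? country : "c" i)
        print ""
    }'
}

ipp=$(mktemp)      # sample input (stands in for the real 650M-row file)
out=$(mktemp)      # stands in for sweden.csv

for country in Sweden "SWEDEN AB" se Se Senegal Norway; do
    make_row "$country"
done > "$ipp"

# Same filter as above: lowercase field 34 once, then test a single regex.
awk -F\" 'tolower($34) ~ /sweden|^se$/' "$ipp" >> "$out"

# Show which countries were selected.
awk -F\" '{print $34}' "$out"
```

Senegal is left out because `^se$` only matches a field that is exactly `se` after lowercasing, and Norway matches neither alternative.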
anubhava
  • 3
    Don't forget, if you are angry, you can use `awk -F\" 'toupper($34) ~ /SWEDEN|^SE$/' "$ipp" >> sweden.csv` – kvantour Jul 13 '22 at 12:06
4

With your shown samples and attempts, please try the following awk code. It sets the field separator to " and, in the main block, checks whether the 34th field either contains sweden (in any combination of upper and lower case) or is exactly se (again in any combination of upper and lower case). If either condition holds, the line is printed.

awk -F\" '$34 ~ /[Ss][Ww][Ee][Dd][Ee][Nn]|^[Ss][Ee]$/' "$ipp" >> sweden.csv
RavinderSingh13
  • 1
    It would be interesting to measure the performance of the different approaches. I suspect converting to lowercase would probably be the slowest. – P.P Jul 13 '22 at 11:00
  • @P.P, could be, let's see what OP says when OP tests all codes with actual samples, cheers. – RavinderSingh13 Jul 13 '22 at 11:01
2

If you're using GNU awk, you can set the IGNORECASE variable to make regex matching case-insensitive:

awk -F\" 'BEGIN{IGNORECASE=1} $34 ~ /sweden|^se$/' "$ipp" >> sweden.csv
P.P
0

Your code can be improved as already explained. More generally, you can put the 6 pattern-action pairs in a single awk call rather than 6 separate ones, that is,

awk -F\" '$34 ~ /Sweden/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /sweden/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /SWEDEN/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^se$/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^Se$/  {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^SE$/  {print $0}' $ipp >> sweden.csv &

might be written more concisely as

awk -F\" '$34 ~ /Sweden/  {print $0}$34 ~ /sweden/  {print $0}$34 ~ /SWEDEN/  {print $0}$34 ~ /^se$/  {print $0}$34 ~ /^Se$/  {print $0}$34 ~ /^SE$/  {print $0}' $ipp >> sweden.csv &

Note that if the 34th field of a line contains both Sweden and SWEDEN, that line will appear twice (in both the 6 x awk and the 1 x awk solution). The order of lines in the output might also differ between these 2 approaches, because the 6 background jobs write to the file concurrently.
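To illustrate the duplication note, here is a small sketch on made-up data that compares chained pattern-action pairs with a single combined condition. The `row` helper, the 20-column layout, and the sample values are assumptions for illustration, not taken from the real file.

```shell
# Hypothetical helper: emit one CSV row of 20 quoted columns; column 17
# (field 34 when splitting on ") holds the country text.
row() {
    awk -v country="$1" 'BEGIN {
        for (i = 1; i <= 20; i++)
            printf "%s\"%s\"", (i > 1 ? "," : ""), (i == 17 ? country : "c" i)
        print ""
    }'
}

sample=$(mktemp)
{ row "Sweden (SWEDEN)"; row "Norway"; row "SE"; } > "$sample"

# Chained pattern-action pairs: every pattern that matches prints the line again.
chained=$(awk -F\" '$34 ~ /Sweden/ {print $0} $34 ~ /SWEDEN/ {print $0} $34 ~ /^SE$/ {print $0}' "$sample" | wc -l | tr -d ' ')

# One combined condition: each line is tested once and printed at most once.
combined=$(awk -F\" '$34 ~ /Sweden|sweden|SWEDEN|^(se|Se|SE)$/' "$sample" | wc -l | tr -d ' ')

echo "chained: $chained lines, combined: $combined lines"
```

If duplicates are unwanted, the combined form is probably the simpler choice: it prints each matching line once and keeps the input order.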

Daweo
0
awk -F\" '$34 ~ /^[sS][eE]$|^[Ss]weden$|^SWEDEN$/' "$ipp" >> sweden.csv
ufopilot
0
mawk '$34~__' FS='[\"]' \
     __='^[Ss][Ee]$|[Ss][Ww][Ee][Dd][Ee][Nn]|([Kk][Oo][Nn][Uu][Nn][Gg][Aa][Rr][Ii][Kk][Ee][Tt] +)?[Ss][Vv][Ee][Rr][Ii][Gg][Ee]' \
     "$ipp" >> sweden.csv

Between the 2-letter country code and its name in 2 languages this should be at least somewhat comprehensive.

RARE Kpop Manifesto