If you have GNU grep, you can use

    grep -Pwi '(?<!#)(Corona|Covid)' filea.csv

but this obviously doesn't allow you to restrict matching to a specific field.
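PCRE's negative lookbehind is also available in Python's `re` module, so here is a quick sketch of what `(?<!#)` buys you; the sample strings are invented:

```python
import re

# Same idea as the grep -P pattern: a negative lookbehind (?<!#)
# rejects matches that are immediately preceded by a hash sign.
pat = re.compile(r'(?<!#)(Corona|Covid)', re.IGNORECASE)

print(bool(pat.search('new covid case')))   # True: plain word
print(bool(pat.search('#Covid trending')))  # False: hashtag form only
```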
Here is a moderately complex regular expression that targets the fifth column and fetches only matches which are not immediately preceded by a hash sign.
    grep -Ei '^([^,]*,){4}\b(#?[^,#]+)*\b(Corona|Covid)\b' filea.csv

- `^([^,]*,){4}` skips the first four comma-separated fields.
- `(#?[^,#]+)*` allows a hash mark followed by non-hash, non-comma characters, repeated so as to consume all such combinations before the match.
- `\b(Corona|Covid)\b` then can only match if the immediately preceding character is not a hash mark or a comma. The `\b` anchors require a word boundary on both sides of the match. (This is not entirely portable; see below.)
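This ERE happens to be valid Python `re` syntax as well, so you can sanity-check its behavior against a few invented lines:

```python
import re

# The ERE from the grep -E command above, tried on made-up sample lines.
pat = re.compile(r'^([^,]*,){4}(#?[^,#]+)*\b(Corona|Covid)\b', re.IGNORECASE)

samples = [
    'a,b,c,d,Covid cases rising',       # match: Covid in the 5th field
    'a,b,c,d,only the #Covid hashtag',  # no match: hashtag form only
    'Covid,b,c,d,nothing here',         # no match: Covid in the 1st field
]
for line in samples:
    print(bool(pat.search(line)))  # True, False, False
```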
In some sense, a simpler and more readable way to target a specific column is to use Awk.
    awk -F, -v col=5 '{ field = tolower($col); gsub(/#[A-Za-z0-9_]+/, "", field) }
    field ~ /\<(corona|covid)\>/' filea.csv
In some more detail,

- `-F,` says the field separator is a comma.
- `-v col=5` sets the variable `col` to the string "5" (the quotes are implicit here; feel free to add them when necessary; perhaps see also When to wrap quotes around a shell variable?).
- The first line creates an internal variable `field` and normalizes it: `field = tolower($col)` sets `field` to the `col`th field in the (comma-separated, per the `-F` option) current input line, converted to lower case. Awk silently converts `col` from a string to a number where necessary. `gsub(/#[A-Za-z0-9_]+/, "", field)` then replaces any matches on the regular expression with an empty string in `field`.
- `field ~ /.../` prints any lines for which `field` (the relevant field after normalization) matches this regular expression.
- The regular expression also needs to be all-lowercase in order for it to match the lowercased version of the extracted field.
- The `\<` and `\>` anchors are how you indicate word boundaries in Awk, and in some versions of grep.
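For comparison, the same normalize-then-match logic (lowercase the field, strip hashtags, then look for whole words) can be sketched in Python; the sample lines are invented:

```python
import re

def field_matches(line, col=5):
    # Naive comma split, like awk -F, (no CSV quoting support).
    field = line.split(',')[col - 1].lower()
    # Strip hashtags, like gsub(/#[A-Za-z0-9_]+/, "", field) in Awk
    # (\w covers the same [A-Za-z0-9_] set for ASCII input).
    field = re.sub(r'#\w+', '', field)
    # Whole-word match, like /\<(corona|covid)\>/.
    return bool(re.search(r'\b(corona|covid)\b', field))

print(field_matches('a,b,c,d,Covid spike'))  # True
print(field_matches('a,b,c,d,#Covid only'))  # False: hashtag was stripped
```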
However, in the general case, this script does not cope well with CSV files with complex quoting. You can make Awk parse such files correctly, but it will be significantly more complex. (In some more detail, commas are not field separators when they are inside a quoted field, surrounded by double quotes; and double quotes are not quoting when they are duplicated. There are variations, but this is the most common CSV dialect.) If you really need proper CSV support, perhaps switch to Python.
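To see why naive comma splitting breaks down, compare it against the `csv` module on an invented line with a quoted field:

```python
import csv
import io

line = 'a,"b, with a comma",c,d,Covid'

# Naive splitting counts the comma inside the quotes as a separator...
print(line.split(',')[4])  # prints: d

# ...while csv.reader understands the quoting and finds the real 5th field.
row = next(csv.reader(io.StringIO(line)))
print(row[4])  # prints: Covid
```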
    #!/usr/bin/env python3
    import csv
    import re
    import sys

    reader = csv.reader(sys.stdin)
    writer = csv.writer(sys.stdout)
    for line in reader:
        if re.search(r'(?<!#)\b(?:Corona|Covid)\b', line[4], re.IGNORECASE):
            writer.writerow(line)
In very brief, `line` will be a list of the fields in the current input line, with indices starting at 0, so `line[4]` is the 5th field; the regular expression uses a negative lookbehind `(?<!#)` to require that the parenthesized main regex not be immediately preceded by a literal hash sign in order to be allowed to match, and `re.IGNORECASE` says to match case-insensitively.
You would save this script as `csvcovid.py` and run it like `python3 csvcovid.py <filea.csv`.
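If you want to try the same filter without a separate file, you can point the `csv` module at an in-memory sample; the data below is invented:

```python
import csv
import io
import re

data = 'a,b,c,d,Covid ward\na,b,c,d,#Covid hashtag\n"x, y",b,c,d,corona beer\n'

# Same filter as in the script, collecting the rows it would print.
kept = []
for row in csv.reader(io.StringIO(data)):
    if re.search(r'(?<!#)\b(?:Corona|Covid)\b', row[4], re.IGNORECASE):
        kept.append(row)

for row in kept:
    print(row[4])  # prints: Covid ward, then: corona beer
```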
It's not exactly clear from your question what your conditions for a match are, but the word boundaries are an attempt to guesstimate what you mean. How exactly to indicate a word boundary depends somewhat on your regex variant; e.g. macOS wants `[[:<:]]` in front and `[[:>:]]` in the back instead of `\<`/`\>` (traditional `grep -E`) or `\b` (Perl-style).
Demo for all of these: https://ideone.com/Xqr4wr