grep multiple exact strings in a particular column, exclude #hashtag matches

Question

I am working on tweets and I am looking for a way to grep several strings exactly. I want only the tweets that contain the strings Covid or Corona. I don't want those with #Covid, #Corona, Coronavirus, or any other variation!

An example of a tweet is as follows (I have modified it for the purpose of this post):

1237891053686075392,38489678,2020-03-12T00:00:00Z,JAMA_current,This JAMA Insights article reviews care for the most severely ill patients with Corona #Corona Covid #Covid #coronavirus disease 2019 (#COVID19)  including standards of management of #ARDS  preventing #SARSCoV2 spread in health care settings  and surge preparation,Sprinklr,,,,FALSE,FALSE,1249,135,,,,335352,805,,2009-05-07T18:45:39Z,TRUE,en

It has 22 columns and is a csv file.

Currently, I am using this command and it still returns strings starting with #.

grep -Ew --color "Corona|Covid" file.csv

And even more complicated! What if I want to do it for a certain column?!

Any suggestion?

The problem is that # is a non-word character, so the string "#Covid" does correctly contain the word Covid. You'll have to find the word Covid preceded by the start of string or a non-word-non-hash character. — glenn jackman, May 17 '20 at 03:35
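To make the comment above concrete, here is a minimal Python sketch. It assumes Python's `\b` behaves like the word-edge test behind `grep -w` for these inputs, and the sample strings are made up for illustration. The `no_hash` pattern is one possible way to express "not preceded by a hash", not necessarily what the commenter had in mind.

```python
import re

# Hypothetical sample strings, not taken from the real data set.
samples = ["Covid", "#Covid", "Covid19", "with Covid and #Corona"]

# Plain word-boundary match: '#' is a non-word character, so "#Covid" still matches.
plain = re.compile(r"\b(?:Corona|Covid)\b")

# Same match, but rejected when the character just before it is '#'.
no_hash = re.compile(r"(?<!#)\b(?:Corona|Covid)\b")

for text in samples:
    print(f"{text!r}: plain={bool(plain.search(text))} "
          f"no_hash={bool(no_hash.search(text))}")
```

Only the second sample differs between the two patterns: `plain` accepts `#Covid`, while `no_hash` rejects it.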

score 0 · Answer 1 · answered May 17 '20 at 11:00

You could perform a negative grep on the result of your positive grep.

test.txt
covid
covid #covid
#covid

grep covid test.txt | grep -v '#'

output
covid

Alternatively, you could include leading and/or trailing spaces (or other characters) in the pattern to allow or disallow specific characters. If you want to search specific columns, then awk is your friend.

score 0 · Answer 2 · answered May 29 '23 at 05:09

Add any other acceptable variations of leading or ending punctuation to the first regular expression. I assumed you want case insensitive match.

grep -Ei " Corona | Covid " filea.csv | grep -iv "#Corona" | grep -iv "#Covid"

Regarding a specific column: that can be done with the command below for any column up to and including column 5, the tweet text. Columns before it can be added to the output. Columns after 5 may be split incorrectly, because tweets can contain commas; a delimited file with free-form text is only guaranteed to parse correctly if the delimiter is prohibited from appearing in the free-form text.

cut -d "," -f 5 filea.csv | grep -Ei " Corona|Corona | Covid|Covid " | grep -iv "#Corona" | grep -iv "#Covid"


This obviously discards any lines which contain both hashtags and non-hashtag matches. (Also, you could easily combine the final `grep -ivE '#Co(rona|vid)'`) – tripleee May 29 '23 at 05:31
Also, requiring spaces around the string will discard any matches where the string is the last word in the field or the last word on the line, and of course strings where the match is a prefix, like Covid19 – tripleee May 29 '23 at 05:35
`cut` is hard-coded to count literal commas, so it will not cope correctly with CSV files with quoted commas in fields before the one you target. – tripleee May 29 '23 at 05:36
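The first two comments above can be illustrated with a small Python sketch that emulates the two-stage pipeline (a positive match followed by a negative filter). The function name `pipeline_keeps` and the sample tweet texts are hypothetical, chosen only for this illustration.

```python
import re

# Emulates: grep -Ei " Corona | Covid " | grep -ivE "#Co(rona|vid)"
positive = re.compile(r" Corona | Covid ", re.IGNORECASE)
negative = re.compile(r"#Co(?:rona|vid)", re.IGNORECASE)

def pipeline_keeps(line):
    """Return True if the two-stage grep pipeline would print this line."""
    return bool(positive.search(line)) and not negative.search(line)

# Hypothetical tweet texts.
tweets = [
    "patients with Covid need care",   # plain word, surrounded by spaces
    "patients with Covid and #Covid",  # plain word plus a hashtag
    "patients with Covid",             # plain word at the very end
]
for tweet in tweets:
    print(pipeline_keeps(tweet), tweet)
```

Only the first text survives: the second is discarded because the line also contains a hashtag, and the third because there is no space after the final word.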

tripleee · Answer 3 · 2023-05-29T09:47:05.130

If you have GNU grep you can use grep -Pwi '(?<!#)(Corona|Covid)' filea.csv but this obviously doesn't allow you to restrict matching to a specific field.

Here is a moderately complex regular expression which targets the fifth column and only fetches matches which are not immediately preceded by a hash sign.

grep -Ei '^([^,]*,){4}(#?[^,#]+)*\b(Corona|Covid)\b' filea.csv

^([^,]*,){4} skips the first four comma-separated fields
(#?[^,#]+)* allows a hash mark followed by non-hash, non-comma characters, repeated to consume all such combinations before the match
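\b(Corona|Covid)\b then can only match if the immediately preceding character is not a hash mark (a match at the very start of the field, right after the comma, is still allowed because the previous group may match zero times). The \b anchors require a word boundary on both sides of the match. (This is not entirely portable; see below.)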
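For comparison, the same field-targeting idea can be sketched in Python. This is a variant under stated assumptions, not a port of the expression above: the `(#?[^,#]+)*` group is replaced by a lazy `[^,]*?` plus a lookbehind, because nested quantifiers like that one can backtrack heavily in a backtracking engine such as Python's `re`. The name `FIELD5` and the sample lines are hypothetical.

```python
import re

# Skip four comma-separated fields, then look inside the fifth field only
# ([^,]*? cannot cross a comma) for a whole word not preceded by '#'.
FIELD5 = re.compile(
    r"^(?:[^,]*,){4}[^,]*?(?<!#)\b(?:Corona|Covid)\b",
    re.IGNORECASE,
)

# Hypothetical, shortened input lines.
lines = [
    "1,2,2020-03-12,JAMA_current,care for patients with Corona #Corona,Sprinklr",
    "1,2,2020-03-12,JAMA_current,care for #Corona patients,Sprinklr",
    "1,2,2020-03-12,Covid,no keyword here,Sprinklr",
]
for line in lines:
    print(bool(FIELD5.search(line)), line)
```

Like the grep version, this counts literal commas, so it assumes no quoted commas occur in the first five fields.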

In some sense, a simpler and more readable way to target a specific column is to use Awk.

awk -F, -v col=5 '{ field = tolower($col); gsub(/#[A-Za-z0-9_]+/, "", field) }
  field ~ /\<(corona|covid)\>/' filea.csv

In some more detail,

-F, says the field separator is comma
-v col=5 sets the variable col to the string "5" (the quotes are implicit here; feel free to add them when necessary; perhaps see also When to wrap quotes around a shell variable?)
The first line creates an internal variable field and normalizes it
- field = tolower($col) sets field to the colth field in the (comma-separated, per -F option) current input line, converted to lower case. Awk silently converts col from a string to a number where necessary.
- gsub(/#[A-Za-z0-9_]+/, "", field) replaces any matches on the regular expression with an empty string in field
field ~ /.../ prints any lines for which field (the relevant field after normalization) matches this regular expression.
- The regular expression also needs to be all-lowercase in order for it to match the lowercased version of the extracted field.
- The \< and \> anchors are how you indicate word boundaries in GNU Awk and some versions of grep; other Awk implementations may not support them.
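The normalization steps listed above can be mirrored in Python to see what each one does to the field. This is only an illustrative sketch: the function name `field_matches` is hypothetical, and `\b` stands in for the `\<` and `\>` anchors.

```python
import re

def field_matches(line, col=5):
    """Mirror the Awk script: naive comma split, lowercase, drop hashtags, word match."""
    fields = line.split(",")                 # like -F, (no CSV quoting support)
    if len(fields) < col:
        return False                         # Awk would see an empty field here
    field = fields[col - 1].lower()          # field = tolower($col)
    field = re.sub(r"#[A-Za-z0-9_]+", "", field)   # gsub(/#[A-Za-z0-9_]+/, "", field)
    return re.search(r"\b(?:corona|covid)\b", field) is not None

# Hypothetical, shortened input lines.
print(field_matches("a,b,c,d,with Corona #Corona,x"))
print(field_matches("a,b,c,d,only #Covid here,x"))
print(field_matches("a,b,c,d,coronavirus,x"))
```

Because hashtags are removed before matching, a line with both a plain word and a hashtag is still kept, unlike with a line-level `grep -v '#'` filter.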

However, in the general case, this script does not cope well with CSV files with complex quoting. You can make Awk parse such files correctly, but it will be significantly more complex. (In some more detail, commas are not field separators when they are inside a quoted field, surrounded by double quotes; and double quotes are not quoting when they are duplicated. There are variations, but this is the most common CSV dialect.) If you really need proper CSV support, perhaps switch to Python.
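To see the quoting rules described above in action, here is a minimal sketch comparing a naive comma split with Python's `csv` module. The row is hypothetical: its fourth field contains both a comma and a doubled double quote.

```python
import csv
import io

# Hypothetical row; field 4 is quoted and holds a comma and a doubled quote.
raw = '1,2,2020-03-12,"Doe, ""JD"" Jane",patients with Covid,Sprinklr\n'

# Naive split counts every literal comma, as cut -d, and awk -F, do.
naive = raw.rstrip("\n").split(",")

# csv.reader understands quoted fields and doubled quotes.
parsed = next(csv.reader(io.StringIO(raw)))

print(len(naive), repr(naive[4]))    # fifth piece is part of the quoted name
print(len(parsed), repr(parsed[4]))  # fifth field is the tweet text
```

The naive split yields seven pieces instead of six fields, so "column 5" no longer points at the tweet text.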

#!/usr/bin/env python3

import csv
import re
import sys

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)
for line in reader:
    if re.search(r'(?<!#)\b(?:Corona|Covid)\b', line[4], re.IGNORECASE):
        writer.writerow(line)

Very briefly, line will be an array of the fields in the current input line with indices starting at 0, so line[4] is the 5th field; and the regular expression uses a negative lookbehind (?<!#) to require the parenthesized main regex to not be preceded by a literal hash sign in order to be allowed to match. re.IGNORECASE says to match case-insensitively.

You would save this script as csvcovid.py and run it like python3 csvcovid.py <filea.csv

It's not exactly clear from your question what your conditions for a match are, but the word boundaries are an attempt to guesstimate what you mean. How exactly to indicate a word boundary depends somewhat on your regex variant; e.g. macOS wants [[:<:]] in front and [[:>:]] in the back instead of \</\> (trad. grep -E) or \b (Perl-style).

Demo for all of these: https://ideone.com/Xqr4wr
