
I have a 3-4M line CSV file (my_csv.csv) with two columns:

col1,col2
val11,val12
val21,val22
val31,val32
...

The CSV contains only two columns, with one comma per line. The col1 and col2 values are only strings (nothing else). The sample above is the output of `head my_csv.csv`.

I would like to check whether a string test_str is among the col2 values (exact equality, not a sub-string match). For example, if test_str = val12 the test should return True because val12 appears in column 2 (as shown in the example above), but if test_str = val1244 the code should return False.

In Python it would be something like:

import pandas as pd
df = pd.read_csv('my_csv.csv')
test_str = 'val42'

if test_str in df['col2'].to_list():
    # Expected to be True
    pass  # do the job here

But I have no clue how to do it in bash.

(I know that df['col2'].to_list() is not a good idea, but I didn't want to use built-in pandas functions, so that the code is easier to understand.)

Panda50
  • Parsing CSV is difficult... unless your fields do not contain commas, newlines... And you don't do what you want in bash, you do it using utilities like awk or grep that would also be available with dash, zsh or another shell. So, if you have a very simple CSV format you can use grep: `if grep -q ',.*val42.*' my_csv.csv; then ...; fi`. – Renaud Pacalet Jun 27 '22 at 06:25
  • Is the input file **really** a CSV file with all its bells and whistles? In particular, could it be that you have CSV fields where the content is a multiline string? Could it be that there are fields where the content contains a comma? Or, seeing it from a different position: is it possible that fields are quoted (as the CSV standard allows)? Without specifying this, it is difficult to give good advice. – user1934428 Jun 27 '22 at 06:26
  • Also, your example contains spaces between the comma and the value. Are these spaces significant? – user1934428 Jun 27 '22 at 06:27
  • Thanks for your answers. The CSV file only contains two columns with no spaces (I'll edit). It's only two columns with one comma per line. Columns 1 and 2 contain only strings. When I use `head my_csv.csv` I don't have any " for my strings. – Panda50 Jun 27 '22 at 06:30
  • @Panda50 Thanks for your explanations. Can you please edit your question to add them, such that other visitors can benefit from it and its answer(s)? Please also explain what you mean with "_if a string test_str is into the col2 values_" (exact equality, sub-string?). – Renaud Pacalet Jun 27 '22 at 06:38
  • @Panda50 Note that even if your CSV file is very simple you may encounter more complex ones one day or another. `python` and `panda` are thus a much better option than `awk`, `grep` or other general purpose utilities. And if you cannot use `python` and `panda` you could maybe install a dedicated command-line utility like [`csvgrep`](https://csvkit.readthedocs.io/en/1.0.5/scripts/csvgrep.html). – Renaud Pacalet Jun 27 '22 at 06:42
  • @RenaudPacalet thanks for your answer. I edited. I try my best to do things in bash instead of Python, but I think that, for this part, I may use Python because bash looks a bit too heavy. – Panda50 Jun 27 '22 at 07:26
  • Thanks for the edits. If you really want to do this with only bash (no awk, grep or other external utilities) then yes, this will be very, very slow (if this is what you mean by heavy). But if you use external utilities like grep, then it's super fast and simple: `if grep -q ',val42$' my_csv.csv; then ...; fi`. – Renaud Pacalet Jun 27 '22 at 07:35
  • @RenaudPacalet I tried your suggestion and it works exactly the same as the Python version. Could you please post your comment as an answer? It's lighter (in time and resources) than the accepted answer. – Panda50 Jun 27 '22 at 11:20
  • If you want to do this repeatedly for many values, it might make more sense to import the data into a database; then you don't need to traverse the entire file every time to check each value. Of course, if you know all the values in advance, you only really need to read the file once. – tripleee Jun 27 '22 at 12:24

3 Answers


Among the command-line utilities, awk is the best suited for handling CSV data:

awk -F, -v val='val22' '$2 == val {print "found a match:", $0}' file

found a match: val21,val22
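
If you only need the True/False result for a shell `if` (as in the Python snippet), here is a minimal sketch built on the same awk command, using its exit status and stopping at the first match (my_csv.csv and test_str are taken from the question):

test_str='val42'
if awk -F, -v val="$test_str" '$2 == val { found = 1; exit } END { exit !found }' my_csv.csv; then
   echo "found a match"   # do the job here
fi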

An equivalent bash loop would be like this:

while IFS=',' read -ra arr; do
   # arr[0] is col1, arr[1] is col2
   if [[ ${arr[1]} == 'val22' ]]; then
      echo "found a match: ${arr[0]},${arr[1]}"
   fi
done < file

But do keep in mind that a bash `while read` loop is extremely slow compared to awk; see [Bash `while read` loop extremely slow compared to `cat`, why?](https://stackoverflow.com/questions/13762625/bash-while-read-loop-extremely-slow-compared-to-cat-why)
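
If only a yes/no result is needed, a sketch of the same loop that stops at the first match (still far slower than awk on millions of lines):

test_str='val42'
found=0
while IFS=',' read -r col1 col2; do
   if [[ $col2 == "$test_str" ]]; then
      found=1
      break   # stop scanning once the string is found
   fi
done < file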

anubhava
  • Perhaps mention [Bash `while read` loop extremely slow compared to `cat`, why?](https://stackoverflow.com/questions/13762625/bash-while-read-loop-extremely-slow-compared-to-cat-why) to explain why the Awk solution should be preferred. – tripleee Jun 27 '22 at 12:22
  • Good point @tripleee, I will add it – anubhava Jun 27 '22 at 13:33

Parsing CSV is difficult... unless your fields contain no commas, newlines, etc. And you don't do what you want in pure bash: on a large file it would be extremely slow. You do it using utilities like awk or grep, which would also be available with dash, zsh or another shell. So, if you have a very simple CSV format, you can use, e.g., grep:

if grep -q ',val42$' my_csv.csv; then
  <do that>
fi

We can also put the string to search for in a variable, but remember that some characters have a special meaning in regular expressions and must be escaped. Example, assuming there are no special characters in the string to search for:

test_str="val42"
if grep -q ",$test_str$" my_csv.csv; then
  <do that>
fi
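
If the string may contain regex special characters, a hedged alternative that avoids escaping altogether is to cut out column 2 and match it as a fixed whole line (`-F` treats the pattern literally, `-x` matches entire lines):

test_str='val4.2'   # '.' would be special in a regex; -F sidesteps the issue
if cut -d, -f2 my_csv.csv | grep -qxF -- "$test_str"; then
  <do that>
fi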
Renaud Pacalet
  • Those are both testing if a string matching a regexp exists in the file, not if a string exists in the file. – Ed Morton Jun 27 '22 at 20:42

3-4M rows is a small file for awk. Might as well just do:

{m,g}awk 'END { exit !index($_,","(__)"\n") }' RS='^$' FS='^$' __="${test_str}"
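
For reference: RS='^$' makes gawk slurp the whole file into a single record, $_ evaluates to $0 (the entire input), and index() does a plain substring search for ",test_str\n", so the exit status is 0 exactly when some line ends in ,test_str. A sketch of how it might be used in an if (filename assumed; note that a last line without a trailing newline would not match, and awk command-line assignments like __=... interpret backslash escapes):

test_str='val42'
if gawk 'END { exit !index($_, "," (__) "\n") }' RS='^$' FS='^$' __="$test_str" my_csv.csv; then
   echo "found a match"   # do the job here
fi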
RARE Kpop Manifesto