0

So I have two files file.csv and file2.txt.

file1.csv is of the following nature:

aaa,bbb,ccc,ddd
aab,bba,ccd,eee
fff,ggg,hhh,iii
jjj,kkk,lll,mmm

and file2.txt is of the following nature:

aaa
jjj

So I wrote a piece of code, which finds if the contents of file2.txt exist in the first column of file1.csv and if the exist then delete the entire row.

Here is the command:

grep -ivf file2.txt file1.csv>output.csv

so after running this command i get the following outptut:

aab,bba,ccd,eee
fff,ggg,hhh,iii

what I am looking for is an alternate solution to this. I want to open file1.csv read the first column row by row(preferably load it into an array) compare it with file2.txt every time. My actual script si big and i dont want to create any temp files in between so ia m looking to load the contents of the first column of the file1 into an array then compare it with file2.txt and at the end dump the entire array into a csv file.

Bsquare ℬℬ
  • 4,423
  • 11
  • 24
  • 44
Anuj Kulkarni
  • 137
  • 10
  • 1
    Why can't you use the working solution? – choroba Jan 18 '19 at 09:16
  • I can use it but as I said since it operates directly on the files and I want perform further actions on file1.csv. If I use the current solution I have to create an output.csv file at the end of it and then do the remaining processing on output.csv rather than file1.csv. After processing output.csv i will have to create another file for the final output.Hence creation of temp files is too much and I want to eliminate it completely. Also I am looking for solutions other than my own. – Anuj Kulkarni Jan 18 '19 at 09:21
  • @tripleee I disagree about the duplicate marker => in this post, both files DO NOT have the same count of columns, so the solution of linked Q/A can not be applied. – Bsquare ℬℬ Jan 18 '19 at 15:27
  • @Bsquare I agree that this is not the correct duplicate but this is a common FAQ. The standard inner join works with a trivial modification but there are probably questions about exactly this with good answers too. – tripleee Jan 18 '19 at 15:42
  • Duplicate target updated. – tripleee Jan 18 '19 at 15:45

2 Answers2

1

Note: this is a shell command-line, but using additional tool.

Since CSV is kind of like a database table, tools that let you treat CSVs like database tables and let you perform SQL queries against them can be useful, if you happen to want to do an operation that can be expressed as a SQL query. One such tool is fsql. Assuming you have a header row on your CSV files:

file1.csv:

col1,col2,col3,col4
aaa,bbb,ccc,ddd
aab,bba,ccd,eee
fff,ggg,hhh,iii
jjj,kkk,lll,mmm

file2.txt:

col1
aaa
jjj

Then:

% fsql --add-csv file1.csv --add-csv file2.txt \
    "SELECT file1.* FROM file1 LEFT JOIN file2 ON file1.col1=file2.col WHERE file2.col1 IS NULL"

will result in:

col0,col1,col2,col3
aab,bba,ccd,eee
fff,ggg,hhh,iii

Pro's: you can perform more complex operations without having to deal with complicated cut or bash arrays. Con's: not the fastest solution.

Perl Ancar
  • 580
  • 3
  • 10
0

Stricto sensu, about what you asked:

I want to open file1.csv read the first column row by row(preferably load it into an array) compare it with file2.txt every time

You can perform it, this way:

#!/bin/bash

FILE1="/tmp/file1.csv"
FILE2="/tmp/file2.txt"

# Reads each row, one after the other.
while IFS= read -r row; do
  # Creates an array for the row.
  IFS=',' read -r -a rowAsArray <<< "$row"

  # Extracts the first column of the row.
  firstColumn="${rowAsArray[0]}"

  # Checks if this column exists in the filter file.
  # If it is NOT the case, prints the row which is kept.
  [ $( grep -ce "^$firstColumn$" "$FILE2" ) -eq 0 ] && echo "${rowAsArray[1]},${rowAsArray[3]}"

done < "$FILE1"

Edit:

  • I added comments to explain the script
  • in this version, you have the array like you want
  • at the end you can adapt which columns you want to show

    • for all of them, simply use "${rowAsArray[@]}"
    • for specific ones, like in my instance, use the corresponding index, and add which you want "${rowAsArray[2]}" ...
Bsquare ℬℬ
  • 4,423
  • 11
  • 24
  • 44
  • could you please elaborate on your logic here.Thanks – Anuj Kulkarni Jan 18 '19 at 09:45
  • Yes, I updated my question with a better version, and explanation; let me know if it is enough for you. – Bsquare ℬℬ Jan 18 '19 at 09:59
  • But i could create an array with all the elements in one row and then use only array[1] for comparison. My insistence on using arrays is due to the fact that my file1.csv will have 3-4 columns but my final output of the script will have more 10 columns. so I want to dump the elements of my array into the columns of my choice at the end. Thanks – Anuj Kulkarni Jan 18 '19 at 10:11
  • I understand, and why not using awk with all the print instruction you want then? You will get the same result. For instance `awk -F',' '{print $3, $5, $7}'` would return the fields 4, 6 and 8 of the considered row. – Bsquare ℬℬ Jan 18 '19 at 10:14
  • https://mywiki.wooledge.org/DontReadLinesWithFor and don't use `while read` or `grep` if you are using Awk anyway. – tripleee Jan 18 '19 at 10:17
  • It is exactly why I precised `this version only works if you do NOT have any space characters` and `Let me know if you need a version supporting space characters` – Bsquare ℬℬ Jan 18 '19 at 10:18
  • Anyway, I wrote a complete version supporting all space characters and more, avoiding a `for` loop. In addition @AnujKulkarni you will now have an true, and be able to do whatever you want with it ;) – Bsquare ℬℬ Jan 18 '19 at 10:44
  • @Bsquare thanks for the efforts mate.This version seems more complete.I will experiment with it now. – Anuj Kulkarni Jan 18 '19 at 10:46
  • @tripleee like I said in previous version of my answer, it was not supporting `space characters`. Anyway, to provide the highest quality, I updated my answer with a complete support version, avoiding usual GNU/Bash Pitfalls. What do you think of the final version? – Bsquare ℬℬ Jan 18 '19 at 10:52
  • Much better, but the standard Awk solution is probably still much preferable. See also https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice – tripleee Jan 18 '19 at 11:05
  • The `grep` in a loop is still a huge antipattern. See also http://www.iki.fi/era/unix/award.html#backticks (maybe especially the "clueless pretzel" subpage). Also, what's with the `grep -r`? – tripleee Jan 18 '19 at 11:06
  • @tripleee True the `-r` is useless (removed). I don't see any relation with backticks in this script? What would be a better solution than using `grep` here? I didn't see anything about the antipattern in your link. – Bsquare ℬℬ Jan 18 '19 at 11:29
  • Backticks are an obsolete syntax for `$(grep ...)`. The standard solution is to read the file into memory just once instead of reread it every time through the loop. – tripleee Jan 18 '19 at 11:35
  • Yes I agree about backticks; it is why I used $( ... ) ^^ For the fact ro read the file into memory ... even if it is not the case here, how would you manage heavy data volume? – Bsquare ℬℬ Jan 18 '19 at 12:38
  • @AnujKulkarni what gives your experiment? :) – Bsquare ℬℬ Jan 18 '19 at 15:25
  • @tripleee For the fact ro read the file into memory ... even if it is not the case here, how would you manage heavy data volume? – Bsquare ℬℬ Jan 18 '19 at 17:14
  • 1
    If `FILE2` is too big to fit it into memory all at once, partition it into smaller sets (ideally two or three should suffice) and only loop that many times. Or just import everythig to a database if your data is too big to be handled by ad-hoc shell script snippets anyway. – tripleee Jan 18 '19 at 18:45
  • @AnujKulkarni what gives your experiment? :) – Bsquare ℬℬ Jan 22 '19 at 10:55
  • @Bsquare it certainly has given me a direction although to make it fully work according to what I have in mind I might have to make a few changes.Unfortunately I have been stuck up with a different project altogether and havent been able to test this out fully.I will surely update once try it thoroughly. – Anuj Kulkarni Jan 22 '19 at 10:58
  • @AnujKulkarni On Stackoverflow you can give [up-vote](https://stackoverflow.com/help/privileges/vote-up) to people's helpful answers to thank them and select any one of the answer as [correct answer](https://stackoverflow.com/help/someone-answers) too out of all. – Bsquare ℬℬ Feb 07 '19 at 15:50