
I have a large .csv file from which I have to remove empty columns. By empty, I mean that the column has a header, but the rest of the column contains no data.

I've written a Bash script to try and do this, but am running into a few issues. Here's the code:

#!/bin/bash

total="$(head -n 1 Reddit-cleaner.csv | grep -o ',' | wc -l)"
i=1
count=0
while [ $i -le $total ]; do
        cat Reddit-cleaner.csv | cut -d "," -f$i | while read CMD; do if [ -n CMD ]; then count=$count+1; fi; done
        if [ $count -eq 1 ]; then
                cut -d "," -f$i --complement <Reddit-cleaner.csv >Reddit-cleanerer.csv
        fi
        count=0
        i=$i+1
done

First I find the number of columns and store it in total. Then, while the program has not passed the last column, I loop through the columns one at a time. The nested while loop counts the rows in the column that are not empty, and if only the header row is non-empty (count is 1), the script writes all other columns to another file.

I recognise that there are a few problems with this script. Firstly, the count modification occurs in a subshell, so count is never modified in the parent shell. Secondly, the file I am writing to will be overwritten every time the script finds an empty column.
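The subshell problem can be demonstrated in isolation: any variable modified inside a piped `while read` loop is lost when the pipeline ends, because the loop body runs in a subshell. A minimal sketch:

```shell
count=0
printf 'a\nb\n' | while read -r line; do
  count=$((count + 1))   # this runs in a subshell created by the pipe
done
echo "$count"            # still prints 0 in the parent shell
```

A common fix is to feed the loop via redirection or process substitution (`while read -r line; do ...; done < <(cmd)` in bash), which keeps the loop in the current shell.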

So my question then is how can I fix this. I initially wanted to have it so that it wrote to a new file column by column, based on count, but couldn't figure out how to get that done either.

Edit: People have asked for a sample input and output.

Sample input:
User, Date, Email, Administrator, Posts, Comments
a, 20201719, a@a.com, Yes, , 3
b, 20182817, b@b.com, No, , 4
c, 20191618, , No, , 4
d, 20190126, , No, , 2

Sample output:
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2

In the sample output, the column which has no data in it except for the header (Posts) has been removed, while the columns which are either entirely or partially filled remain.

anubhava
Ferex
  • Can you add your sample input data with expected output? – anubhava Feb 06 '20 at 15:29
  • you could use bash to call a python script and solve this with like 3 lines (load pandas csv / [remove emptys](https://stackoverflow.com/a/51794989/5476399) / save csv). Are you insisting on a bash solution, maybe because you can't guarantee python / pandas is available? – blkpingu Feb 06 '20 at 15:31
  • [maybe this will help then](https://superuser.com/questions/750651/delete-empty-column-from-csv-file-with-bash-script) – blkpingu Feb 06 '20 at 15:37
  • Also, please consider @anubhava's request. We can't help you if you don't give us a way to verify our idea of what the input and output is supposed to look like. – blkpingu Feb 06 '20 at 15:41
  • If you say "Bash only", you can't use any of these: `head`, `grep`, `wc`, `cat`, `cut` – what's the exact limitation? "No other scripting languages"? – Benjamin W. Feb 06 '20 at 15:54
  • That comes with Python: http://releases.ubuntu.com/16.04.6/ubuntu-16.04.6-desktop-amd64.manifest – Benjamin W. Feb 06 '20 at 18:09
  • Btw you should run your script(s) through https://www.shellcheck.net/ and read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice). The guys who invented shell to create/destroy files and processes and sequence calls to tools also invented awk as the tool for shell to call to manipulate text. – Ed Morton Feb 06 '20 at 19:54

4 Answers


I may be misinterpreting the question (due to its lack of example input and expected output), but this should be as simple as:

$ x="1,2,3,,4,field 5,,,six,7"
$ echo "${x//,+(,)/,}"
1,2,3,4,field 5,six,7

This requires bash with extglob enabled. Otherwise, you can use an external call to sed:

$ echo "1,2,3,,4,field 5,,,six,7" |sed 's/,,,*/,/g'
1,2,3,4,field 5,six,7
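Note that extglob is off by default in non-interactive shells, so in a script it must be switched on before the expansion runs; a minimal sketch:

```shell
shopt -s extglob                  # enable extended globs such as +(,)
x="1,2,3,,4,field 5,,,six,7"
echo "${x//,+(,)/,}"              # collapse each run of commas to one
# → 1,2,3,4,field 5,six,7
```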

There's a lot of redundancy in your sample code. You should really consider awk, since it already tracks the current field count (as NF) and the current line number (as NR); you could tally the fields with a simple total+=NF on each line. With the empty fields collapsed, awk can simply address the field number you want.

$ echo "1,2,3,,4,field 5,,,six,7" |awk -F ',+' '
  { printf "line %d has %d fields, the 6th of which is <%s>\n", NR, NF, $6 }'
line 1 has 7 fields, the 6th of which is <six>

This uses printf to report the number of records (NR, the current line number), the number of fields (NF), and the value of the sixth field ($6, which can also be referenced via a variable; e.g. $NF is the value of the final field, since awk is one-indexed).
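The total+=NF tally mentioned above can be sketched like this (the two-line input here is a made-up example, not the question's data):

```shell
printf '1,2,3\n4,5\n' | awk -F ',' '
  { total += NF }                              # NF = fields on this line
  END { print total " fields across " NR " lines" }'
# → 5 fields across 2 lines
```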

Adam Katz
  • Thank you, I'm only new to bash, so am currently only used to commands like cut and grep etc. I've never heard of awk before, but will look into it now that you have mentioned it. – Ferex Feb 06 '20 at 18:18
  • `awk` is a posix/bash programmer's best friend. I've stopped using `sed` for anything but simple replacements, and it makes `cut` feel like using safety scissors. – Adam Katz Feb 06 '20 at 18:30

This is really the job of a CSV parser, but you can use this awk script to get the job done:

cat removeEmptyCellsCsv.awk

BEGIN {
   FS = OFS = ", "
}
NR == 1 {
   for (i=1; i<=NF; i++)
      e[i] = 1  # initially all cols are marked empty
   next
}
FNR == NR {
   for (i=1; i<=NF; i++)
      e[i] = e[i] && ($i == "")
   next
}
{
   s = ""
   for (i=1; i<=NF; i++)
      s = s (i==1 || e[i-1] ? "" : OFS) (e[i] ? "" : $i)
   print s
}

Then run it as:

awk -f removeEmptyCellsCsv.awk file.csv{,}
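The `{,}` here is Bash brace expansion: it repeats the filename, so awk receives the file twice, once for each pass:

```shell
echo file.csv{,}    # expands to: file.csv file.csv
```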

Using the sample data provided in the question, it will produce the following output:

User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2

Note that the Posts column has been removed because it is empty in every record.

anubhava
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    if ( NR > 1 ) {
        for (i=1; i<=NF; i++) {
            if ( $i ~ /[^[:space:]]/ ) {
                gotValues[i]
            }
        }
    }
    next
}
{
    c=0
    for (i=1; i<=NF; i++) {
        if (i in gotValues) {
            printf "%s%s", (c++ ? OFS : ""), $i
        }
    }
    print ""
}

$ awk -f tst.awk file file
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
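A note on the `gotValues[i]` line: a bare array reference in awk creates the key without assigning it a value, and the later `i in gotValues` test then succeeds for exactly those columns. A minimal demonstration of this idiom:

```shell
awk 'BEGIN { seen[3]; print (3 in seen), (4 in seen) }'
# → 1 0
```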

See also What's the most robust way to efficiently parse CSV using awk? if you need to work with any more complicated CSVs than the one in your question.

Ed Morton

You can use Miller (https://github.com/johnkerl/miller) and its remove-empty-columns verb.

Starting from

+------+----------+---------+---------------+-------+----------+
| User | Date     | Email   | Administrator | Posts | Comments |
+------+----------+---------+---------------+-------+----------+
| a    | 20201719 | a@a.com | Yes           | -     | 3        |
| b    | 20182817 | b@b.com | No            | -     | 4        |
| c    | 20191618 | -       | No            | -     | 4        |
| d    | 20190126 | -       | No            | -     | 2        |
+------+----------+---------+---------------+-------+----------+

and running

mlr --csv remove-empty-columns input.csv >output.csv

you will have

+------+----------+---------+---------------+----------+
| User | Date     | Email   | Administrator | Comments |
+------+----------+---------+---------------+----------+
| a    | 20201719 | a@a.com | Yes           | 3        |
| b    | 20182817 | b@b.com | No            | 4        |
| c    | 20191618 | -       | No            | 4        |
| d    | 20190126 | -       | No            | 2        |
+------+----------+---------+---------------+----------+
aborruso