1

I have a csv file with hundreds of columns and 80 rows. I need to remove the first 13 columns from the file. I have tried using:

cut -d, -f1-13 --complement input.csv > output.csv

But the issue is one of the first columns contains quoted text including commas and cut can't deal with this format. Example input:

HeaderA, HeaderB, HeaderC, HeaderD, HeaderE, HeaderF, HeaderG, HeaderH A, "B, B", C, "D, D, D, D", E, F, G, H A, "B, B", C, "D, D, D, D", E, F, G, H A, "B, B", C, "D, D, D, D", E, F, G, H

Desired output:

HeaderE, HeaderF, HeaderG, HeaderH E, F, G, H E, F, G, H E, F, G, H

In the example I am working with a smaller number of columns in similar conditions. Ideally the solution should be extendable to any number of columns and rows (13 columns and 80 rows in my case). I have tried looking for solutions with awk, but often they include printing all columns except those to remove and this isn't a practical solution due to the sheer number of columns.

Renato
  • 15
  • 5
  • Possible duplicate of [Dealing with commas in a CSV file](https://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file) – zzevannn Oct 30 '18 at 17:54
  • 1
    There is no good way to do this in vanilla bash. Is there any chance of using a language which has csv parsing included, like python? – Alex Stiff Oct 30 '18 at 17:55
  • @Alex Stiff how do you know that? – oguz ismail Oct 30 '18 at 18:01
  • 1
    @AlexStiff I think the two answers submitted disprove your claim. – CJK Oct 30 '18 at 18:03
  • Look into using [csvkit](https://csvkit.readthedocs.io/en/1.0.3/) – Shawn Oct 30 '18 at 18:50
  • 1
    The answers fail to cope with doubled `""` as used in some CSV dialects to quote quotes within a quoted field. – tripleee Oct 31 '18 at 01:20
  • @CJK - one of the submitted answers was done in Python, not bash, and the other does not handle quotes in quoted strings. – Alex Stiff Oct 31 '18 at 11:35
  • @AlexStiff The OP has stated the conditions he requires a potential solution to be able to operate under. Quotes in quoted strings is not one of them. As you can see from my note, this did cross my mind, and I do have a solution for it, but as it is not required, there's little value in muddying the water with overly complex regex that isn't fruitful. **PS.** The second answer to which I referred was done in bash, but has since been deleted for some reason. It was authored by oguzismail. – CJK Oct 31 '18 at 11:40
  • @CJK ah sorry - didn't notice you were the author of the bash response. I have voted it up because it works for this specific example. – Alex Stiff Oct 31 '18 at 11:50

2 Answers2

1

Something like this would work given the specific conditions you've set out:

grep -oE '(("[^"]+"|[^,]+),?){5}$' input.csv > output.csv

where the 5 in {5} is replaced with the number of columns you want remaining. Thus, given N columns, of which the first 13 are to be omitted, this value would be N - 13.

Note: The regular expression won't match correctly if any of the quoted column data, themselves, contain quotes, e.g. "some \"data\" in quotes", although the pattern can be adjusted to cater for that (but it becomes quite complex).

CJK
  • 5,732
  • 1
  • 8
  • 26
  • Thank you, I was only able to test your solution today and it worked, even if it is a bit slower than expected to process. I will try testing the python pandas suggestion as well even if I'd prefer not complicating my script too much. – Renato Nov 02 '18 at 07:22
  • Unfortunately I have difficulty implementing your solution on larger files, I am unable to open them in an editor to see how many columns are present in total and I can't make awk correctly count them due to the comma in quotes issue. Therefore I am unable to perform the N-13 calculation necessary for your solution. EDIT: I will still mark your answer as correct as it works for smaller files. – Renato Nov 02 '18 at 16:43
  • What if you used `head -n 1 input.csv` to extract the first line of the file, i.e. the header row, and count the columns in that one row. You can use a similar `grep` to above but without the `{5}`, or a simpler version assuming the headers don’t feature quotes. The whole thing would be done like this: `head -n 1 input.csv | grep -oE '[^,]+' | wc -l`, the result returned being the total number of columns. – CJK Nov 13 '18 at 06:22
  • Thank you for the suggestion, the suggested command works fine. – Renato Dec 03 '18 at 09:50
1

You can do this using pandas in python.

To do so you can write a simple function that does the following:

  • Load the csv to pandas dataframe
  • Remove the columns that you don't need
  • Save the everything back to a new csv file or back to the same.

Code:

import pandas as pd

def remove_columns_from_csv(data_path):
    #loading the file
    data_file = pd.read_csv(data_path)
    data = pd.dateFrame(data_file)

    # Use this list if you want to remove by columns names 
    columns_list = ['first_columns', 'second_column']

    # Use this list if you want to remove by index
    columns_list = [0,1,3] # subselection of columns you want removed
    data = data.drop(columns=columns_list)

    # saving the data back to a csv
    file_name = 'type the file name here'
    data.to_csv(file_name, sep='\t', encoding='utf-8')


# Function call    
datapath = 'C:\\Users\\default_user\\Documents\\csv_filename.csv'
remove_columns_from_csv(datapath)

Note: index here starts from zero, so if your columns number is 1 in csv it will correspond to 0 in python.

Rohandeep Singh
  • 59
  • 2
  • 11
  • I am trying to test your answer, does it require the specific column headers fir the list or can I use the column number or position to identify how many columns to cut? – Renato Nov 02 '18 at 07:41
  • Looking at the pandas documentation it seems that the specific labels should be used in the list, which is not very practical as I need to loop the command to apply on multiple csv files that have headers with different prefixes. I need a solution that removes columns based on position/order and not specific label name. I am sorry I didn't specify this in the original question. – Renato Nov 02 '18 at 07:53
  • Sure you can remove columns based on position(index). You can simply change the list of labels with the list of indices. I will update the solution and see if that is of any use to you. – Rohandeep Singh Nov 02 '18 at 12:39
  • Sorry to bother you further, I am trying to implement your solution but am not very familiar with python. Could you explain better how to use your script? I have installed pandas for both python 2 and 3 on my system and have entered the path to the file where indicated, but still am receiving various errors on the terminal. – Renato Nov 02 '18 at 17:05
  • In particular, I am now receiving: ValueError: DataFrame constructor not properly called! – Renato Nov 02 '18 at 17:10
  • It could the data path to the file I believe. datapath example in code: datapath = 'C:\\Users\\singroha\\Documents\\csv_filename.csv' – Rohandeep Singh Nov 02 '18 at 17:31