
I have big_file.csv containing a bunch of company information. Here's a snippet

CompanyName, CompanyNumber,RegAddress.CareOf,...
"! # 1 AVAILABLE LOCKSMITH LTD","05905727","",...
"!NSPIRED LIMITED","06019953",""...
"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""...

I only need the CompanyName and CompanyNumber fields, so I did the following:

cut -d, -f 1,2 big_file.csv > big_file_names_codes_only.csv

As you can see though (and I understand why), the third entry in big_file.csv gets cut after the first comma, which is actually part of CompanyName. I know how to remove the first comma with sed (but that would break the whole CSV structure), so I was wondering if any of you knew how to remove the comma from the first field (it's always in position 1), a "string, with, commas, or not and non-alphanumeric chars!".
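To illustrate the problem for other readers, here is a minimal sketch (Python's standard csv module, sample row copied from the snippet above) of the difference between a plain comma split, which is effectively what cut -d, does, and a quote-aware parser:

```python
import csv
import io

# The problematic row from the sample above.
line = '"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""'

# Naive splitting (what cut -d, does): the comma inside the quotes splits the name.
naive = line.split(",")
print(naive[0])  # truncated at the embedded comma

# A quote-aware parser keeps the field intact.
row = next(csv.reader(io.StringIO(line)))
print(row[0])
```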

So basically the intermediate output I am looking for is:

CompanyName, CompanyNumber
"! # 1 AVAILABLE LOCKSMITH LTD","05905727"
"!NSPIRED LIMITED","06019953"
"CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD","07981734"

But this last line becomes:

"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD"

Once I get this intermediate output I need to clean the company name of all non-alphanumeric characters and leading spaces, which works very well with this:

sed -i 's/[^a-zA-Z0-9 ,]//g; s/^[ \t]*//'

In the end my file should be:

CompanyName, CompanyNumber,RegAddress.CareOf,...
AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
Toto
Tytire Recubans

10 Answers


It's always better to work with structured data like CSV files with embedded commas in fields using tools that are actually aware of the format, instead of trying to hack something together with regular expressions (the same goes for XML, JSON, etc.). In the long run it's a lot easier and will save you a ton of pain with edge cases and odd data that doesn't exactly match your expectations.

The csvkit set of utilities has a bunch of useful command line tools and is commonly available via OS package managers:

$ csvcut -c CompanyName,CompanyNumber blah.csv                                              
CompanyName,CompanyNumber
! # 1 AVAILABLE LOCKSMITH LTD,05905727
!NSPIRED LIMITED,06019953
"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD",07981734

You can then continue to use sed to remove the characters you're not interested in.

(Note: I had to get rid of extra spaces in the header line of your sample data for this to work)


Edit: Also, a Perl version using the handy Text::AutoCSV module, which strips out the characters:

$ perl -MText::AutoCSV -e 'Text::AutoCSV->new(out_fields => [ "COMPANYNAME", "COMPANYNUMBER" ],
               read_post_update_hr => sub {
                 my $hr = shift;
                 $hr->{"COMPANYNAME"} =~ s/[^[:alnum:]\s]+//g;
                 $hr->{"COMPANYNAME"} =~ s/^\s+//;
               })->write();' < blah.csv | sed -e 's/"//g'
CompanyName,CompanyNumber
1 AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
Shawn
  • Hi Shawn - I come back to you after some time because I now find myself performing the same operation (eliminating extra commas in a csv enclosed in double quotes) not just on the 1st column (here it was company name) but on all columns. Would you be so kind to help me out again? Thanks :) – Tytire Recubans Mar 11 '19 at 20:40

A solution with awk

$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$2}' big_file.csv
CompanyName, CompanyNumber
"! # 1 AVAILABLE LOCKSMITH LTD","05905727"
"!NSPIRED LIMITED","06019953"
"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734"

My suggestion would be to use a programming language like R, Python, or Perl for such tasks.

Sonny

Similar to @Sonny's solution, but using the gsub function from GNU awk to trim the quotes and commas from the output per your expected output, and prioritizing fields enclosed in quotes over those that are not:

awk -vFPAT='("[^"]+")|([^,]*)' -vOFS=, '{for(n=1;n<3;++n)gsub(/^"|"$|,/,"",$n);print$1,$2}' big_file.csv

This outputs:

CompanyName, CompanyNumber
! # 1 AVAILABLE LOCKSMITH LTD,05905727
!NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
blhsing

You were almost there.

I don't know how many commas there are on the first line, but if it's only the company name and company number you need, this is probably the shortest you can get in bash.

The easiest way to get rid of the unwanted characters is xargs; after we run xargs -L1, things look better:

xargs -L1

Output:

CompanyName, CompanyNumber,RegAddress.CareOf,...
! # 1 AVAILABLE LOCKSMITH LTD,05905727,,...
!NSPIRED LIMITED,06019953,...
CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD,07981734,...

Now we can add cut -f1,2,3 which you have tried I guess

xargs -L1 | cut -d, -f1,2,3

Output:

CompanyName, CompanyNumber,RegAddress.CareOf
! # 1 AVAILABLE LOCKSMITH LTD,05905727,
!NSPIRED LIMITED,06019953,...
CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD,07981734

Okay, now I ran into the same problem as in your example: we got the number behind LTD as well, since we added field 3 to cut, but the unwanted characters at the end still exist:

Solution: read the file with sed and pipe it through xargs -L1!

sed 's/,...$//;s/,$//;s/, / /g' big_file.csv

Putting it all together:

sed 's/,...$//;s/,$//;s/, / /g' big_file.csv|xargs -L1|cut -d, -f1,2

End Result:

 CompanyName CompanyNumber,RegAddress.CareOf
 ! # 1 AVAILABLE LOCKSMITH LTD,05905727
 !NSPIRED LIMITED,06019953
 CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734

Edit

Since I forgot a comma before my edit, I found a better solution:

sed 's/,\ / /g' big_file.csv|xargs -L1|cut -d, -f1,2
wuseman

Using Perl

$ perl -lne ' if($.>1) { /^"(.+?)","(.+?)"/ ;$x=$1;$y=$2; $x=~s/[,]//g; print "$x,$y" } 
             else { print } ' big_file.csv
CompanyName, CompanyNumber,RegAddress.CareOf,...
! # 1 AVAILABLE LOCKSMITH LTD,05905727
!NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734

$
stack0114106

You can try this sed:

sed -E '
  :A
    s/^("[^,"]*),(.*)/\1\2/
    # label A if CompanyName can have more than 1 comma
    tA
  s/"//g;s/([^,]*,[^,]*).*/\1/
' big_file.csv
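The :A label and tA pair form a loop: t branches back to A whenever the preceding s/// made a substitution, so one comma inside the leading quoted field is deleted per pass until none remain; the final line then strips the quotes and keeps the first two fields. A rough Python equivalent of the same logic (hypothetical sample row assumed):

```python
import re

line = '"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""'

# Loop like sed's :A ... tA: delete the first comma that falls inside the
# leading quoted field, and repeat until no substitution happens.
while True:
    new = re.sub(r'^("[^,"]*),(.*)', r"\1\2", line)
    if new == line:
        break
    line = new

# s/"//g  -- drop the quotes
line = line.replace('"', "")
# s/([^,]*,[^,]*).*/\1/  -- keep only the first two fields
line = re.sub(r"([^,]*,[^,]*).*", r"\1", line, count=1)
print(line)
```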
ctac_

awk is your friend

maybe this helps

➜  ~  awk 'BEGIN {FS="\",\""} { printf "%s, %s \n",$1,$2 }' big_file.csv | tr -d '\"'
CompanyName, CompanyNumber,RegAddress.CareOf,...,
! # 1 AVAILABLE LOCKSMITH LTD, 05905727
!NSPIRED LIMITED, 06019953
CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD, 07981734
Hernan Garcia

With GNU awk for FPAT:

$ cat tst.awk
BEGIN { FPAT="\"[^\"]+\"|[^,]*"; OFS="," }
NR == 1 { print; next }
{
    for (i=1; i<=NF; i++) {
        gsub(/[^[:alnum:]]+/," ",$i)
        gsub(/^ | $/,"",$i)
    }
    print $1, $2
}

$ awk -f tst.awk file
CompanyName, CompanyNumber,RegAddress.CareOf,...
1 AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734

See What's the most robust way to efficiently parse CSV using awk? if you need to work with any more exotic CSVs.

Ed Morton
  • Hi @Ed Morton, thanks for the answer! I am working on the same big_file.csv again - turns out other columns have extra commas in their strings - not just "CompanyName". Is there a way to perform this operation on all columns of the csv (delete all non alphanumeric values in the strings of all columns of the csv)? I just started to learn AWK! – Tytire Recubans Mar 11 '19 at 21:26
  • You're welcome. My script already does what you're asking for in your comment, it doesn't make any assumptions/restrictions about any specific field. Did you try it? – Ed Morton Mar 11 '19 at 21:29
  • Oh thank you! I was trying to parse it mentally bit by bit. I guess if I want to print all cols and not just the first 2 I simply replace `print $1, $2` with `print` right? :) – Tytire Recubans Mar 11 '19 at 21:36
  • Right. – Ed Morton Mar 11 '19 at 21:38

Based on the input from two of the answers below, I tried several approaches:

  1. Combine xargs -L1 with csvcut and sed:
    • First get rid of the extra leading space in the second column's header with: sed -i '0,/ CompanyNumber/ s//CompanyNumber/' big_file.csv
    • Then: sed 's/,\ / /g' big_file.csv | xargs -L1 | csvcut -c CompanyName,CompanyNumber > big_file_cleaned.csv

This worked, but with 4M rows and several columns it was extremely slow.

  2. A solution in Perl from one of the kind contributors!
    • First clean the CompanyName column with Perl: perl -lne ' if($.>1) { /^"(.+?)","(.+?)"/ ;$x=$1;$y=$2; $x=~s/[,]//g; print "$x,$y" } else { print } ' big_file.csv > big_file_clean.csv
    • Then filter out only the columns I need with: csvcut -c CompanyName,CompanyNumber big_file_clean.csv > big_file_clean_namecodesonly.csv

THANK YOU

Tytire Recubans
awk 'NR>1{gsub(/"/,"")sub(/.{4}$/,"")gsub(/!|,$/,"")sub(/, /," ")sub(/.{5}A/,"A")}1' file

CompanyName, CompanyNumber,RegAddress.CareOf,...
AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHAPY AND TRAINING LTD,07981734 
Claes Wikner