1

This is my sample data in csv file. As you can see for ID = '51126' there is a column which has data in word wrap format, The data is entered using atl+enter. I need to remove word wrap and enter into a single line for entire csv file. There are many such word wraps in the file!

ID,OPPORTUNITY ID,CREATED_DATE,TIR NAME,MS Rep,SRC_SSR_REP,REGION,HP PBM NAME,COMPANY NAME,COMPANY ADDRESS,COMPANY CITY,COMPANY STATE,COMPANY ZIPCODE,COMPANY AMID,COMPANY USER CONTACT NAME,COMPANY USER TITLE,COMPANY USER PHONE,COMPANY USER EMAIL,PARTNER COMPANY NAME,PARTNER REP NAME,PARTNER REP EMAIL,PARTNER LID,WHOLESALER,PURCHASEDGE AC NUMBER,USAGE PERIOD,DEAL TYPE,CLWB WORKED ON,DEAL NUMBER,NAMED TERRITORY SLED,MONO HP SHARE %,COLOR HP SHARE %,TOTAL HP TONER SHARE %,DEAL VALUE MONO,DEAL VALUE COLOR,TOTAL TONER DEAL VALUE,EST DISCOUNT VALUE,REBATE TYPE MONO,REBATE TYPE COLOR,DISCOUNT TYPE,DEAL START DATE,DEAL END DATE,DEAL EXTENDED END DATE,DEAL POSITION,ECLIPSE ID,ECLIPSE DEAL STATUS,ECLIPSE APPROVED DATE,ECLIPSE DEAL APPROVED BY,LOST REASON,USAGE FILE LOCATION,CREARTED BY,MODIFIED BY,MODIFIED DATE,FINALISATION_RECEIVED_DATE,FINALISATION_WORKED_DATE,DEAL_PROCESSED_BY,DEAL_FINALISED_BY,FUNNEL_COMMENT,AV_SENT_DATE,PL_REMAN_VALUE,PL_REMAN_SHARE,FINALISATION_DOC_PATH,TIME ELAPSING ON,APPROVAL SENT DATE,APPROVAL RECEIVED DATE,SECONDARY_WHOLESALER,PREVIOUSECLIPSE_ID,PurchasEdge_(Y/N),HP_TONER_UNITS,PL_REMAN_UNITS,FINALISATION_COMMENTS,RENEWAL_POSITION,PROGRAM_NAME,CUSTOMERONBOARDEDON
51128,OPP-048699,3/23/2020 21:02,Adam Dohm,Cheryl Glenn,Tiffany Debose,MARKET SOURCE,,"Flathead Valley School District (Kalispell, Whitefish, Columbia Falls)",233 1st Ave E,Kalispell,MT,59901,,Joe Biangone,Purchasing,406-758-8392,biangonej@sd5.k12.mt.us,TONERPORT INCORPORATED, ,,10293955,ESSENDANT,,12 months,Renewal,,CL091515474R4-A,SLED,97,100,98,21592,16781,38373,2452,Defend,Defend,Defend,4/15/2020 0:00,4/14/2021 0:00,4/14/2021 0:00,Won,42921984,,,,,/E/Data/Funnel/Submit/FLATHEAD VALLEY SCHOOL DISTRICT USAGE_51128.xlsx,Tiffany Debose,Tiffany Debose,3/26/2020 14:49,3/26/2020 0:00,,Bhavana P V,,,,613.97,1.6,,,,,NA,42085906,N,179,3,3/26 - Deal added on eclipse ,,SMBA,
51126,OPP-048697,3/23/2020 19:52,Xavier Weems,,Tiffany Debose,EAST,Vladimir Jaksic,"Gray Television, Inc.","​Gray Television, Inc.
4370 Peachtree Rd, NE.
​Atlanta, Ga  30319
​

",,GA,30319,DN042973875,Dottie Boudreau,Manager,404-266-8333,dottie@gray.tv,"STAPLES, INC", ,,"10264576,10252948",NA,,12 months,New,,CL200351126,Commercial - Named,84,89,86,16143,7335,23478,3149,Defend,Defend,Defend,,,,AV summary and PPT sent,,,,,,"/E/Data/Funnel/Submit/GRAY TELEVISION, INC USAGE_51126.xlsb",Tiffany Debose,Tiffany Debose,3/26/2020 8:55,,,Deepthi K,,,3/26/2020 0:00,3239.96,13.8,,6/24/2020 0:00,,,NA,,N,168,27,3/24/2020 - sent for specialist approval 3/26/2020 - aV sent,,MCBigDeal,
51125,OPP-048696,3/23/2020 18:01,Xavier Weems,,Tiffany Debose,WEST,Jenni HoGlin,STURM FINANCIAL GROUP,3033 East First Avenue,Denver,CO,80206,,,,,,"STAPLES, INC", ,,"10264576,10252948",NA,,12 months,New,,CL200351125,Commercial - Non Named,42,87,65,10201,14198,24399,6369,Winback,Defend,Winback,,,,AV summary and PPT sent,,,,,,/E/Data/Funnel/Submit/STURM FINANCIAL GROUP USAGE_51125.xlsx,Tiffany Debose,Tiffany Debose,3/24/2020 7:49,,,Teja Ravi,,,3/24/2020 0:00,8417.66,34.5,,6/22/2020 0:00,,,NA,,N,127,67,3/24-AV Summary and PPT sent,,SMBA,

Output should be like below. I've entered only ID= 51126 and 51125 for your reference, there will be 51128 too! There are 73 columns!

"ID","OPPORTUNITY ID","CREATED_DATE","TIR NAME","MS Rep","SRC_SSR_REP","REGION","HP PBM NAME","COMPANY NAME","COMPANY ADDRESS","COMPANY CITY","COMPANY STATE","COMPANY ZIPCODE","COMPANY AMID","COMPANY USER CONTACT NAME","COMPANY USER TITLE","COMPANY USER PHONE","COMPANY USER EMAIL","PARTNER COMPANY NAME","PARTNER REP NAME","PARTNER REP EMAIL","PARTNER LID","WHOLESALER","PURCHASEDGE AC NUMBER","USAGE PERIOD","DEAL TYPE","CLWB WORKED ON","DEAL NUMBER","NAMED TERRITORY SLED","MONO HP SHARE %","COLOR HP SHARE %","TOTAL HP TONER SHARE %","DEAL VALUE MONO","DEAL VALUE COLOR","TOTAL TONER DEAL VALUE","EST DISCOUNT VALUE","REBATE TYPE MONO","REBATE TYPE COLOR","DISCOUNT TYPE","DEAL START DATE","DEAL END DATE","DEAL EXTENDED END DATE","DEAL POSITION","ECLIPSE ID","ECLIPSE DEAL STATUS","ECLIPSE APPROVED DATE","ECLIPSE DEAL APPROVED BY","LOST REASON","USAGE FILE LOCATION","CREARTED BY","MODIFIED BY","MODIFIED DATE","FINALISATION_RECEIVED_DATE","FINALISATION_WORKED_DATE","DEAL_PROCESSED_BY","DEAL_FINALISED_BY","FUNNEL_COMMENT","AV_SENT_DATE","PL_REMAN_VALUE","PL_REMAN_SHARE","FINALISATION_DOC_PATH","TIME ELAPSING ON","APPROVAL SENT DATE","APPROVAL RECEIVED DATE","SECONDARY_WHOLESALER","PREVIOUSECLIPSE_ID","PurchasEdge_(Y/N)","HP_TONER_UNITS","PL_REMAN_UNITS","FINALISATION_COMMENTS","RENEWAL_POSITION","PROGRAM_NAME","CUSTOMERONBOARDEDON"
"51126","OPP-048697","3/23/2020 19:52",Xavier Weems","","Tiffany Debose","EAST","Vladimir Jaksic","Gray Television, Inc.","​Gray Television, Inc. 4370 Peachtree Rd, NE. Atlanta, Ga  30319","","GA","30319","DN042973875","Dottie Boudreau","Manager","404-266-8333","dottie@gray.tv","STAPLES, INC","","","10264576,10252948","NA","","12 months","New","","CL200351126","Commercial - Named","84","89","86","16143","7335","23478","3149","Defend","Defend","Defend","","","","AV summary and PPT sent","","","","","","/E/Data/Funnel/Submit/GRAY TELEVISION, INC USAGE_51126.xlsb","Tiffany Debose","Tiffany Debose","3/26/2020 8:55","","","Deepthi K","","","3/26/2020 0:00","3239.96","13.8","","6/24/2020 0:00","","","NA","","N","168","27","3/24/2020 - sent for specialist approval 3/26/2020 - aV sent","","MCBigDeal",""
"51125","OPP-048696","3/23/2020 18:01","Xavier Weems","","Tiffany Debose","WEST","Jenni HoGlin","STURM FINANCIAL GROUP","3033 East First Avenue","Denver","CO","80206","","","","","","STAPLES, INC","","","10264576,10252948","NA","","12 months","New","","CL200351125","Commercial - Non Named","42","87","65","10201","14198","24399","6369","Winback","Defend","Winback","","","","AV summary and PPT sent","","","","","","/E/Data/Funnel/Submit/STURM FINANCIAL GROUP USAGE_51125.xlsx","Tiffany Debose","Tiffany Debose","3/24/2020 7:49","","","Teja Ravi","","","3/24/2020 0:00","8417.66","34.5","","6/22/2020 0:00","","","NA","","N","127","67","3/24-AV Summary and PPT sent","","SMBA",""

I have tried the below code to remove the word wrap!

awk -F '"[^"]+"' 'NF<73{s = s $0; next} s{print s; s=""} 1; END{if (s) print s}' file

and also

awk -F, 'NF!=73&&!line{line=$0;next} NF!=73&&line{line=line $0} {n=split(line, a, ",")} n==73{print line;line=""}' file.csv

Nothing seems to be actually working!

Please suggest the linux code without any usage of external unix packages

James Z
  • 12,209
  • 10
  • 24
  • 44
Alekhya varma
  • 93
  • 2
  • 8
  • This is a fairly involved problem, but you are aided by the fact that not all fields are double-quoted in the input file. Since `awk` processes a record (line) at a time, you will need to check if the first character in the field is `'"'`. If so, start a counter and continue concatenating lines to the current record (variable representing that record) until the next `'"'` is found and then check that your total field count is 73. Using `NF != 73` is a bit nebulous since your fields can contain embedded `','` characters, e.g. `"Atlanta, Ga 30319"` – David C. Rankin Apr 22 '20 at 02:07
  • Often with CSVs like that (e.g. as exported from MS-Excel) the newline within the quoted field is `\n` alone while the newline at the end of each record is `\r\n` - is that the case with your file or not? If that's the case then you can trivially parse each record by just setting `RS='\r\n'` in GNU awk and removing all `\n`s within each record. Run `cat -v file` and let us know if/where you seen `^M`s show up. – Ed Morton Apr 22 '20 at 10:40

3 Answers3

2

Try this

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }'  

Demo

$gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }'  < file1.txt
ID,OPPORTUNITY ID,CREATED_DATE,TIR NAME,MS Rep,SRC_SSR_REP,REGION,HP PBM NAME,COMPANY NAME,COMPANY ADDRESS,COMPANY CITY,COMPANY STATE,COMPANY ZIPCODE,COMPANY AMID,COMPANY USER CONTACT NAME,COMPANY USER TITLE,COMPANY USER PHONE,COMPANY USER EMAIL,PARTNER COMPANY NAME,PARTNER REP NAME,PARTNER REP EMAIL,PARTNER LID,WHOLESALER,PURCHASEDGE AC NUMBER,USAGE PERIOD,DEAL TYPE,CLWB WORKED ON,DEAL NUMBER,NAMED TERRITORY SLED,MONO HP SHARE %,COLOR HP SHARE %,TOTAL HP TONER SHARE %,DEAL VALUE MONO,DEAL VALUE COLOR,TOTAL TONER DEAL VALUE,EST DISCOUNT VALUE,REBATE TYPE MONO,REBATE TYPE COLOR,DISCOUNT TYPE,DEAL START DATE,DEAL END DATE,DEAL EXTENDED END DATE,DEAL POSITION,ECLIPSE ID,ECLIPSE DEAL STATUS,ECLIPSE APPROVED DATE,ECLIPSE DEAL APPROVED BY,LOST REASON,USAGE FILE LOCATION,CREARTED BY,MODIFIED BY,MODIFIED DATE,FINALISATION_RECEIVED_DATE,FINALISATION_WORKED_DATE,DEAL_PROCESSED_BY,DEAL_FINALISED_BY,FUNNEL_COMMENT,AV_SENT_DATE,PL_REMAN_VALUE,PL_REMAN_SHARE,FINALISATION_DOC_PATH,TIME ELAPSING ON,APPROVAL SENT DATE,APPROVAL RECEIVED DATE,SECONDARY_WHOLESALER,PREVIOUSECLIPSE_ID,PurchasEdge_(Y/N),HP_TONER_UNITS,PL_REMAN_UNITS,FINALISATION_COMMENTS,RENEWAL_POSITION,PROGRAM_NAME,CUSTOMERONBOARDEDON
51128,OPP-048699,3/23/2020 21:02,Adam Dohm,Cheryl Glenn,Tiffany Debose,MARKET SOURCE,,"Flathead Valley School District (Kalispell, Whitefish, Columbia Falls)",233 1st Ave E,Kalispell,MT,59901,,Joe Biangone,Purchasing,406-758-8392,biangonej@sd5.k12.mt.us,TONERPORT INCORPORATED, ,,10293955,ESSENDANT,,12 months,Renewal,,CL091515474R4-A,SLED,97,100,98,21592,16781,38373,2452,Defend,Defend,Defend,4/15/2020 0:00,4/14/2021 0:00,4/14/2021 0:00,Won,42921984,,,,,/E/Data/Funnel/Submit/FLATHEAD VALLEY SCHOOL DISTRICT USAGE_51128.xlsx,Tiffany Debose,Tiffany Debose,3/26/2020 14:49,3/26/2020 0:00,,Bhavana P V,,,,613.97,1.6,,,,,NA,42085906,N,179,3,3/26 - Deal added on eclipse ,,SMBA,
51126,OPP-048697,3/23/2020 19:52,Xavier Weems,,Tiffany Debose,EAST,Vladimir Jaksic,"Gray Television, Inc.","​Gray Television, Inc.4370 Peachtree Rd, NE.​Atlanta, Ga  30319​",,GA,30319,DN042973875,Dottie Boudreau,Manager,404-266-8333,dottie@gray.tv,"STAPLES, INC", ,,"10264576,10252948",NA,,12 months,New,,CL200351126,Commercial - Named,84,89,86,16143,7335,23478,3149,Defend,Defend,Defend,,,,AV summary and PPT sent,,,,,,"/E/Data/Funnel/Submit/GRAY TELEVISION, INC USAGE_51126.xlsb",Tiffany Debose,Tiffany Debose,3/26/2020 8:55,,,Deepthi K,,,3/26/2020 0:00,3239.96,13.8,,6/24/2020 0:00,,,NA,,N,168,27,3/24/2020 - sent for specialist approval 3/26/2020 - aV sent,,MCBigDeal,
51125,OPP-048696,3/23/2020 18:01,Xavier Weems,,Tiffany Debose,WEST,Jenni HoGlin,STURM FINANCIAL GROUP,3033 East First Avenue,Denver,CO,80206,,,,,,"STAPLES, INC", ,,"10264576,10252948",NA,,12 months,New,,CL200351125,Commercial - Non Named,42,87,65,10201,14198,24399,6369,Winback,Defend,Winback,,,,AV summary and PPT sent,,,,,,/E/Data/Funnel/Submit/STURM FINANCIAL GROUP USAGE_51125.xlsx,Tiffany Debose,Tiffany Debose,3/24/2020 7:49,,,Teja Ravi,,,3/24/2020 0:00,8417.66,34.5,,6/22/2020 0:00,,,NA,,N,127,67,3/24-AV Summary and PPT sent,,SMBA,
$

Digvijay S
  • 2,665
  • 1
  • 9
  • 21
  • The combining the lines script was wokring fine, except the spaces were converted to special characters like this ​ for ex. for id 51126 "Gray Television, Inc.","​Gray Television, Inc.4370 Peachtree Rd, NE.Atlanta, Ga  30319​" Except for that, everything else works! Thank you Digvijay :) – Alekhya varma Apr 23 '20 at 06:22
1

Assuming that the newline within fields and at the end of each record is \n because if it was \n within the fields and \r\n at the end of each record as exported by MS-Excel then this would be trivial, the following uses GNU awk for various extensions (multi-char RS, RT, FPAT and \s).

This will combine the lines:

awk -v RS='"[^"]+"' -v ORS= '{
    gsub(/\n/,"",RT)
    print $0 RT
}'

and this will remove leading/trailing spaces and enclose each field in quotes:

awk -v FPAT='[^,]*|"[^"]+"' -v OFS=',' '{
    for (i=1;i<=NF;i++) {
        gsub(/^"?\s*|\s*"?$/,"",$i)
        printf "\"%s\"%s", $i, (i<NF ? OFS : ORS)
    }
}'

so you could just use them together in pipe:

$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n/,"",RT); print $0 RT}' file |
    awk -v FPAT='[^,]*|"[^"]+"' -v OFS=',' '{for (i=1;i<=NF;i++) {gsub(/^"?\s*|\s*"?$/,"",$i); printf "\"%s\"%s", $i, (i<NF ? OFS : ORS)} }'
"ID","OPPORTUNITY ID","CREATED_DATE","TIR NAME","MS Rep","SRC_SSR_REP","REGION","HP PBM NAME","COMPANY NAME","COMPANY ADDRESS","COMPANY CITY","COMPANY STATE","COMPANY ZIPCODE","COMPANY AMID","COMPANY USER CONTACT NAME","COMPANY USER TITLE","COMPANY USER PHONE","COMPANY USER EMAIL","PARTNER COMPANY NAME","PARTNER REP NAME","PARTNER REP EMAIL","PARTNER LID","WHOLESALER","PURCHASEDGE AC NUMBER","USAGE PERIOD","DEAL TYPE","CLWB WORKED ON","DEAL NUMBER","NAMED TERRITORY SLED","MONO HP SHARE %","COLOR HP SHARE %","TOTAL HP TONER SHARE %","DEAL VALUE MONO","DEAL VALUE COLOR","TOTAL TONER DEAL VALUE","EST DISCOUNT VALUE","REBATE TYPE MONO","REBATE TYPE COLOR","DISCOUNT TYPE","DEAL START DATE","DEAL END DATE","DEAL EXTENDED END DATE","DEAL POSITION","ECLIPSE ID","ECLIPSE DEAL STATUS","ECLIPSE APPROVED DATE","ECLIPSE DEAL APPROVED BY","LOST REASON","USAGE FILE LOCATION","CREARTED BY","MODIFIED BY","MODIFIED DATE","FINALISATION_RECEIVED_DATE","FINALISATION_WORKED_DATE","DEAL_PROCESSED_BY","DEAL_FINALISED_BY","FUNNEL_COMMENT","AV_SENT_DATE","PL_REMAN_VALUE","PL_REMAN_SHARE","FINALISATION_DOC_PATH","TIME ELAPSING ON","APPROVAL SENT DATE","APPROVAL RECEIVED DATE","SECONDARY_WHOLESALER","PREVIOUSECLIPSE_ID","PurchasEdge_(Y/N)","HP_TONER_UNITS","PL_REMAN_UNITS","FINALISATION_COMMENTS","RENEWAL_POSITION","PROGRAM_NAME","CUSTOMERONBOARDEDON"
"51128","OPP-048699","3/23/2020 21:02","Adam Dohm","Cheryl Glenn","Tiffany Debose","MARKET SOURCE","","Flathead Valley School District (Kalispell, Whitefish, Columbia Falls)","233 1st Ave E","Kalispell","MT","59901","","Joe Biangone","Purchasing","406-758-8392","biangonej@sd5.k12.mt.us","TONERPORT INCORPORATED","","","10293955","ESSENDANT","","12 months","Renewal","","CL091515474R4-A","SLED","97","100","98","21592","16781","38373","2452","Defend","Defend","Defend","4/15/2020 0:00","4/14/2021 0:00","4/14/2021 0:00","Won","42921984","","","","","/E/Data/Funnel/Submit/FLATHEAD VALLEY SCHOOL DISTRICT USAGE_51128.xlsx","Tiffany Debose","Tiffany Debose","3/26/2020 14:49","3/26/2020 0:00","","Bhavana P V","","","","613.97","1.6","","","","","NA","42085906","N","179","3","3/26 - Deal added on eclipse","","SMBA",""
"51126","OPP-048697","3/23/2020 19:52","Xavier Weems","","Tiffany Debose","EAST","Vladimir Jaksic","Gray Television, Inc.","Gray Television, Inc.4370 Peachtree Rd, NE.​Atlanta, Ga  30319","","GA","30319","DN042973875","Dottie Boudreau","Manager","404-266-8333","dottie@gray.tv","STAPLES, INC","","","10264576,10252948","NA","","12 months","New","","CL200351126","Commercial - Named","84","89","86","16143","7335","23478","3149","Defend","Defend","Defend","","","","AV summary and PPT sent","","","","","","/E/Data/Funnel/Submit/GRAY TELEVISION, INC USAGE_51126.xlsb","Tiffany Debose","Tiffany Debose","3/26/2020 8:55","","","Deepthi K","","","3/26/2020 0:00","3239.96","13.8","","6/24/2020 0:00","","","NA","","N","168","27","3/24/2020 - sent for specialist approval 3/26/2020 - aV sent","","MCBigDeal",""
"51125","OPP-048696","3/23/2020 18:01","Xavier Weems","","Tiffany Debose","WEST","Jenni HoGlin","STURM FINANCIAL GROUP","3033 East First Avenue","Denver","CO","80206","","","","","","STAPLES, INC","","","10264576,10252948","NA","","12 months","New","","CL200351125","Commercial - Non Named","42","87","65","10201","14198","24399","6369","Winback","Defend","Winback","","","","AV summary and PPT sent","","","","","","/E/Data/Funnel/Submit/STURM FINANCIAL GROUP USAGE_51125.xlsx","Tiffany Debose","Tiffany Debose","3/24/2020 7:49","","","Teja Ravi","","","3/24/2020 0:00","8417.66","34.5","","6/22/2020 0:00","","","NA","","N","127","67","3/24-AV Summary and PPT sent","","SMBA",""

Otherwise see What's the most robust way to efficiently parse CSV using awk? for how to do what you want with a single call to any awk.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • The combining the lines script was wokring fine, except the spaces were converted to special characters like this ​ for ex. for id 51126 "Gray Television, Inc.","​Gray Television, Inc.4370 Peachtree Rd, NE.Atlanta, Ga  30319​" – Alekhya varma Apr 23 '20 at 01:24
  • Except for that, everything else works! Thank you Morton :) – Alekhya varma Apr 23 '20 at 06:21
  • You're welcome. There's nothing in either script in my answer that adds "special characters". The first script removes newlines, the second removes `"` at the start/end of each field and then adds `"` alone at the start/end of each field. The sample input you posted already had "special "chartacters" in it - when I copy/paste it from your question I see `"<200b>Gray Television, Inc. 4370 Peachtree Rd, NE. <200b>Atlanta, Ga 30319 <200b> "`. – Ed Morton Apr 23 '20 at 11:16
0

This might work for you (GNU sed):

sed -E ':a;N;s/^([^"]*("[^"]*"[^"]*)*"[^"\n]*)\n/\1/;ta;P;D' file |
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^",]*),/\1\n/;ta;s/"//g;s/[^,]*/"&"/g;y/\n/,/'

The solution is in two parts:

  1. Remove any newlines between double quotes.
  2. Surround any comma delimited fields by double quotes

The first sed invocation appends the following lines (removes the intervening newlines) until a line has a balanced set of double quotes. The first of these lines in printed and remainder processed along with the following line until all lines have been printed.

The second invocation, replaces any commas within double quotes with newlines, any double quotes are removed and all non-comma fields surrounded by double quotes. The newlines are then replaced by commas.

potong
  • 55,640
  • 6
  • 51
  • 83