1

I am trying to pre-process UTF-8 CSV files, to create SQL script to create tables in a postgresql DB. Headers come from Excel and are human optimized:

Ex:

GKCode,GKCode Neu,Name,Markttyp,Jahr,Leerstandsrate Büro (MFG) (%),Spitzenmiete Büro City (Euro/qm),Durchschnittsmiete Büro City (Euro/qm),Nettoanfangsrendite Büro zentrale Lagen (%),Nettoanfangsrendite Büro dezentrale Lagen (%),Nettoabsorption (qm)

What I want to do:

  1. get the headers (first line

  2. remove BOM if any

  3. prepend doublequotes after and append doublequotes and ' varchar' before each delimiter

  4. prepend doublequotes at the beggining of stream

  5. append doublequotes and ' varchar' at the end of stream (last varchar replaced with '### bad append ###' for test purpose

What did I try so far:

  1. tried to use a file as intermediate

  2. tried to use echo and AWK

my code (best possible example so far):

csv_delimiter=","
head -1 raw_import |sed 's/^\xEF\xBB\xBF//'| sed -e "s|$csv_delimiter|\" varchar$csv_delimiter\"|g"| sed -e "s|$|\" ### bad append ###|g"

I receive unexpected result, with 'bad append' being in wrong place:

GKCode" varchar,"GKCode Neu" varchar,"Name" varchar,"Markttyp" varchar,"Jahr" varchar,"Leerstandsrate Büro (MFG) (%)" varchar,"Spitzenmiete Büro City (Euro/qm)" varchar,"Durchschnittsmiete Büro City (Euro/qm)" varchar,"Net" ### bad append ###o zentrale Lagen (%)" varchar,"Nettoanfangsrendite Büro dezentrale Lagen (%)" varchar,"Nettoabsorption (qm)

results are stable with simple headers (alphanumeric)

echo "1,wqd15,15,dq5w15,d5qw,4dq,51dqw5d,q51d,qw51d,qw51d,5q,wd,56" |sed '1s/^\xEF\xBB\xBF//'| sed -e "s|$csv_delimiter|\" varchar$csv_delimiter\"|g"| sed -e "s|$|\" varchar|g"
  • 1
    Do you have any `\r` `0x0d` characters in the input? `$csv_delimiter=","` assignments are without the `$`. It should work. Try to pipe the output of `head -1` into `hexdump -C `or `xxd -p` and inspect (and post?) the output. – KamilCuk Aug 02 '19 at 22:15
  • yes, I have 0x0d almost at the end, for some reason.... `head -1 raw_import | hexdump -C | grep 0d` `00000100 29 0d 0a |)..|` how can I proceed? – Mihail Gershkovich Aug 02 '19 at 22:37
  • 2
    You have dos like endings. So I think it works like this: the string "bad append" is inserted after `0x0d` and before `0x0a`. Either convert to unix endings `dos2unix`, or just match the 0x0d `s|\r$|....` – KamilCuk Aug 02 '19 at 22:43
  • already done so, somehow I did not think about it, that sed might really return the carret while working with stream. Thanks a lot! – Mihail Gershkovich Aug 02 '19 at 22:45

1 Answers1

1

Thanks to Kamil Cuk I found the error: the file came from windows environment, using ´´´ dos2unix raw_import ´´´

before proceeding, fixes the error!!!!!