0

I iterate through a collection of CSV files in bash, running:

iconv --from-code=ISO-8859-1 --to-code=UTF-8 ${FILE} | \
    sed -e 's/\"//g' | \
    sed -e 's/, /,/g' \
    > ${FILE}.utf8

Running iconv to fix UTF-8 characters, then the first sed call removes the double quote characters, and the final sed call is supposed to remove leading and trailing whitespace around the commas.

HOWEVER, I still have a line like this in the saved file:

FALSE,,,, 2.40,,

The COPY command in postgres is kind of dumb, so it thinks " 2.40" is not valid syntax for a numeric value.

Where am I going wrong w/ my processing of the CSV file? Thanks!

Wells
  • 10,415
  • 14
  • 55
  • 85

4 Answers4

2

Probably what is happening is that you have more than one space or more than one field that has a space, and so you are seeing the results of a single successful trailing-blank substitution on a line that had more than one.

BTW, you can give multiple -e arguments to sed. Try this:

... | sed -e 's/"//g' -e 's/ *, */,/g'
DigitalRoss
  • 143,651
  • 25
  • 248
  • 329
0

Your second sed only removes trailing spaces (in fact, only a single trailing space). What here is supposed to remove leading spaces?

geekosaur
  • 59,309
  • 11
  • 123
  • 114
  • Forgetting leading spaces.. its not removing a single trailing space, either. – Wells Apr 14 '12 at 23:25
  • You don't know that; if it had *two* trailing spaces, you get one on output. That was what I meant by the parenthetical. – geekosaur Apr 14 '12 at 23:33
0

Do a construct one of this to remove spaces:

sed -e ':a; s/, /,/g; ta'
sed -e 's/,[ ][ ]*/,/g'

The first one will recursively perform updates till it finds no matching combinations. Second one will search for space followed by a possible group of spaces.

And it is not necessary to remove ", as by default PostgreSQL understands them when using COPY ... WITH (FORMAT CSV)

vyegorov
  • 21,787
  • 7
  • 59
  • 73
0

A general alternative to preparing the CSV to fit the data-types exactly is to import all data as text (into a temporary table with text columns) and let the PostgreSQL type-coercion mechanisms do some of the work automatically.

In particular, leading and trailing white space is trimmed automatically when a text is cast to a numeric type or date / time type. Try:

SELECT '   234 '::text::int;
SELECT '    23.4 '::text::float8; -- incl. a leading tab
SELECT '    2012-12-1    '::text::date; -- incl. a leading & trailing tab

So if you have a table like:

CREATE TABLE foo
( id    integer,
  col_a date,
  col_b double precision
);

You can:

CREATE TEMP TABLE foo_tmp  -- dropped automatically at end of session
( id    text,
  col_a text,
  col_b text
);
COPY foo_tmp FROM '/path/to/foo_file.csv';
INSERT INTO foo
SELECT id::int
      ,col_a::date
      ,col_b::double precision
FROM   foo_tmp;

Or use any other Postgres function to prepare the string as needed.

All you need is a valid CSV format.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228