5

I have a Haskell application that, as one of many steps, needs to store and retrieve raw binary blob data in a database. I'm not completely above deciding to, instead, store that data in plain disk files, but that does start leading to an additional round of permissions issues, so right now I want to go with the database.

I've created a table with a column of type bytea.

I have a Lazy Bytestring in memory.

When I make a call like this

run conn "INSERT INTO documents VALUES (?)" [toSql $ rawData mydoc]

postgres gets a bit angry at the data. The exact error message is

invalid byte sequence for encoding \"UTF8\": 0xcf72

I also know beyond doubt that I have NUL values in the data stream. So, with all of that in mind, what is the correct way to encode the data safely for insertion?


Updated

Here is the description for my table

db=> \d+ documents
                          Table "public.documents"
     Column      |            Type             | Modifiers | Storage  | Description 
-----------------+-----------------------------+-----------+----------+-------------
 id              | character varying(16)       | not null  | extended | 
 importtime      | timestamp without time zone | not null  | plain    | 
 filename        | character varying(255)      | not null  | extended | 
 data            | bytea                       | not null  | extended | 
 recordcount     | integer                     | not null  | plain    | 
 parsesuccessful | boolean                     | not null  | plain    | 
Indexes:
    "documents_pkey" PRIMARY KEY, btree (id)

This is the full text of a module that demonstrates the current problem I'm having after adding jamsdidh's code. My error message has changed from the encoding problem above to "invalid input syntax for type bytea".

module DBMTest where

import qualified Data.Time.Clock as Clock
import Database.HDBC.PostgreSQL
import Database.HDBC
import Data.ByteString.Internal
import Data.ByteString hiding (map)
import Data.Char
import Data.Word8
import Numeric

exampleData = pack ([0..65536] :: [Word8]) :: ByteString

safeEncode :: ByteString -> ByteString
safeEncode x = pack (convert' =<< unpack x)
    where
    convert' :: Word8 -> [Word8]
    convert' 92 = [92, 92]
    convert' x | x >= 32 && x < 128 = [x]
    convert' x = 92:map c2w (showIntAtBase 8 intToDigit x "")

runTest = do
    conn <- connectPostgreSQL "dbname=db"
    t <- Clock.getCurrentTime
    withTransaction conn
        (\conn -> run conn
            "INSERT INTO documents (id, importTime, filename, data, recordCount, parseSuccessful) VALUES (?, ?, ?, ?, ?, ?)"
            [toSql (15 :: Int),
             toSql t,
             toSql ("Demonstration data" :: String),
             toSql $ safeEncode exampleData,
             toSql (15 :: Int),
             toSql (True :: Bool)])
Savanni D'Gerinel
  • 2,379
  • 17
  • 27
  • I've never used HDBC, but I have had success with postgresql-simple? (I haven't tried it for bytea data though.) – Tom Ellis Dec 10 '13 at 23:09
  • So far, no. I have quite a lot of code already and already have a handle on this API. I cannot afford the cost of switching right now. – Savanni D'Gerinel Dec 10 '13 at 23:41
  • Turn query-logging on at the PostgreSQL end and see how the data is being escaped. It should be octal (older format) or hex (newer format). http://www.postgresql.org/docs/current/static/datatype-binary.html#AEN5318 – Richard Huxton Dec 11 '13 at 17:16
  • Are you using HDBC-postgresql or HDBC-odbc? I'd recommend you try HDBC-odbc to play with postgresql, since the bindings are more developed (and faster). – Nicolas Wu Dec 15 '13 at 19:01
  • I believe this is an issue in postgresql-simple as well. At least I'm getting the same error with what I believe is essentially the same code. – Justin Blank Apr 24 '16 at 14:51

1 Answers1

2

I beleive this is a bug in HDBC-postgresql. I can explain why I think this is so, and can give you a workaround that I put together and tested.


I would expect HDBC-postgresql to convert a bytestring to the appropriate format to be inserted, but you can quickly verify that it is instead expecting the bytestring to hold the octal-backspace-escaped values of the data. For instance,

run conn "INSERT INTO documents VALUES (?)" [toSql $ B.pack [92, 0x31, 0x30, 0x31]]

inserts the single character 'A' into the database! This only makes sense when you realize that [92, 0x31, 0x30, 0x31] is the ascii representation of "\101", and "\101" is the octal representation of 'A'. Because the octal-backspace-escaped strings guarantee allowing values in the range 32-127 to be passed directly through (see the link Richard Huxton supplied in the comments for the details), the insert query does in fact work properly for standard english text, and could have gone unnoticed....

run conn "INSERT INTO documents VALUES (?)" [toSql $ B.pack [65]]

also inserts 'A'. Values higher than 127 aren't guaranteed to work, and are interpreted based on the character encoding used. If you look at the HDBC-postgresql code, or the logs of a query, you can see that it is setting the variable 'client_encoding' to utf8. The data coming in from the bytestring is therefore expected to be valid utf8, and complains when it sees a sequence that could not exist as a utf8 character.

The proper fix would be to wait for the bug to be fixed by the HDBC-postgresql guys, but in the meantime, you can use this code as a workaround....

import Data.ByteString.Internal
import Data.Char
import Data.Word8
import Numeric
import Text.Printf

convert::B.ByteString->B.ByteString
convert x = B.pack (convert' =<< B.unpack x)
          where
            convert'::Word8->[Word8]
            convert' 92 = [92, 92]
            convert' x | x >= 32 && x < 128 = [x]
            convert' x = 92:map c2w (printf "%03o" x) 

Now you can just use

run conn "INSERT INTO documents VALUES (?)" [toSql $ convert $ rawData mydoc]
jamshidh
  • 12,002
  • 17
  • 31
  • I've submitted a bug report to hdbc-postgresql, if I am wrong about this being a bug, they will set me straight soon! See https://github.com/hdbc/hdbc-postgresql/issues/33 – jamshidh Dec 15 '13 at 19:56
  • I've implemented this with the difference that my ByteString is a Lazy bytestring. Now I just get postgres telling me that I'm giving an invalid input syntax for type bytea. There is no additional information in the query log, even though I have logging set to 'all'. Although I also have logging set to syslog, so maybe it sends less data in that case. – Savanni D'Gerinel Dec 17 '13 at 17:08
  • @SavanniD'Gerinel- Can you get it work with strict bytestrings? It will be easier to debug if I know this.... – jamshidh Dec 17 '13 at 17:21
  • No, I can't. I added a big swath of example code that illustrates what I am doing. If you like, I can also add library versions. – Savanni D'Gerinel Dec 17 '13 at 17:44
  • @SavanniD'Gerinel- I had a bug in my code, and fixed it.... `showIntAtBase 8 intToDigit x ""` doesn't add leading zeros, so bytes lower than 32 were broken (remember, 32-127 was just passed through anyway). Also, in your example you used `[1..65536]::[Word8]`. This doesn't do what you think it will. Remember, `Word8` only goes from 0-255, and strangly Haskell just truncates conversions from numbers greater than 255 rather than complain, so `([0..65536]::[Word8]) = ([0..0]::[Word8]) = [0]`. If you want to test the full range of chars, use `[0..255]::[Word8]` – jamshidh Dec 17 '13 at 19:22
  • I have the code appearing to work now (though I don't have my my retrieval operation working to verify it). I modified your code slightly to have a minimal set of modules and explicit namespaces, since I had a lot of trouble finding c2w. – Savanni D'Gerinel Dec 29 '13 at 18:45