1

I need to extract a PDF file stored in a NTEXT column (Collation latin1_general_ci_as) of a SQL Server database table (SQL Server 2014).

The problem is, no matter how I read this column, I always get a corrupted PDF file. My guess is that I'm not reading it with the correct charset, since the pdf content seems ok when reading it in SQL Server Studio or DBVisualizer.

I'm using the JTDS driver and the standard JDBC API to access the database. To read the column, I've tried the following:

 PreparedStatement stmt = conn.prepareStatement("select my_ntext_column from my_table where id = 1");
 ResultSet rs = stmt.executeQuery();
 if (rs.next()) {
   byte[] data = IOUtils.toByArray(rs.getAsciiStream());
   FileUtils.writeByteArrayToFile(new File("C:\\test.pdf"), data);
 } 

With this code, almost the entire file content is replaced by question marks.

 PreparedStatement stmt = conn.prepareStatement("select my_ntext_column from my_table where id = 1");
 ResultSet rs = stmt.executeQuery();
 if (rs.next()) {
   byte[] data = IOUtils.toByArray(new InputStreamReader(rs.getBinaryStream(), "WINDOWS-CP1252"));
   FileUtils.writeByteArrayToFile(new File("C:\\test.pdf"), data);
 } 

With the above code, a few chars are replaced by question marks, although, adobe reader and other pdf visualizers can't open it. I've also tried other charsets (UTF8, UTF16, ISO-8859-1), but no luck.

 PreparedStatement stmt = conn.prepareStatement("select convert(varbinary(max),convert(nvarchar(max),my_ntext_column)) from my_table where id = 1");
 ResultSet rs = stmt.executeQuery();
 if (rs.next()) {
   byte[] data = IOUtils.toByArray(rs.getBinaryStream());
   FileUtils.writeByteArrayToFile(new File("C:\\test.pdf"), data);
 }

Reading the content as a binary also doesn't help. The result file can't be read by any pdf visualizer.

Edit:

These are the first 8 bytes of the NTEXT column:

 %PDF-1.3
luizmineo
  • 956
  • 7
  • 7
  • Have your application dump the first 8 characters of the NTEXT data to the console using `System.out.println(rs.getString(1).substring(0, 8));` and [edit] your question to show us what it looks like. – Gord Thompson Mar 06 '15 at 08:39
  • Thanks! I've just updated my question. – luizmineo Mar 06 '15 at 11:08
  • Do you have any information on the application that inserted those files into the database? Specifically, do you know *for certain* that the application that inserted the files can successfully extract them? PDF files are not plain text so storing them in an NTEXT column can lead to unfortunate (and perhaps irreversible) character conversions if the file is not encoded first (Base64 or similar). A recent similar question [here](http://stackoverflow.com/q/28233750/2144390) may be of interest. – Gord Thompson Mar 06 '15 at 15:50
  • You are right. It turned out that pdf files which contains embedded images are corrupted in the database, and there's nothing we can do about it. Thanks very much for your help! – luizmineo Mar 08 '15 at 14:00

0 Answers0