An ASP.NET 6 MVC controller extracts text from PDF files using iText 7 like this:

string text = PdfTextExtractor.GetTextFromPage(page, strategy);

and stores it in a text column in a PostgreSQL 12.2 database. The database encoding is UTF8.

For some PDF files, iText returns 0x00 characters in the string. The Npgsql data provider for EF Core is used to store the text. If the text contains a 0x00 character, saving it using

await ctx.Doks.AddAsync( new Dok() { Text = text } );
await ctx.SaveChangesAsync();

throws the exception

Microsoft.EntityFrameworkCore.DbUpdateException: An error occurred while saving the entity changes. See the inner exception for details. ---> Npgsql.PostgresException (0x80004005): 22021: invalid byte sequence for encoding "UTF8": 0x00 at Npgsql.Internal.NpgsqlConnector.g__ReadMessageLong|215_0(NpgsqlConnector connector, Boolean async,

How can this be fixed? There may be two possibilities:

  1. Force iText to return only legal Unicode text. I haven't found such an option.

  2. Replace illegal characters with a hex notation like \0x00 in C# code before storing. How can all illegal characters be found?
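
Option 2 could look roughly like the sketch below. It assumes that U+0000 is the only character that needs handling, since PostgreSQL text values cannot contain the NUL character; whether other characters (for example unpaired surrogates) also cause problems is not covered here. The class name `PgTextSanitizer` and its methods are hypothetical helpers, not part of iText, EF Core or Npgsql.

```csharp
using System;

// Hypothetical helper for illustration; not part of iText, EF Core or Npgsql.
public static class PgTextSanitizer
{
    // Replaces every U+0000 with the visible six-character literal \0x00.
    // Assumption: NUL is the only character the database rejects.
    public static string EscapeNul(string text) =>
        text.Replace("\0", "\\0x00", StringComparison.Ordinal);

    // Alternative: drops every U+0000 instead of escaping it.
    public static string RemoveNul(string text) =>
        text.Replace("\0", string.Empty, StringComparison.Ordinal);
}
```

The extracted text would then be passed through the helper before it is stored, for example `new Dok() { Text = PgTextSanitizer.EscapeNul(text) }`. An ordinal comparison is used so that the NUL character is matched literally rather than being ignored by culture-sensitive rules.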

Andrus