
Our company processes invoice data from various markets and in multiple languages. Typically the data is delivered in .txt or .dat files. The format or layout of the invoice data in these files may be common between many markets, e.g. the placement of an Invoice Number in a file from Brazil will be the same as in a file from Russia or a file from the United Kingdom. The encoding of the source files can vary: a file from the UK may be encoded as ASCII, a file from Brazil as ANSI, and a file from Russia as UTF-8. This is not set in stone. Our target database is configured as UTF-8.

As the data layout of every file is fundamentally identical, we would like to, if possible, process all files through a single Informatica workflow and, where needed, convert the file encoding at runtime.

I'm not a Java developer, but it occurred to me that a jar could perhaps be called from a Command Task to check a file's encoding and run a conversion if required.

Or should I be looking at another type of solution?

Thomas Dickey
  • Is it possible to convert everything to UTF-8 irrespective of their encoding? – Koushik Roy Feb 22 '21 at 04:59
  • Do you know in advance the encoding for each file and is that encoding consistent e.g. UK is always ASCII - or are you requiring a process that will automatically detect the encoding? – NickW Feb 22 '21 at 09:56
  • @NickW No, it's possible that a UK file may be encoded ANSI or UTF-8. It may even be that the file is marked as ASCII encoded but contains occasional UNICODE characters! A problem for a later date perhaps... – ejamesmord Feb 23 '21 at 02:24
  • @KoushikRoy In most cases that's possible manually using a text editor (e.g. Notepad++) but not as yet programmatically. Plus if you see my reply to NickW there are some exceptions. – ejamesmord Feb 23 '21 at 02:32
  • The idea is "convert all files to UTF-8", which is the broadest character set and can handle pretty much all others. So, you can use PowerShell (https://superuser.com/questions/1163753/converting-text-file-to-utf-8-on-windows-command-prompt) or UNIX/Linux (https://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets). You don't have to worry about the input character set – it can be anything. – Koushik Roy Feb 23 '21 at 04:37
  • @KoushikRoy That's a great suggestion, thank you! As each file arrives we can check the `file --mime-encoding` property for the current charset and convert to UTF-8 using iconv if required. Very much appreciated! – ejamesmord Feb 23 '21 at 23:03
  • I will mention this as answer. thank you too :) – Koushik Roy Feb 24 '21 at 03:31

1 Answer


The idea is "convert all non-UTF files to UTF-8", which is the broadest character set and can handle pretty much all others. So, follow the steps below:

  1. use `file --mime-encoding inp_file` to check the encoding
  2. use PowerShell (link - superuser.com/questions/1163753/…) or a UNIX/Linux shell (link - stackoverflow.com/questions/64860/…) to convert
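The two steps can be sketched as a small shell script, suitable for an Informatica Command Task. The file names here are examples only; on a real landing directory you would loop over the incoming files. ASCII files are copied as-is, since ASCII is a subset of UTF-8:

```shell
#!/bin/sh
# Create a sample ISO-8859-1 (ANSI/Latin-1) invoice file to demonstrate.
printf 'Fatura n\xba 1234\n' > invoice_br.dat

# Step 1: detect the current character set of the incoming file.
charset=$(file --mime-encoding -b invoice_br.dat)
echo "Detected encoding: $charset"

# Step 2: convert to UTF-8 only when the file is not already UTF-8/ASCII.
case "$charset" in
  utf-8|us-ascii)
    cp invoice_br.dat invoice_utf8.dat
    ;;
  *)
    iconv -f "$charset" -t UTF-8 invoice_br.dat > invoice_utf8.dat
    ;;
esac

# Verify: the output file should now report as utf-8.
file --mime-encoding -b invoice_utf8.dat
```

Note that `iconv` needs the source charset spelled the way your platform's iconv understands it; the names reported by `file` (e.g. `iso-8859-1`) are generally accepted, but it is worth testing against your actual market files.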
Koushik Roy