3

I'm trying to convert many text files to xls files. The style of the txt files is as follows:

"Name";"Login";"Role"
"Max Muster";"Bla102";"user"
"Heidi Held";"Held100";"admin"

I tried to work with this bash script:

for file in *.txt; do
      tr ";" "," < "$file" | paste -d, <(seq 1 $(wc < "$file")) - > "${file%.*}.xls"
      soffice --headless --convert-to xls:"MS Excel 95" filename.xls "${file%.*}.xls"
done

With this, I lose the header row and I also get a column full of Chinese characters, but the rest looks okay:

攀挀琀 |  Max Muster |  Bla102   |  user
氀愀猀 |  Heidi Held |  Held100  |  admin

How can I get rid of these Chinese characters and keep the header row?

Dominique
  • 16,450
  • 15
  • 56
  • 112
susliks2
  • 69
  • 7
  • If `soffice` discards the first line, why not simply synthesize a bogus first line for it to discard? – tripleee Nov 20 '18 at 09:54
  • The odd `seq` command looks like you are reinventing `nl` but on the whole, I guess `awk -F ';' 'BEGIN { OFS="," } { print FNR, $0 }' "$file"` would solve the problem more elegantly and efficiently (this also replaces the `tr`). – tripleee Nov 20 '18 at 09:55
  • It is unclear on what grounds exactly you want the Chinese to be discarded. Should we just discard the first column wholesale? Or replace non-ASCII text with nothing? Please clarify (you should edit your question to outline in more detail what you want). – tripleee Nov 20 '18 at 09:57
  • Actually I do not totally understand your `awk` command. It reads the input file and replaces the `;` with `,`. But can I produce an output file? Like this: `for file in *.txt; do awk -F ';' 'BEGIN { OFS="," } { print FNR, $0 }' "$file" > "${file%.*}.xls" done` – susliks2 Nov 20 '18 at 10:34
  • You lack a semicolon before `done` but yeah, that's the idea. – tripleee Nov 20 '18 at 10:40
  • Your sample data doesn't contain any Chinese in the input; do you mean the Chinese strings appear out of thin air? Is this something `soffice` does for some reason? – tripleee Nov 20 '18 at 10:40
  • You could add `FNR==1 { for (i=1; i<=NF; i++) printf "bogus%s" (i==NF ? "\n" : OFS) }` to synthesize a first line for `soffice` to discard. – tripleee Nov 20 '18 at 10:43
  • Yes exactly, I have no idea where these Chinese letters are from. First I used another script, but with that I always got a spare line between every row. Then I changed it and I got this Chinese column. I think `soffice` is the LibreOffice command that produces the xls file. – susliks2 Nov 20 '18 at 10:57
  • Are there invisible characters in your input file? I'm guessing something gets interpreted as UTF-16 which seems consistent with speculations e.g. here: https://forum.openoffice.org/en/forum/viewtopic.php?f=9&t=69798 – tripleee Nov 20 '18 at 11:00
  • Or maybe you could add a UTF-8 BOM to the beginning of the very first line and see if that helps anything. Try `FNR==1 { printf "\357\273\277" }` (possibly with some locale tweaks, I'm guessing maybe `LC_ALL=C`?) – tripleee Nov 20 '18 at 11:04
  • Yes, when I change Unicode UTF-16 to Unicode (UTF-8) the Chinese letters disappear - but a needless empty line appears after every row. When I convert it manually (just open the txt file with LibreOffice and save it as .xls) everything is as it should be. – susliks2 Nov 20 '18 at 11:18
  • Does the input file have DOS carriage returns? https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings – tripleee Nov 20 '18 at 11:32
  • Yes, I think this is the case. – susliks2 Nov 20 '18 at 12:10

1 Answer

2

The question unfortunately does not provide enough detail to be sure exactly what the issues are, but we have identified at least the following in the comments.

  • Apparently, the input file contains DOS carriage returns.
  • Apparently, soffice attempted to read the file as UTF-16, which is what produced the essentially random Chinese characters. (The characters could be anything; it's just more probable that a random Unicode BMP character will be in a Chinese/Japanese block.)
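
As a quick sanity check (not part of the script below, and assuming the inputs really are ASCII/UTF-8 text with DOS line endings, as the comments suggest), you can confirm both points with the standard `file` and `tr` utilities:

file *.txt   # look for "with CRLF line terminators" or "UTF-16 Unicode text" in the output

for f in *.txt; do
    tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"   # strip DOS carriage returns in place
done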

With those observations and a refactoring of the existing script, try

for file in *.txt; do
    awk -F ';' 'BEGIN { OFS="," }
        FNR==1 {
            # Add a UTF-8 BOM
            printf "\357\273\277"
            # Generate a bogus header line for soffice to discard
            for (i=1; i<=NF; i++) printf "bogus%s", (i==NF ? "\n" : OFS)
        }
        {
            sub(/\015/, "")   # remove any DOS carriage return
            $1 = $1           # force Awk to rebuild the record with OFS, i.e. commas
            print FNR, $0
        }' "$file" > "${file%.*}.xls"
    soffice --headless --convert-to xls:"MS Excel 95" "${file%.*}.xls"
done

In so many words, the Awk script splits each input line on semicolons (`-F ';'`) and sets the output field separator `OFS` to a comma; the `$1 = $1` assignment forces Awk to rebuild the record with the new separator, so the semicolons actually become commas in the output. On the first input line, we prepend a BOM and emit a synthetic header line for soffice to discard, so that the real header row survives as a regular data row. The `sub` takes care of removing any DOS carriage return character, and the variable `FNR` is the current input line's line number.

I'm not sure if the BOM or the bogus header line are strictly necessary, or if perhaps you need to pass in some additional options to make soffice treat the input as proper UTF-8. Perhaps you also need to include LC_ALL=C somewhere in the pipeline.
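
If soffice keeps guessing the encoding wrong even with the BOM, one more thing you could try (untested here; the filter name and the option string are taken from LibreOffice's CSV import filter documentation, so verify them against your version: 44 = comma separator, 34 = double-quote text delimiter, 76 = UTF-8 character set) is to force the text import filter explicitly:

soffice --headless \
    --infilter='Text - txt - csv (StarCalc):44,34,76' \
    --convert-to xls:"MS Excel 95" "${file%.*}.xls"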

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • Thanks a lot for your suggestion. With this I get a runtime error: `awk: run time error: not enough arguments passed to printf("bogus%s") FILENAME="GRA_EPF_CC_RF.txt" FNR=1 NR=1` – susliks2 Nov 20 '18 at 14:54
  • Sorry, my bad; added the missing comma now. – tripleee Nov 20 '18 at 16:41
  • 1
    You could probably also say `RS="\r\n"` in the `BEGIN` block to avoid the `sub`, though this is arguably a micro-optimization. – tripleee Nov 20 '18 at 17:23
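
For reference, that `RS` variant might look like this as a drop-in replacement for the `awk` call inside the loop (a sketch only; it assumes GNU Awk or another implementation that accepts a multi-character `RS`):

awk -F ';' 'BEGIN { OFS=","; RS="\r\n" }
    FNR==1 {
        # UTF-8 BOM plus a bogus header line for soffice to discard
        printf "\357\273\277"
        for (i=1; i<=NF; i++) printf "bogus%s", (i==NF ? "\n" : OFS)
    }
    { $1 = $1; print FNR, $0 }' "$file" > "${file%.*}.xls"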