
A text file needs to be analyzed. It has many lines, each with 20 fields separated by TABs. Each field can contain any type of data (integer, floating point, date/time, text that may contain blanks and "", etc.), and fields can also be empty. For example, such a line would start like this:

111TABTABWalterTAB11.1234TABThis is a sample TextTAB"Another sample"TABTABTAB555...

How can I read each line of the file into an array arrLine having 20 columns, e.g.

  • arrLine(0)=111
  • arrLine(1)=empty
  • arrLine(2)=Walter
  • ...

I tried this proposal:

while IFS=$'\t' read -r -a arrLine; do
    echo "${#arrLine[@]} items: ${arrLine[@]}"
    echo "${arrLine[3]} || ${arrLine[4]} || ${arrLine[5]}"
done < "test.txt"

but empty columns are not stored in the array.

Thank you!

StOMicha

2 Answers


When IFS is set to a whitespace character (space, tab, or newline), read cannot return an empty field between two other fields; with a non-whitespace delimiter this works. This inconsistent behavior is dictated by POSIX:

The term " IFS white space" is used to mean any sequence (zero or more instances) of white-space characters that are in the IFS value (for example, if IFS contains <space>/<comma>/<tab>, any sequence of <space> and <tab> characters is considered IFS white space).

  • IFS white space shall be ignored at the beginning and end of the input.
  • Each occurrence in the input of an IFS character that is not IFS white space, along with any adjacent IFS white space, shall delimit a field, as described previously.
  • Non-zero-length IFS white space shall delimit a field.
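The difference between whitespace and non-whitespace IFS characters can be seen in a minimal sketch. The sample strings and the array names ws_fields and comma_fields are made up for illustration:

```shell
#!/usr/bin/env bash
# Tab is IFS white space: the two adjacent tabs act as ONE delimiter,
# so the empty field between them disappears.
IFS=$'\t' read -r -a ws_fields <<< $'111\t\tWalter'

# Comma is not IFS white space: every comma delimits a field,
# so the empty field between the two commas is kept.
IFS=',' read -r -a comma_fields <<< '111,,Walter'

echo "tab-separated:   ${#ws_fields[@]} fields"
echo "comma-separated: ${#comma_fields[@]} fields"
```

The tab-separated line should yield two fields and the comma-separated line three, with the middle one empty.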

You can still use mapfile to split each line into fields (the -d option requires bash 4.4 or later):

while IFS= read -r line; do
    mapfile -td $'\t' arrLine < <(printf %s "$line")
    declare -p arrLine # prints the array for debugging
done < test.txt
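One caveat with the mapfile approach: if a line ends in a tab, the final empty field is not stored, because mapfile treats the last tab as a terminator rather than a separator. A possible workaround, sketched here with a hypothetical helper named split_tabs and a made-up sample line, is to append one extra tab so that every field, including the last, is terminated:

```shell
#!/usr/bin/env bash
# Hypothetical helper: split one TAB-separated line into the global array arrLine.
# Appending a tab makes every field (even a trailing empty one) tab-terminated.
split_tabs() {
    mapfile -td $'\t' arrLine < <(printf '%s\t' "$1")
}

# Sample line with five fields; the second, fourth and fifth are empty.
split_tabs $'111\t\tWalter\t\t'
declare -p arrLine   # prints the array for debugging
echo "${#arrLine[@]} fields"
```

With the sample line above, the array should hold five elements, the last two of them empty.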

Alternatively, swap the whitespace delimiter (in this case tab) for a non-whitespace character that does not appear in the input. Here we use the ASCII "unit separator" character (\037).

while IFS=$'\037' read -ra arrLine; do
    declare -p arrLine # prints the array for debugging
done < <(tr \\t \\037 < test.txt)
Socowi
  • To visualize the content of array arrLine I recommend: `declare -p arrLine` – Cyrus Mar 04 '21 at 18:30
  • Thank you Socowi, looks perfect. @Cyrus: I don't understand your comment, where to add that code? – StOMicha Mar 04 '21 at 18:42
  • @StOMicha: To quickly get an overview of what is in array arrLine, you can remove both lines with `echo` and put there one `declare -p arrLine`. – Cyrus Mar 04 '21 at 18:55
  • Thanks Cyrus, perfect for debugging! – StOMicha Mar 04 '21 at 19:00
  • This sentence is valid only for *whitespace* `IFS` characters: `With IFS=... read there is no way to read an empty field between two other fields` – M. Nejat Aydin Mar 04 '21 at 19:29
  • @M.NejatAydin True, thank you for the note. This also gave me an idea for an alternative solution. – Socowi Mar 04 '21 at 21:21
  • @Socowi The alternative solution may not work if the line already contains a *unit separator* (`\037`) character (unusual, but not impossible). But there is a remedy for that: Replace the `tr \\t \\037` with `tr '\t\037' '\037\t'` and add this as a first line within the `while` loop: `arrLine=("${arrLine[@]//$'\t'/$'\037'}")` – M. Nejat Aydin Mar 04 '21 at 21:47
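The remedy from the last comment can be sketched as follows. The temporary file, its sample line (which deliberately contains a unit separator inside a field) and the array name result are assumptions made for this illustration:

```shell
#!/usr/bin/env bash
# Made-up sample: four fields, the second empty, the third containing \037.
input=$(mktemp)
printf '111\t\tWal\037ter\t555\n' > "$input"

# tr swaps tab and \037 in both directions, so a \037 that was already in
# the data temporarily becomes a tab and cannot be mistaken for a delimiter.
while IFS=$'\037' read -ra arrLine; do
    # Undo the swap inside each field: tabs turn back into \037.
    arrLine=("${arrLine[@]//$'\t'/$'\037'}")
    # Copy the fields, because read empties arrLine when it reaches end of file.
    result=("${arrLine[@]}")
done < <(tr '\t\037' '\037\t' < "$input")

rm -f "$input"
echo "${#result[@]} fields"
```

The sample line should split into four fields, with the original unit separator restored inside the third one.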

You can split an input line into an array using IFS, but because tab is a whitespace character, bash treats consecutive tabs as a single delimiter, so you lose the empty columns. You can sidestep that by translating the tabs to a different, non-whitespace delimiter.

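# Set IFS only for read so the rest of the script is unaffected;
# -r keeps backslashes in the data literal.
while IFS='|' read -r -a arrLine; do
  for i in {0..19}; do
    echo "arrLine [$i]: ${arrLine[$i]}"
  done
done < <(tr '\t' '|' < input.txt)  # assumes '|' never occurs in the data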

arrLine [0]: 111
arrLine [1]:
arrLine [2]: Walter
arrLine [3]: 11.1234
arrLine [4]: This is a sample Text
arrLine [5]: "Another sample"
arrLine [6]:
arrLine [7]:
arrLine [8]: 555...
arrLine [9]:
arrLine [10]:
etc.....
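Translating tabs to '|' only works if '|' never occurs in the data. One possible guard, sketched here under the assumption that the unit separator is an acceptable fallback, checks the input first. The temporary file, its contents and the names delim and fields are made up for illustration:

```shell
#!/usr/bin/env bash
# Made-up sample: four fields, one empty, the last containing a literal '|'.
input=$(mktemp)
printf '111\t\tWalter\ta|b\n' > "$input"

delim='|'
# If the preferred delimiter already appears in the data, fall back to the
# ASCII unit separator (assumed here to be absent from the input).
if grep -qF -- "$delim" "$input"; then
    delim=$'\037'
fi

while IFS=$delim read -r -a arrLine; do
    # Copy the fields, because read empties arrLine when it reaches end of file.
    fields=("${arrLine[@]}")
done < <(tr '\t' "$delim" < "$input")

rm -f "$input"
echo "${#fields[@]} fields, last: ${fields[3]}"
```

For this sample the fallback delimiter is chosen, so the last field keeps its '|' intact.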
dpippen