I'm trying to extract multi-line text from a text file whose values are separated by delimiters and save it into a string or an array. Most of the values are extracted and saved to variables with awk, but the problem occurs when I need to extract the multi-line description of a specific product into a variable or an array.

The simplified input file syntax looks like this: ID;Name;value1;value2;DESCRIPTION;valueX;valueY;

I'm extracting the first values with awk -F ";" '{print $1}' and assigning them to variables for later manipulation, which works fine. The problem occurs at the "DESCRIPTION" part, since it spans multiple lines and contains HTML tags. An example of what the DESCRIPTION looks like:

value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>


<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY

Can you suggest a way of doing this so that I can assign the DESCRIPTION to a variable or an array within the bash script and manipulate it further?

Ronald
  • bash cannot parse the complex structure of markup languages like HTML – Léa Gris Jan 28 '21 at 16:29
  • What is your expectation in terms of separating the record into fields when your delimiter is part of DESCRIPTION? Specifically, note that `""text-align: center;"">` in your example description contains your delimiter `;`. – Nikolaos Chatzis Jan 28 '21 at 16:35
  • It can be anything else that can be implemented in a script, either perl or python. As for the second point, that's also part of the problem since, as you noticed, ";" is sometimes included within HTML tags. As a last resort I could remove all the tags before reading the multi-line description text. If there is no other option, we can assume there are no HTML tags within the text. But then obviously I would just remove the line breaks and treat it as a single column with AWK. But then again I would lose all the layout. – Ronald Jan 28 '21 at 16:48
  • Looks like a csv file because the multiline field is quoted and quotes in that field are escaped by doubling (e.g. `"a""b"` means a field with content `a"b`.) Therefore, the question is about extracting a field from a csv file and should be answered here: [Bash: Parse CSV with quotes, commas and newlines](https://stackoverflow.com/questions/36287982/bash-parse-csv-with-quotes-commas-and-newlines). Alternatively, consider installing a specialized tool like [`csvcut`](https://csvkit.readthedocs.io/en/latest/scripts/csvcut.html) or even [`mlr`](https://github.com/johnkerl/miller). – Socowi Jan 28 '21 at 17:51
  • That's correct, it's a csv file. Thanks for pointing to the mentioned direction. I will look into it – Ronald Jan 28 '21 at 19:32

1 Answer

You (originally) asked for an awk-based solution. As others mentioned in the comments, there are better tools for the job. That said, based on the GNU Awk manual sections 4.9 Multiple-Line Records and 4.7 Defining Fields by Content, you can try something like:

$ awk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
[...]
$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
  1. RS = ";\n" assumes that your input file has multiple ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records and that each record ends with a ; (this is the ; after valueY in your example) followed by a newline. Note that a line inside DESCRIPTION that happens to end with ; would also be treated as a record boundary.
  2. FPAT = "([^;]+)|(\"<p.+p>\")" is a "best-effort" approach to tell (g)awk what the fields of your records look like. You may need to modify it according to your needs. What it actually says is that there are two field formats (see (...)|(...)). The first field format captures strings that do not contain ; and is used to capture all the fields except DESCRIPTION. The second field format captures strings that start with "<p and end with p>".
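
If it helps to see the same two-step idea outside awk, here is a minimal Python sketch of it: split the text into records on ";" plus newline, then pick out fields by content. This is an illustration under assumptions, not a translation of gawk's matching rules. Python's re tries alternatives from left to right, so the quoted pattern is listed first; the names FIELD, split_records and sample are made up for the example.

```python
import re

# Content-based field pattern: a quoted block running from "<p to p>" is tried
# first, then any run of characters without ';'. re.DOTALL lets '.' cross
# line breaks, which the multi-line DESCRIPTION needs.
FIELD = re.compile(r'"<p.+p>"|[^;]+', re.DOTALL)

def split_records(text):
    """Split text into records on ';' + newline, then into fields by content."""
    records = [r for r in text.split(";\n") if r.strip()]
    return [FIELD.findall(r) for r in records]

# Hypothetical, shortened record in the same layout as the question's file.
sample = (
    'ID;Name;value1;value2;"<p>first</p>\n'
    '<p style=""text-align: center;"">\n'
    'second</p>";valueX;valueY;\n'
)

fields = split_records(sample)[0]
print(len(fields))   # number of fields found in the first record
print(fields[4])     # the DESCRIPTION, still wrapped in its quotes
```

As with the awk version, the description keeps its surrounding quotes and the doubled "" inside it.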

Against a file with two ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records:

$ cat testfile 
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>


<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
  

<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;

$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
NF =  7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>


<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY
NF =  7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
  

<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY
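
Since the comments established that the input is really a semicolon-delimited CSV file, and Python was mentioned as acceptable, a CSV parser may be the more robust route. The following is only a sketch using Python's standard csv module. It assumes DESCRIPTION is always the fifth field and that quotes inside it are escaped by doubling, as in your example; SAMPLE and read_descriptions are hypothetical names introduced for the illustration.

```python
import csv
import io

# Hypothetical, shortened sample in the question's layout: seven fields
# separated by ';', a quoted multi-line DESCRIPTION, and a trailing ';'.
SAMPLE = (
    'ID;Name;value1;value2;"<p>Lorem ipsum</p>\n'
    '\n'
    '<p style=""text-align: center;"">\n'
    'dolor sit amet</p>";valueX;valueY;\n'
)

def read_descriptions(stream):
    """Return a list of (ID, DESCRIPTION) pairs from a ';'-delimited CSV stream."""
    reader = csv.reader(stream, delimiter=";", quotechar='"')
    pairs = []
    for row in reader:
        if not row:
            continue  # skip completely empty lines between records
        # row[4] is DESCRIPTION (assumed position); the trailing ';' on each
        # record only adds an empty last field, which is ignored here.
        pairs.append((row[0], row[4]))
    return pairs

if __name__ == "__main__":
    for record_id, description in read_descriptions(io.StringIO(SAMPLE)):
        print(record_id)
        print(description)
```

Unlike the FPAT approach, the parser strips the surrounding quotes and turns each doubled "" back into a single ", and a ; inside the quoted description is not treated as a delimiter. For a real file, open it with open(path, newline="") and pass the file object in place of the io.StringIO.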
Nikolaos Chatzis