0

I need to process a text file, a big CSV, to correct its formatting. The CSV has a field that contains XML data formatted to be human readable: broken up into multiple lines and indented with spaces. I need every record on one line, so I use awk to join the lines, then sed to get rid of the extra spaces between XML tags, and finally tr to eliminate the unwanted "\r" characters. (The first field of each record is always 8 digits and the field separator is the pipe character, "|".)

The awk script (join4.awk) is:

BEGIN {
  # initialise "line" variable. Maybe unnecessary
  line=""
}

{
  # check if this line is a beginning of a new record
  if ( $0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]|" ) {
    # if it is a new record, then print stuff already collected
    # then update line variable with $0
    print line
    line = $0
  } else {
    # if it is not, then just attach $0 to the line
    line = line $0
  }
}

END {
  # print out the last record kept in line variable
  if (line) print line
}

and the command line is:

cat inputdata.csv | awk -f join4.awk | tr -d "\r" | sed 's/>  *</></g'   > corrected_data.csv

My question is whether there is an efficient way to implement the tr and sed functionality inside the awk script. This is not Linux, so I have no gawk, just simple old awk and nawk.

thanks,

--Trifo

Trifo
  • In your regexp comparison `$0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]|"` - 1) idk if you meant the `|` at the end to be literal or "or null" but a `|` at the start or end of a regexp is undefined behavior so don't do that. 2) The regexp delimiter character is `/`, not `"`, 3) `$0 ~ /foo/` can be written as just `/foo/`, 4) `[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]` can be written as just `[0-9]{8}` in any modern awk (if your awk doesn't support regexp intervals, get a newer awk). – Ed Morton Jul 04 '22 at 13:13
  • How do you define *efficient*? What requirements does solution meet to be accepted as such? – Daweo Jul 04 '22 at 13:16
  • Hang on - do you **literally** mean "old awk" (i.e. /usr/bin/awk on Solaris) when you say `I have no gawk, just simple old awk and nawk`? Never use that old, broken awk, and on Solaris you also have /usr/xpg4/bin/awk (or xpg6) which is closer to POSIX compliance than the very unfortunately named (as it's now ancient) "new awk", nawk, (e.g. nawk doesn't support POSIX character classes, idk about regexp intervals) so if you're on Solaris use the awk from xpg4 or xpg6 bin, not nawk, and certainly not the default awk. Rather than saying `this is not Linux`, it'd be more useful to tell us what it is. – Ed Morton Jul 04 '22 at 13:26
  • I see some other issues in your code. If you post a new question that includes a [mcve] with concise, testable sample input and expected output then we can help you. Don't change this question, just accept an answer to the question you asked here then ask a new question. – Ed Morton Jul 04 '22 at 13:29
  • What is output when you do `nawk --version`? – Daweo Jul 04 '22 at 14:00
  • @EdMorton you are right about the pipe character. There is a backslash missing. And you are also kind of right about the Solaris stuff. It is not Solaris, but something old and obscure. I have reasons to stick to the old - really old - awk. So I have to find some ways around its limitations. – Trifo Jul 05 '22 at 06:56
  • @Daweo by efficient I mean running at least as fast as the chained commands. It would just ease my eyes not to invoke two more commands. Maybe embedding these commands would result in even more memory consumption, or something similar. But for now I only consider runtime as efficiency. – Trifo Jul 05 '22 at 06:56

3 Answers

2
tr -d "\r"

Is just gsub(/\r/, "").

 sed 's/>  *</></g'

That's just gsub(/> *</, "><")
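
Put together with the joining logic from the question, this could look roughly like the sketch below. It assumes an awk that has `gsub()` (nawk and any POSIX awk do; the asker's awk reportedly does not). The function name `join_records` and the sample input are made up for illustration, and `[|]` is used so the pipe is matched literally.

```shell
# join_records: hypothetical wrapper name; reads the CSV on stdin.
join_records() {
  awk '
    # Drop carriage returns from every input line.
    { gsub(/\r/, "") }

    # A line starting with 8 digits and a literal pipe begins a new record:
    # squeeze the spaces between tags in the finished record and print it.
    /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][|]/ {
      if (line != "") { gsub(/>  *</, "><", line); print line }
      line = ""
    }

    # Append the current line to the record being collected.
    { line = line $0 }

    # Flush the last record.
    END { if (line != "") { gsub(/>  *</, "><", line); print line } }
  '
}

# Invented sample: the first record spans three CRLF-terminated lines.
printf '12345678|<a>\r\n   <b>x</b>\r\n</a>\r\n87654321|<c/>\r\n' | join_records
```

Unlike the original script, this sketch skips the `print` while nothing has been collected yet, so no empty line is written before the first record.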

KamilCuk
  • I have no gsub in this version of awk, and installing another awk is not an option. I tried to get around this with loops like these: `while ( line ~ ">  *<") { sub( />  *</,"><",line) }` to remove the extra spaces between XML tags (so that `<text text>      <tag tag>` looks like `<text text><tag tag>`), and `while ( line ~ "\r") { sub( /\r/,"",line) }` to remove the extra \r characters the same way. – Trifo Jul 05 '22 at 07:28
  • Even nawk has gsub() and you said you had that available. Don't use old, broken awk. – Ed Morton Jul 05 '22 at 11:26
0
mawk NF=NF RS='\r?\n' FS='> *<' OFS='><' 
RARE Kpop Manifesto
0

Thank you all folks!

You gave me the inspiration to get to a solution. It is like this:

BEGIN {
  # initialize "line" variable. Maybe unnecessary.
  line=""
}

{
  # if the line begins with 8 numbers and a pipe char (the format of the first record)...
  if ( $0 ~ "^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|" ) {

    # ... then the previous record is ready. We can post-process it, then print it out

    # workarounds for the missing gsub function:
    # removing extra \r characters
    while ( line ~ "\r") { sub( /\r/,"",line) }
    # removing extra spaces between xml tags the same way:
    # "<text text>      <tag tag>" should look like "<text text><tag tag>"
    while ( line ~ ">  *<") { sub( />  *</,"><",line) }

    # then print the record and update line var with the beginning of the new record
    print line
    line = $0
  } else {

    # just keep extending the record with the actual line
    line = line $0
  }
}

END {
  # print the last record kept in line var
  if (line) {
    while ( line ~ "\r") { sub( /\r/,"",line) }
    while ( line ~ ">  *<") { sub( />  *</,"><",line) }
    print line
  }
}

And yes, it is efficient: the embedded version runs about 33% faster.

And yes, it would be nicer to create a function for the post-processing of the record in the "line" variable. Now I have to write the same code twice to process the last record in the END section. But it works, it creates the same output as the chained commands, and it is way faster.
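
For what it's worth, a function-based variant might look like the sketch below. It assumes the awk in use supports user-defined functions (nawk does; the very oldest awks do not, so this may not run on my system as-is). The names `join_and_clean` and `clean` are made up for illustration, and only `sub()` is used, so `gsub()` is not needed.

```shell
# join_and_clean: hypothetical wrapper name; reads the CSV on stdin.
join_and_clean() {
  awk '
    # clean(s): hypothetical helper that removes CRs and the spaces
    # between tags, calling sub() repeatedly until nothing is left to replace.
    function clean(s) {
      while (sub(/\r/, "", s)) ;
      while (sub(/>  *</, "><", s)) ;
      return s
    }

    # 8 digits followed by a literal pipe: the previous record is complete.
    /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][|]/ {
      if (line != "") print clean(line)
      line = ""
    }

    # Keep extending the current record.
    { line = line $0 }

    # Print the last record.
    END { if (line != "") print clean(line) }
  '
}

# Invented sample: the first record spans three CRLF-terminated lines.
printf '12345678|<a>\r\n   <b>x</b>\r\n</a>\r\n87654321|<c/>\r\n' | join_and_clean
```

The post-processing then lives in one place, and the END block only has to call it once more. This sketch also skips printing while `line` is still empty, which the script above does not do.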

So, thanks for the inspiration again!

--Trifo

Trifo
  • 1) `while ( line ~ "\r") { sub( /\r/,"",line) }` = `while ( sub( /\r/,"",line) ) ;` 2) `sub( />  *</` = `sub( /> +</`, 3) every `~ "foo"` should be `~ /foo/`. – Ed Morton Jul 05 '22 at 11:23
  • 1) OK, that's right. 2) this awk can not handle the "+" regex operator 3) OK, but why? Is it necessary or is it a convention? I did not see it in docs, just in examples. – Trifo Jul 05 '22 at 16:39
  • 2) You've got to use a different awk - the one you're using is broken in ways you have **and ways you haven't** discovered yet. 3) It's required in the same way that writing `if ( 20 > 3 )` is required instead of writing `if ( "20"+0 > "3"+0 )` - in both cases you're writing a string and then asking awk to convert it to the type you really need rather than just writing code using the type you need. `$0 ~ /foo/` is a literal regexp comparison - when you write `$0 ~ "foo"` instead then awk has to convert the string `"foo"` to the regexp `/foo/` first and THEN do the comparison. – Ed Morton Jul 05 '22 at 17:16
  • That has consequences, including execution speed, and that double evaluation means any ```\``` you use has to be doubled, so instead of writing `$0 ~ /foo\.bar/` you'd have to write `$0 ~ "foo\\.bar"`. See https://www.gnu.org/software/gawk/manual/gawk.html#Regexp and in particular https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps. – Ed Morton Jul 05 '22 at 17:20
  • That "pattern" -> /pattern/ change is awesome. The script is way faster now. Also the "while" trick. – Trifo Jul 06 '22 at 07:38
  • Good. See also [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern) for why to never use the word "pattern" when talking about matching text, despite all of the documentation that uses it :-). – Ed Morton Jul 06 '22 at 11:03
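
To illustrate the point in the comments above about regexp literals versus strings, here is a small sketch. It assumes a POSIX-style awk; `regex_demo` and the two sample lines are invented for the demonstration.

```shell
# regex_demo: hypothetical name; compares a regexp literal with strings used as regexps.
regex_demo() {
  printf 'foo.bar\nfooxbar\n' | awk '
    # Regexp literal: a single backslash escapes the dot.
    $0 ~ /foo\.bar/  { print "literal: " $0 }
    # String converted to a regexp: the backslash has to be doubled.
    $0 ~ "foo\\.bar" { print "string: " $0 }
    # Unescaped dot: matches any character, so both lines match.
    $0 ~ "foo.bar"   { print "unescaped: " $0 }
  '
}

regex_demo
```

The first two rules behave the same and match only `foo.bar`, while the third also matches `fooxbar`.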