
I have a large text dataset delimited by |, but one field allows free-form paragraph text, which can contain line breaks and blank lines. All lines that are not part of the paragraph text start with AA|. When I try to import the file into R via readr, these values become NA because the continuation lines don't follow the record structure.

Is there a way to use sed or awk so that any line that doesn't start with AA| is appended, separated by a space, to the preceding line that does?

Input:

AA|5904060|9001084471200270|9000263372600200|Result Comment:
No (1, 3) Beta-D-Glucan detected.  

This assay does not detect certain fungi, including 
Cryptococcus species, which produce very low levels of (1, 
3) Beta-D-Glucan (BDG) and the Mucorales (e.g., Lichthemia, 
Mucor and Rhizopus), which are not known to produce BDG. 
Additionally, the yeast phase of Blastomyces dermatitidis 
produces little BDG and may not be detected by this assay.
|North Building|0|0

Goal Output:

AA|5904060|9001084471200270|9000263372600200|Result Comment: No (1, 3) Beta-D-Glucan detected.  This assay does not detect certain fungi, including Cryptococcus species, which produce very low levels of (1, 3) Beta-D-Glucan (BDG) and the Mucorales (e.g., Lichthemia, Mucor and Rhizopus), which are not known to produce BDG. Additionally, the yeast phase of Blastomyces dermatitidis produces little BDG and may not be detected by this assay.|North Building|0|0
nahata5
  • please add a sample example of such lines and complete expected output for that sample to make it clearer as well as provide a test case to use for answers – Sundeep May 30 '20 at 15:03
  • there's also some Q&A that may help, for example: https://stackoverflow.com/questions/38957730/using-awk-to-join-lines-following-a-match and https://stackoverflow.com/questions/39668670/awk-to-join-or-merge-lines-on-finding-a-pattern – Sundeep May 30 '20 at 15:06
  • great thank you I've added the sample and will check out that qa – nahata5 May 30 '20 at 15:21

3 Answers


With gawk I would do something like this:

awk 'BEGIN {RS="(\n|^)AA\\|"} NR>1 {print "AA|" gensub("\n"," ","g")}' myfile.txt

Explanation: Make the literal string AA|, only when found at the beginning of a line, the record separator. Assuming the very first line will begin with AA|, this will cause an empty record to be found first, and we discard it; processing is done on records from 2 to end (NR > 1). In each record (delimited by this odd delimiter) replace every newline with a space, and print the record with AA| prepended to it (recall that the AA| that existed in the input file is the record separator, so it is no longer in the record itself).

The newline at the end of each record (right before AA| on the next line) is swallowed by the record separator, so there are no errant spaces at the end of each output line, except for the last record, which is not terminated by a "newline AA|" separator. The very last newline in the file survives and is converted to a space in the output; if this extra space at the end of the last record is a problem for your data, it has to be removed separately (not shown above).
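One way the trailing space on the last record could be handled is a small post-processing filter. This is only a sketch, not part of the answer above: the helper name strip_trailing_ws is made up for illustration, and it uses plain awk so that it does not depend on gawk features. It also drops a trailing carriage return, on the assumption that the input may have Windows line endings.

```shell
# Hypothetical helper (not from the answer): strip trailing spaces, tabs
# and carriage returns from every line read on standard input.
strip_trailing_ws() {
  awk '{ sub(/[ \t\r]+$/, ""); print }'
}

# Demo on two already-joined records; the last one carries the stray space.
printf 'AA|1|first record\nAA|2|last record \n' | strip_trailing_ws
```

In a pipeline, the filter would simply be appended after the gawk command, reading its output on standard input.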

  • this worked perfectly, I had also determined that my files were produced on windows so the line endings were `\r\n` in addition so just changing this regex made it work perfectly. – nahata5 Jun 01 '20 at 02:23
  • @nahata5 - however, I made a mistake (which I will fix by editing the answer). I added a space between `AA|` and everything immediately following it, which you don't need. That was just careless on my part. The change is very small - just remove the comma from the `print` command. –  Jun 01 '20 at 02:28

Try:

#!/bin/bash
awk '
  /^AA\|/ { if (r) print r; r = $0; next }
  { r = r " " $0 }
  END { print r }
' input

If you want to avoid redundant spaces, you can add gsub(/  +/, " ", r), which collapses each run of two or more spaces into a single space, as follows:

awk '
  /^AA\|/ { if (r) print r; r = $0; next }
  { r = r " " $0; gsub(/  +/, " ", r) }
  END { print r }
' input
Pierre François
  • so this actually just added a space to the lines that didn't start with `AA` at the beginning and did not append the line to the prior line that started with the `AA`. It definitely recognized the right lines, but I can't get it appended to the prior line – nahata5 May 30 '20 at 17:22
  • @nahata5: I don't understand your comment. When I apply my solution to the input you provided above, I get the result you ask. Something nasty must have happened. – Pierre François May 31 '20 at 10:14
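The asker mentions elsewhere in this thread that the files were produced on Windows with \r\n line endings. That may or may not explain the behaviour described in the comments above, but a variant that removes the carriage return before joining is easy to sketch. The function name join_records is hypothetical, and the sample input is invented for the demo.

```shell
# Hypothetical variant of the script above: remove a trailing carriage
# return from each input line before joining, so CRLF files behave the same.
join_records() {
  awk '
    { sub(/\r$/, "") }                          # drop the CR of a CRLF ending
    /^AA\|/ { if (r != "") print r; r = $0; next }
    { r = r " " $0 }                            # continuation line: append
    END { if (r != "") print r }
  '
}

# Demo with CRLF input: three physical lines form the first record.
printf 'AA|1|Comment:\r\nline two\r\n|X|0\r\nAA|2|b\r\n' | join_records
```

As in the original script, a space is inserted before every appended line, so the closing fields of a record end up as " |X|0" rather than "|X|0".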

With GNU awk for multi-char RS and RT and assuming you know how many fields you should have in each record (8):

$ awk -v RS='([^|]*[|]){7}[^\n]*\n' '{$0=RT; $1=$1; gsub(/ *[|] */,"|")}1' file
AA|5904060|9001084471200270|9000263372600200|Result Comment: No (1, 3) Beta-D-Glucan detected. This assay does not detect certain fungi, including Cryptococcus species, which produce very low levels of (1, 3) Beta-D-Glucan (BDG) and the Mucorales (e.g., Lichthemia, Mucor and Rhizopus), which are not known to produce BDG. Additionally, the yeast phase of Blastomyces dermatitidis produces little BDG and may not be detected by this assay.|North Building|0|0

Otherwise, if you don't have GNU awk, or you only know that every record begins with a line starting with AA|, then using any awk:

$ awk '/^AA\|/ { if (NR>1) prt(); rec="" } { rec = rec OFS $0 } END{ prt() }
    function prt(o){o=$0; $0=rec; $1=$1; gsub(/[[:space:]]*[|][[:space:]]*/,"|"); print; $0=o}
' file
AA|5904060|9001084471200270|9000263372600200|Result Comment: No (1, 3) Beta-D-Glucan detected. This assay does not detect certain fungi, including Cryptococcus species, which produce very low levels of (1, 3) Beta-D-Glucan (BDG) and the Mucorales (e.g., Lichthemia, Mucor and Rhizopus), which are not known to produce BDG. Additionally, the yeast phase of Blastomyces dermatitidis produces little BDG and may not be detected by this assay.|North Building|0|0
Ed Morton
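Since the first command relies on each record having 8 fields, it may be worth checking the joined output before importing it. A minimal sketch, assuming the joined records arrive on standard input; check_fields is a made-up helper name and the sample records are invented for the demo.

```shell
# Hypothetical sanity check: report joined records whose field count differs
# from the expected number (8 in the question's sample).
check_fields() {
  awk -F'|' -v want="${1:-8}" 'NF != want { printf "line %d: %d fields\n", NR, NF }'
}

# Demo: the first record has 8 fields, the second only 5.
printf 'AA|1|2|3|text|North Building|0|0\nAA|1|2|3|broken\n' | check_fields 8
```

No output means every record has the expected number of fields; a paragraph field that itself contains a | would show up here as a record with too many fields.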