Bash/C - Correct invalid values in a colum in the output of the script with the values from other output/file

Question

(Revisited and reworked post. Thank you for the clarification.)

I would like to ask you for the help with the problem I have encountered, which course of action is not really clear to me and besides some basic logical base I am struggling with the code implementation.

Specific example - output, showing particular logical volumes

//    $1_1                $1_2      $1_3 (for an ease I will call columns like this to distinguish it from the ones below
ALASKA_VOL00009873      offline    SB98MENO
FRANCE_ICSSI00964       online     FRANCE  //instead of SB91VMA3
JUNIPER_ROOT            suspended  S4MELIC15
NZELAND_VOL339643750    frozen     NZELAND //instead of S6B1B1AQ
DONKEYKONG_ISCSI002194  offline    SB99A95Z

Where the first column is the full name of the logical volume, second is the status (not important for us) and the third one should contain just a server names (in this case FRANCE and NZELAND should not be there, those names and the first part of the volume names are the names of the virtual servers on the "physical" one), therefore should be the ones I have typed after them - fixing these requires a new addition to the script, which takes output of the massive analysis command and format it in this form.

The steps I want would be placed after it - it would be as the addition to it, which would fix the possible issues in the third column (SERVER NAME), when invalid value would appear.

For this matter there would be always the file with the output from other script, where are correct names of the servers in the conjunction with the virtual names.

This file's content would have a form like this:

 //  $2_1         $2_2
    ALASKA       SB98MENO
    FRANCE       SB91VMA3
    JUNIPER      S4MELIC15
    NZELAND      S6B1B1AQ
    DONKEYKONG   SB99A95Z

Where the first column is the name of the virtual server and the second one of the physical server.

In this case it is new bunch of actions to take - I have a few if any experience with selecting something from the file, which is outside the script, comparing particular values in the particular rows and columns and then to make some substitutions, so I would like to ask you for your help and advice.

I have the idea, how it should work - as my knowledge goes (please ignore no syntax, this is like structurogram)

It would be as an addition in the for loop

for (line counter, after each cycle it will move to the another line in the main output (first one), where)
do
 if (the value in the $1_3 matches the value in any line in the $2_1)
  then 
    "Value from the same line, but in the column $2_2" = "Actual scanned (by for) value in the particular line in the $1_3" // basically rewriting the incorrect value with the correct - there is the first problem I have - how to address all there different positions in the file, such as the value in the second column in the same row, where we found a match ????
  else
    continue (or break ???) //break the loop entirely
done

At the end, wanted output should be like:

ALASKA_VOL00009873      offline    SB98MENO
FRANCE_ICSSI00964       online     SB91VMA3
JUNIPER_ROOT            suspended  S4MELIC15
NZELAND_VOL339643750    frozen     S6B1B1AQ
DONKEYKONG_ISCSI002194  offline    SB99A95Z

To give you an correct example of my output...so with my previous one

join -a1 -1 4 -2 1 <(echo "$VAR3") <(echo "$VAR2") | awk  '{print $2" "$3" "$4" - "$5}'

I have received

     $2              $3     $4       $5       <--this is not in output, it is just for simple orientation, which column is which
APPLE_ISCSI01      offline aggrB2 - EELN1723
GRAPEFRUIT_ISCSI13 offline aggr1 -
GRAPEFRUIT_ISCSI04 offline aggr1 - XX643863
WOLFVILLE_ISCSI48  offline aggr1 - A7S5D1DCY0
WOLFVILLE_ISCSI49  offline aggr1 - A7S5D1DCY0
WOLFVILLE_ISCSI50  offline aggr1 - A7S5D1DCY0
WOLFVILLE_ISCSI51  offline aggr1 - A7S5D1DCY0

The problem is that blank space at the $5 in the second row, actually

There are no tabs as well, I have just added them for it being more clear

Without awk, there is of course this column - which is the first and skipped when formatting

  $1
APPLE
XX643863
GRAPEFRUIT
WOLFVILLE
WOLFVILLE
WOLFVILLE
WOLFVILLE

Basically, now I need just to replace the blank space in final output with this one XX643863 (second row). Variables in the command are just those text files (sorted as well), but as I have mentioned, I wanted to avoid making and then removing the files.

I have tried your awk's - unfortunately no success. Is it possible with awk...or sed ?

Any solution is talarmade for the number of columns you have.. So please post the actual (semi-actual) results. We can't work with BLABLA stuff.. — sjsam, Aug 25 '16 at 14:08
[edit] your question to just show us concise, testable sample input and expected output. If the BLABLA columns are present in an input file then show them there rather than having them appear from nowhere in the desired output and don't use `xxxxx` as values - show truly representative values otherwise you're going to end up with a solution that works when your input contains `xxxxx` but fails for your real data. — Ed Morton, Aug 25 '16 at 14:12
Damn, and I just thought he didn't know how to spell *ALABAMA*... — David C. Rankin, Aug 25 '16 at 14:34
This question has been tagged as `c`, `bash` and `awk` but surprisingly I have not found a single source line. — Jdamian, Aug 25 '16 at 19:16

featuredpeow · Accepted Answer · 2016-08-30T08:48:58.150

Process which you described is a left join between two tables on a common key. In SQL, this could be expressed by query:

SELECT volumes.volume_name, volumes.volume_state, 
COALESCE(mapping.real_name, volumes.alias_name) 
FROM volumes LEFT JOIN mapping on volumes.alias_name = mapping.alias_name

actually, one of possible solutions of this for you is to load it in sqlite and perform that query to obtain result given you can obtain data in csv/tsv (tabs) format:

create table volumes(volume_name varchar, status varchar, alias_name varchar);
create table mapping(alias_name varchar, real_name varchar);

.mode csv

.import volumes.csv volumes
.import mapping.csv mapping

.output result.csv

SELECT volumes.volume_name, volumes.status, 
COALESCE(mapping.real_name, volumes.alias_name) 
FROM volumes LEFT JOIN mapping on volumes.alias_name = mapping.alias_name;

This can then be executed as a script using command cat script.sql|sqlite3.

Alternatives are: use join command together with sort and awk or writing a left join manually in bash or other language (awk or python).

Here is solution using sort, join and awk:

sort -k1 mapping > sorted_mapping
sort -k3 volumes > sorted_volumes

join -a1 -1 3 -2 1 sorted_volumes sorted_mapping

will give almost what you wanted:

FRANCE FRANCE_ICSSI00964 online SB91VMA3
NZELAND NZELAND_VOL339643750 frozen S6B1B1AQ
S4MELIC15 JUNIPER_ROOT suspended
SB98MENO ALASKA_VOL00009873 offline
SB99A95Z DONKEYKONG_ISCSI002194 offline

awk can be used to format it properly:

join -a1 -1 3 -2 1 sorted_volumes sorted_mapping|awk '{print $2, $3, NF==3 ? $1 : $NF}'

This solution do so called sort-merge join.

First we sort both inputs using sort command. Then we use join command to match lines with equal keys (match 3rd column from volumes with 1st column from mapping). join outputs the key as first column so we had to manually re-arrange columns using awk to match desired output format. There is also condition (ternary operator) which detects if there was a match or not between volumes and mapping by checking columns count. If there was a match, we take column from mapping (last one), otherwise we take key (1st column).

Update

I don't recommend it because it's dog sloooo-o-o-o-o-o-o-ow and is essentially re-invention of bicycle (doing manually what is already done by join command), but here is bash version of nested loop left join

Previously here was naive implementation of nested loop join in bash, but it was so sl-o-o-o-o-o-o-o-o-ow that I decided to replace it with this evidence of it's slowness instead:

$ wc -l volumes2.txt             
100 volumes2.txt
$ wc -l mapping2.txt             
100 mapping2.txt
$ time ./nested_loops.sh > output                              

./nested_loops.sh > output  8.01s user 28.79s system 121% cpu 30.276 total

Just don't do nested loop join in pure bash.

Thnk fou very much for your help. Is it possible to exclude join ? — dd_code, Aug 30 '16 at 06:47
Yes, if you will write left join logic in some language (bash/python) or use some library which can do it (for example pandas in python) — featuredpeow, Aug 30 '16 at 07:00
Is it possible it would work as I have added in the main post ? It starts with "WOULD THIS WORK?" To be absolutely honest I do not really understand the sort/join/awk solution — dd_code, Aug 30 '16 at 07:20
You can match rows by key (3rd column in first file, 1st column in second file) and output row with substituted correct value. If your files are small (like less than 1000-3000 records each) you can write simple [nested loops join](https://en.wikipedia.org/wiki/Nested_loop_join) algorithm to do it. Only distinction is that it should output a row after nested loop if nested loop didn't found a match (will add bash version to answer) — featuredpeow, Aug 30 '16 at 07:32
But, I have found out an issue join -a1 -1 4 -2 1 <(echo "$VAR3") <(echo "$VAR2") | awk '{print $2" "$3" "$4" - "$5}' This is the final join used, but I can see, if there was (in the very first output, which was needed to be corrected) correct value at first, now, it leaves the blank space, like you have in your output from join, so for example: S4MELIC15 JUNIPER_ROOT suspended it worked because in my particular output I had all values with incorrect names, so the last column had all new ones correct, so I have used awk, to print just the last one with correct values, but if it was correc — dd_code, Aug 31 '16 at 10:24
Yes, join output has different number of columns for case when there was a match and when there was not. When there is no match it just prints row from first file with key (like FRANCE etc) moved to first column. I used `if` (ternary operator) inside awk command to workaround it like this: `awk '{print $2, $3, NF==3 ? $1 : $NF}` . I guess you need something similar like `awk '{print $2, $3, NF==3 ? $1 : $4, "-", NF==3 ? $1 : $5}'` — featuredpeow, Aug 31 '16 at 11:07
Thank you again, kryger, I have successfully implemented the awk and it works absolutely perfect. — dd_code, Sep 05 '16 at 19:34

Bash/C - Correct invalid values in a colum in the output of the script with the values from other output/file

1 Answers1