Merge two files, line by line, after matching pattern in a new line

Question

I need to merge two files wherever there is a match. The matching value is not fixed; it varies, but it always follows one specific attribute.

File 1

<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>

File 2

<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

I need to produce a third file like this:

<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

I hope this is clear: if the value of the attribute "site_id=" in file 1 matches the value of "same_as=" in file 2, I need to merge the data.

Honestly, I have no idea how to get this result. I checked many posts, but they all merge the data onto the same line; I can't find anything that merges the data onto new lines.

I would prefer sed or awk if possible, but every suggestion is welcome.

Thank you in advance.

score 1 · Answer 1 · answered Nov 28 '18 at 22:57

This assumes file2 is sorted by the key, i.e. all lines with the same same_as value are adjacent:

$ awk -F' |=' 'NR==FNR {for(i=1;i<NF;i++) if($i=="site_id") {a[$(i+1)]=$0; break}; next}
                       {k=""; for(i=1;i<NF;i++) if($i=="same_as") {k=$(i+1); break}
                        if(!p[k]++ && (k in a)) print a[k]}1' file1 file2

<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

P.S. This should be much faster than the loop-based solutions on large files, because each file is read only once instead of running grep over file2 for every line of file1.
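
If file2 is not sorted by the key, one possible variation is to read file2 first, group its lines by their same_as value, and then print each file1 line followed by its group. This is only a sketch under the same assumptions as above (one element per line, attributes separated by single spaces, file2 not empty); the function name merge_by_key is made up for illustration.

```shell
#!/usr/bin/env bash
# merge_by_key FILE1 FILE2  (hypothetical helper name, not from the answer)
# Prints every FILE1 line followed by the FILE2 lines whose same_as value
# equals the site_id value of that line. FILE2 does not need to be sorted.
# Assumes FILE2 is not empty, otherwise the NR == FNR test misfires.
merge_by_key() {
    awk -F' |=' '
        # First file on the command line (FILE2): group lines by same_as value.
        NR == FNR {
            for (i = 1; i < NF; i++)
                if ($i == "same_as") { grp[$(i+1)] = grp[$(i+1)] $0 "\n"; break }
            next
        }
        # Second file (FILE1): print the line, then its group if there is one.
        {
            print
            for (i = 1; i < NF; i++)
                if ($i == "site_id") { printf "%s", grp[$(i+1)]; break }
        }
    ' "$2" "$1"
}
```

It could be called as: merge_by_key file1 file2 > file3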

Your solution also works really well! Thank you for your answer. My offer still stands: if you are ever near Milano, a pitcher of beer is waiting for you :) – Tapiocapioca, Nov 28 '18 at 23:07

Michael · Answer 2 · 2018-11-28T21:05:40.347

0

Read file1 line by line, extract the site_id value from each line, and search for it in file2.

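# "|| [ -n "$line" ]" also processes a last line that has no trailing newline
while IFS= read -r line || [ -n "$line" ]; do
        echo "$line"
        # extract the value of site_id
        url=$(sed 's/.*site_id="\([^"]*\)".*/\1/' <<< "$line")
        # fixed-string match on the whole attribute, so "." is not a wildcard
        grep -F "same_as=\"$url\"" file2
done < file1 > file3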

$ cat file3
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

edited Nov 28 '18 at 21:05

answered Nov 28 '18 at 20:50

Michael

I'll try it, thank you very much for your help. A pitcher of beer is on me whenever you are in Milano! – Tapiocapioca Nov 28 '18 at 21:57
@Tapiocapioca I'm in the middle of Siberia, but I hope to be there sometime :) – Michael Nov 28 '18 at 22:06
I hope I can visit Siberia with my girlfriend :) But maybe you prefer vodka :P It is working, but it ignores the last match. What could be the cause? My result is: foo@com foo 01 foo 02 bar@com bar 01 – Tapiocapioca Nov 28 '18 at 22:21

score 0 · Answer 3 · answered Nov 28 '18 at 20:54

If you know for sure that these formats are consistent and each element is always on a single line...

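$: cat c # file 1 is a, file 2 is b
#!/usr/bin/env bash

# "|| [[ -n $line ]]" keeps a final line that lacks a trailing newline
while IFS= read -r line || [[ -n $line ]]
do pat="${line##* site_id=\"}"   # drop everything up to the opening quote
   pat="${pat%%\"*}"             # drop everything from the closing quote on
   echo "$line"
   grep " same_as=[\"]$pat[\"] " b
done < a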

$: c
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

answered Nov 28 '18 at 20:54

Paul Hodges

A pitcher of beer is waiting for you if you are ever in Milano :) – Tapiocapioca Nov 28 '18 at 21:58
Yes, your script also works really well, but it has the same problem as the other one: it ignores and doesn't save the last match. I can work around this by adding a fake line at the end, but it is strange... I am using the Linux subsystem under Windows... Maybe that is the problem. – Tapiocapioca Nov 28 '18 at 22:29
See https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline to understand why you need a `\n` at the end of the last line – Corentin Limier Nov 28 '18 at 22:49
Don't worry, I know about that; I use the dos2unix tool before running every script ;) – Tapiocapioca Nov 28 '18 at 23:02
Huh. I explicitly removed the newline and confirmed it's gone, and expected it to fail the same for me but it doesn't. Which version of grep are you using? `grep (GNU grep) 3.0` here. – Paul Hodges Nov 29 '18 at 14:12
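
The skipped last record is consistent with the input file of the loop lacking a final newline: read still fills the variable at end of input but returns a non-zero status, so the loop body is not run for that line. A small sketch of the effect and the usual workaround (the function names are invented for the demonstration):

```shell
#!/usr/bin/env bash
# Hypothetical demo functions: both read standard input line by line.

# Plain loop: a last line without a trailing newline is silently dropped.
read_plain() {
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done
}

# Workaround: also run the body when read fails but left text in "line".
read_all() {
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done
}

printf 'one\ntwo' | read_plain   # prints: one
printf 'one\ntwo' | read_all     # prints: one, then two
```

Appending the missing newline to the file has the same effect as the extra test in the loop condition.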

potong · Answer 4 · 2018-11-29T20:05:52.287

0

This might work for you (GNU sed):

sed 's#.*same_as=\("[^"]*"\).*#/site_id=\1/a&#' file2 | sed -f - file1

This turns file2 into a sed script in which each line becomes an append command keyed on its same_as value, so that the line is appended after the file1 line whose site_id has the same value. The generated script is then piped to a second invocation of sed, which runs it against file1. Each time a line from file1 is read, the matching lines from file2 are appended to it in sequence.
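
To see what the first stage produces, it can help to run it on its own. A small sketch using two lines of file2 from the question (the temporary file and the function name gen_script exist only for this demonstration):

```shell
#!/usr/bin/env bash
# Demonstration only: build a two-line sample of file2 in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
EOF

# First stage of the pipeline: every line becomes "/site_id=KEY/a" + the line,
# i.e. one sed "append" command per line of file2.
gen_script() {
    sed 's#.*same_as=\("[^"]*"\).*#/site_id=\1/a&#' "$1"
}

gen_script "$sample"
# Output:
# /site_id="foo.com"/a<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
# /site_id="bar.com"/a<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
rm -f "$sample"
```

Each generated line is an address (the site_id to look for in file1) followed by the text to append after a matching line.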

To delete lines from file1 which do not have a match in file2, use:

sed -e 's#.*same_as=\("[^"]*"\).*#/site_id=\1/{a&\nx;s/^/x/;x}#' file2 |
sed -f - -e 'x;/x/{z;x;b};d' file1

This adds a flag in the hold space, which is set when a line from file2 is appended; when the flag is not set, the current line from file1 is deleted.

edited Nov 29 '18 at 20:05

answered Nov 29 '18 at 07:05

potong

Thank you very much for your help, your suggestion also works really well! :) :) – Tapiocapioca Nov 29 '18 at 18:19
I want to ask: is it possible NOT to include a line from the first file if it has no match? This way all lines from the first file are included, but I can't use them without the matching lines from the second file. – Tapiocapioca Nov 29 '18 at 18:31
A pitcher of beer and a pizza are on me for you too in Milano :) – Tapiocapioca Nov 29 '18 at 20:51
