-1

I have a file, file1, with the following content:

<action>
 <row>
    <column name="book" label="book">stick man (2020)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/</column>
 </row>
<row>
    <column name="book" label="book">python easy (2019)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/</column>
 </row>
</action>

I want to extract the contents of the file using a Linux script or command (sed, grep, or awk). Example output:

stick man (2020) | http://172.22.215.234/Data/Book/Journal/2016_2020/1%/20Stick%20%282020%30
python easy (2019) | http://172.22.215.234/Data/Book/Journal/2016_2020/%2/20Buck%20%282019%30

My code:

grep -oP 'href="([^".]*)">([^</.]*)' file1

Please help, I am a newbie :)

john
    [Do not parse HTML with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Use xmllint or xmlstarlet, a language with XML support such as Python, Ruby, or Perl, or another XML-aware tool. – KamilCuk Apr 15 '21 at 10:01
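To illustrate the XML-aware route suggested in the comment above, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `extract_rows` and the inline `SAMPLE` string are made up for this example, and it assumes the file is well-formed XML shaped like the sample in the question. Stripping the trailing `/` is also an assumption, based on the example output in the question.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (hypothetical inline stand-in for file1).
SAMPLE = """<action>
 <row>
    <column name="book" label="book">stick man (2020)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/</column>
 </row>
<row>
    <column name="book" label="book">python easy (2019)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/</column>
 </row>
</action>"""


def extract_rows(xml_text):
    """Return one 'book | referensi' string per <row> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for row in root.iter("row"):
        # Map each column's name attribute to its whitespace-trimmed text.
        cols = {c.get("name"): (c.text or "").strip() for c in row.findall("column")}
        # Assumption: the trailing "/" is unwanted, as in the question's example output.
        book = cols.get("book", "").rstrip("/")
        ref = cols.get("referensi", "").rstrip("/")
        lines.append(f"{book} | {ref}")
    return lines


if __name__ == "__main__":
    for line in extract_rows(SAMPLE):
        print(line)
```

To read the real file instead of the inline string, something like `extract_rows(open("file1").read())` should work, provided the file parses as XML.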

3 Answers

0

This

<action>
 <row>
    <column name="book" label="book">stick man (2020)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/</column>
 </row>
<row>
    <column name="book" label="book">python easy (2019)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/</column>
 </row>
</action>

does look like a piece of an HTML file. If you are allowed to install utilities on your system, I suggest trying hxselect, which is useful when you want to extract something you can describe with a CSS selector. For example, to get the content of all columns whose label is referensi from file.html:

cat file.html | hxselect -i -c -s '\n' 'column[label=referensi]'
Daweo
  • Can the `hxselect` _command_ be written to produce the output shown in the OP? I ask because the example as coded just returns the **URL**, not e.g. `stick man (2020) | http ...`. – user3439894 Apr 15 '21 at 21:47
  • @user3439894 I doubt `hxselect` itself is able to do so, however it should be relatively easy to process after extracting data using that tool – Daweo Apr 16 '21 at 07:15
  • So if I understand your response to my question in the previous comment, and using the _data_ in the OP, I'd have to call `hxselect` twice and process its output with additional standard utilities. If that's the case, then with different _data_ having many _targets_ it's not a very practical utility. Anyway, thanks for your response; it's nice to have another _tool_ to work with. – user3439894 Apr 16 '21 at 15:08
0
$ awk -v RS='<[^>]+>' 'NF{printf "%s", $0 (++c%2?" |":ORS)}' file

stick man (2020)/ | http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/
python easy (2019)/ | http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/

Note that the trailing forward slashes are present in your original data.

This requires multi-character RS support (e.g. GNU awk).
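For readers unfamiliar with a regex `RS`, the logic of the one-liner can be spelled out roughly as follows in Python. This is only an illustrative sketch: `join_text_between_tags` and `SAMPLE` are names invented here, and like the awk command it treats tags as plain text patterns rather than parsing the XML.

```python
import re

# Sample data copied from the question (hypothetical inline stand-in for the file).
SAMPLE = """<action>
 <row>
    <column name="book" label="book">stick man (2020)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/</column>
 </row>
<row>
    <column name="book" label="book">python easy (2019)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/</column>
 </row>
</action>
"""


def join_text_between_tags(text):
    # Every tag acts as a record separator, like RS='<[^>]+>' in GNU awk.
    records = re.split(r"<[^>]+>", text)
    # Keep only records with at least one non-blank field (the NF test in awk).
    kept = [rec for rec in records if rec.split()]
    out = []
    for count, rec in enumerate(kept, start=1):
        # Odd records (book) are followed by " |", even ones (URL) by a newline.
        out.append(rec + (" |" if count % 2 else "\n"))
    return "".join(out)


if __name__ == "__main__":
    print(join_text_between_tags(SAMPLE), end="")
```

The space after the `|` in the output comes from the leading space inside the `referensi` column itself, which is why the separator is written as `" |"` without a trailing space.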

karakfa
  • The output shown doesn't match the example output in the OP. He's looking for e.g. `stick man (2020) | http ...` not `stick man (2020)/ | http ...`, noting the `/` after the `)` is not wanted. – user3439894 Apr 15 '21 at 21:51
  • I'm almost sure his sample data has copy/paste errors or typos. There is no need to remove non-existent chars since there is no mention of trimming them in the question. – karakfa Apr 15 '21 at 22:40
0

With awk you can try:

awk -F'>|/<'  '{ORS= (NR == 3 || NR == 7) ? " |" : "\n"} $2 != "" {print $2}' file
stick man (2020) | http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30
python easy (2019) | http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30
  • Or shorter:
awk -F'>|/<'  '{ORS= (NR%2) ? " |" : RS} $2 != "" {print $2}' file
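The same idea, written out step by step in Python as a rough sketch (the function name `pair_by_line` and the inline `SAMPLE` are invented for illustration). Each line is split on `>` or `/<`, the second field is kept when it is non-empty, and the parity of the line number decides whether `" |"` or a newline follows. Like the awk version, this assumes the book lines fall on odd line numbers and the URL lines on even ones, so it breaks if the layout of the file changes.

```python
import re

# Sample data copied from the question (hypothetical inline stand-in for the file).
SAMPLE = """<action>
 <row>
    <column name="book" label="book">stick man (2020)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/1%20Stick%20%282020%30/</column>
 </row>
<row>
    <column name="book" label="book">python easy (2019)/</column>
    <column name="referensi" label="referensi"> http://172.22.215.234/Data/Book/Journal/2016_2020/2%20Buck%20%282019%30/</column>
 </row>
</action>
"""


def pair_by_line(text):
    out = []
    for nr, line in enumerate(text.splitlines(), start=1):
        # Same field separator as awk -F'>|/<': split on ">" or "/<".
        fields = re.split(r">|/<", line)
        # Only column lines have a non-empty second field.
        if len(fields) > 1 and fields[1] != "":
            # Odd line numbers hold the book, even ones the URL (layout assumption).
            out.append(fields[1] + (" |" if nr % 2 else "\n"))
    return "".join(out)


if __name__ == "__main__":
    print(pair_by_line(SAMPLE), end="")
```

Because `/<` is part of the separator, the trailing `/` before each closing tag is dropped, which matches the output the question asks for.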

Carlos Pascual