I'm looking to extract the programme title and sub-title from the (clipped) XML file below. I was extracting each one separately with xmllint and sed and then combining them into one file, but I have since discovered that some entries have only a title and no sub-title, which puts the pasted columns out of step. In that case I would like to leave the sub-title blank. Could someone suggest a way to account for this?

XML File

<programme start="20171013170000 +0100" stop="20171013180000 +0100" channel="b492458d826d592ec7c528545a16c757">
  <title lang="eng">Accessories Gift Hall</title>
  <sub-title lang="eng">Find the perfect gift with fashion accessories by some of our most sought-after brands. From chic purses and wallets to cosy PJs and slippers, there&apos;s something for everyone.</sub-title>
</programme>
<programme start="20171013180000 +0100" stop="20171014130000 +0100" channel="b492458d826d592ec7c528545a16c757">
  <title lang="eng">..programmes start again at 1pm</title>
</programme>
<programme start="20171014130000 +0100" stop="20171014140000 +0100" channel="b492458d826d592ec7c528545a16c757">
  <title lang="eng">Ruth Langsford&apos;s Fashion Edit</title>
  <sub-title lang="eng">TV personality and QVC fashion ambassador, Ruth Langsford, shares her favourite looks and must-have pieces that will transform your wardrobe and have you looking fabulously stylish.</sub-title>
</programme>

Bash commands v1

xmllint --xpath "//programme/title" xmltv | sed -r 's/\n//g' | sed 's/<\/title>/\n/g' | sed 's/<title lang="eng">//g' > 1.txt
xmllint --xpath "//programme/sub-title" xmltv | sed -r 's/\n//g' | sed 's/<\/sub-title>/\n/g' | sed 's/<sub-title lang="eng">//g' > 2.txt
paste <(cat 1.txt) <(cat 2.txt) > 3.txt

Thanks!

3 Answers

Here's an example using the sel command of xmlstarlet from the command line...

$ xmlstarlet sel -T -t -m '//programme' -v 'concat(normalize-space(title)," ",normalize-space(sub-title))' -n input.xml
Accessories Gift Hall Find the perfect gift with fashion accessories by some of our most sought-after brands. From chic purses and wallets to cosy PJs and slippers, there's something for everyone.
..programmes start again at 1pm
Ruth Langsford's Fashion Edit TV personality and QVC fashion ambassador, Ruth Langsford, shares her favourite looks and must-have pieces that will transform your wardrobe and have you looking fabulously stylish.

I'm separating the title and sub-title by a single space, but that can be changed.
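
As a minimal sketch of changing the separator: xmlstarlet's `-o` option emits a literal string, so the `concat()` can be split into two `-v` values with a delimiter between them. The function name `titles_delimited` and the `|` default are illustrative assumptions, not part of the command above.

```shell
#!/bin/bash
# Hypothetical helper: print "title<delim>sub-title" for every <programme>.
#   $1 = XML file (must be well-formed, i.e. have a single root element)
#   $2 = delimiter, defaults to "|"
titles_delimited() {
    local file=$1 delim=${2:-|}
    # A missing sub-title yields an empty string, so the second field stays blank
    xmlstarlet sel -T -t -m '//programme' \
        -v 'normalize-space(title)' -o "$delim" \
        -v 'normalize-space(sub-title)' -n "$file"
}
```

Calling `titles_delimited input.xml` would then print one delimited line per programme, with nothing after the delimiter when there is no sub-title.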

Daniel Haley

What I would do:

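#!/bin/bash

# Number of <programme> elements in the file
count=$(xmllint --xpath "count(//programme)" /tmp/file.xml)

for ((i=1; i<=count; i++)); do
    xmllint --xpath "//programme[$i]/title/text()" /tmp/file.xml
    echo -n '|'
    # With no sub-title, xmllint reports "XPath set is empty" on stderr;
    # discard that message so the field is simply left blank
    xmllint --xpath "//programme[$i]/sub-title/text()" /tmp/file.xml 2>/dev/null
    echo
done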
Gilles Quénot

In one pass with sed

sed '/<title/!d;N;/<sub-title/!s/\n.*//' xmltv
ctac_
  • Until someone formats the XML and either the `title` or `sub-title` isn't all on the same line anymore. (https://stackoverflow.com/a/1732454/317052) – Daniel Haley Oct 23 '17 at 22:19
  • Yes, I need to strip out the XML and ideally have the title and sub-title next to each other. – user2679016 Oct 23 '17 at 22:26
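
Following up on the last comment, here is a line-based sketch that also strips the tags and puts the title and sub-title on one line, separated by a tab. It assumes GNU sed (for `\t` and `\n` in the replacement) and that each `title` and `sub-title` element sits on a single line of its own, so the caveat in the first comment still applies. Entities such as `&apos;` are left undecoded. The function name `titles_tsv` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: one line per programme, "title<TAB>sub-title".
# Reads the file given as $1, or stdin when no argument is given.
titles_tsv() {
    sed -n '
/<title/{
# pull the following line into the pattern space
N
# next line is not a sub-title: keep only the newline as an empty field
/<sub-title/!s/\n.*/\n/
# strip the tags, then the leading indentation
s/<[^>]*>//g
s/^[[:space:]]*//
# join the two fields with a tab
s/\n[[:space:]]*/\t/
p
}' "$@"
}
```

Usage would be along the lines of `titles_tsv xmltv > 3.txt`; a programme without a sub-title ends in a tab followed by an empty field.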