Background: We are aggregating content from a few websites (with permission) to power supplementary search features in another application. An example is the news section of https://centenary.bahai.us. We planned to use xidel for this, since its template-file paradigm seems an elegant way to extract data from HTML. For example, given this template:
```html
<h1 class="title">{$title}</h1>?
<div class="node build-mode-full">
{$url:=$url}
<div class="field-field-audio">?
<audio src="{$audio:='https://' || $host || .}"></audio>?
</div>?
<div class="field-field-clip-img">
<a href="{$image:='https://' || $host || .}" class="imagefield-field_clip_img"></a>*
</div>?
<div class="field-field-pubname">{$publication}</div>?
<div class="field-field-historical-date">{$date}</div>?
<div class="location"><div class="adr">{$location}</div>?</div>?
<div class="node-body">{$text}</div>
</div>?
```
...we can run a command like the following (as we understand it, the trailing `?` and `*` in the template mark elements as optional or repeatable in xidel's pattern syntax):

```sh
xidel "https://centenary.bahai.us" -e "$(< template.html)" \
  -f "//a[contains(@href, '/news/')]" \
  --silent --color=never --output-format=json-wrapped > index.json
```
...which gives us JSON-formatted data for every news page on centenary.bahai.us. An example article looks like this:
```json
{
  "title": "Bahá’ísm the Religion of Brotherhood",
  "url": "https://centenary.bahai.us/news/bahaism-religion-brotherhood",
  "audio": "https://centenary.bahai.us/sites/default/files/453_0.mp3",
  "image": "https://centenary.bahai.us/sites/default/files/imagecache/lightbox-large/images/press_clippings/03-31-1912_NYT_Bahaism_the_Religion_of_Brotherhood.png",
  "publication": "The New York Times",
  "date": "March 31, 1912",
  "location": "New York, NY",
  "text": "A posthumous volume of “Essays in Radical Empiricism,” by William James, will be published in April by Longmans, Green & Co. This house will also bring out “Leo XIII, and Anglican Orders,” by Viscount Halifax, and “Bahá’ísm, the Religion of Brotherhood, and Its Place in the Evolution of Creeds,” by Francis H. Skrine. In the latter an analysis is made of the Gospel of Bahá’u’lláh and his successor. ‘Abdu’l-Bahá — whose arrival in this country is expected early in April — and a forecast is attempted of its influence on civilization."
},
```
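A quick way to sanity-check the result (assuming jq is installed, and that json-wrapped yields a top-level array of per-page objects, as the excerpt above suggests):

```sh
jq 'length' index.json           # how many articles were captured
jq -r '.[0].title' index.json    # spot-check the first record
```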
That's just beautiful, and a whole lot easier than some mishmash of httrack and pup or (God forbid) sed and regex, but there are a few issues:
- We would like a separate file for each document, whereas this gives us one large JSON file (see the workaround sketch after this list).
- Even with the `--silent` flag, we still get status messages in the output which invalidate the JSON, such as `**** Retrieving (GET): https://centenary.bahai.us ****`, `**** Processing: https://centenary.bahai.us/ ****`, or `** Current variable state: **`.
- The process seems too fragile: if there is any discrepancy between the template and the actual HTML, the entire run errors out and we get nothing. We would prefer xidel to report an error for the one failing page and then continue with the next URL (the sketch below approximates this with a per-page loop).
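For what it's worth, the direction we are experimenting with as a workaround is a wrapper loop: collect the article URLs once, then run xidel separately per page, so each article lands in its own file and a single bad page cannot abort the crawl. The `out/` directory, the file naming, and the `grep` filter for `**`-prefixed status lines are our own scaffolding, not xidel features, so treat this as a sketch rather than a fix:

```sh
#!/usr/bin/env bash
set -o pipefail   # so a failing xidel is not masked by grep's exit status
mkdir -p out

# Collect absolute article URLs once; resolve-uri() makes relative hrefs absolute.
xidel --silent "https://centenary.bahai.us" \
  -e "//a[contains(@href, '/news/')]/resolve-uri(@href)" | sort -u > urls.txt

# One xidel run per page: a template mismatch loses only that page, and
# stripping "**"-prefixed status lines keeps each output file valid JSON.
while read -r url; do
  if ! xidel --silent --color=never "$url" \
         -e "$(< template.html)" --output-format=json-wrapped \
         | grep -v '^\*\*' > "out/$(basename "$url").json"; then
    echo "FAILED: $url" >&2   # log and continue with the next URL
  fi
done < urls.txt
```

Even if that works, it feels like we are re-implementing what `-f` and `--silent` ought to handle for us, hence the question.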
Xidel seems like a game-changing tool which should make this job possible with a one-line command and a simple extract template file; what am I missing here?