
I've been struggling these days to get SAS to read an .xfdf file, an export of the comments (annotations) in a PDF made with Adobe Acrobat Professional. If you have never worked with an .xfdf file, don't worry: it is basically an XML-based format from Adobe.

I can't use SAS XML Mapper, for two reasons: first, I can't use it at my workplace (where I also develop personal projects like this one); second, I'd like to write a procedure that can be rerun every time without having to redo the mapping.

Comments are usually stored in the XFDF in this format:

><freetext rect="300.165985,66.879105,380.165985,86.879105" creationdate="D:-001-1-1-1-1-1-00'30'" name="a7311cdb-77b3-4a48-8eff-62364f94213d" color="#FFBF00" flags="print" date="D:20150730153125+01'00'" page="0"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:11.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p
>THE_COMMENT_TO_EXPORT_IS_THIS_STRING</p
></body
></contents-richtext
></freetext

I gather that data with this portion of the XML map:

<COLUMN name='var1'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>

Sometimes comments are stored in another way:

><freetext rect="331.041992,230.949005,553.198975,250.949005" creationdate="D:-001-1-1-1-1-1-00'30'" name="4f112387-dec6-42f1-ad8c-a1fecf9d8e04" color="#66CCFF" flags="print" date="D:20150730153213+01'00'" page="0"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:11.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p dir="ltr"
><span style="font-family:Arial"
>THE_COMMENT_TO_EXPORT_IS_THIS_STRING</span
></p
></body
></contents-richtext
></freetext

No problem here either; I can gather this comment with this portion of the XML map:

<COLUMN name='var2'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>

But here comes the problem: sometimes the data is stored in this strange format, with two span tags:

><freetext rect="9.623672,760.177979,210.281006,783.448975" creationdate="D:00000000000000Z" name="4f037e18-9143-4ec1-a6ae-249fa2215528" width="2" color="#66CCFF" flags="print" date="D:20150731152640+01'00'" page="53"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:14.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p dir="ltr"
><span style="font-family:Arial"
>THIS_IS_THE_FIRST_PART </span
><span style="font-family:Arial"
>THIS_IS_THE_SECOND_PART</span
></p
></body
></contents-richtext
></freetext

The second map column picks up only the second string (here: THIS_IS_THE_SECOND_PART). Can someone please help? How do I write an appropriate map to gather both pieces of information with SAS?

PS: I'm pretty sure that SAS XML Mapper can't solve this issue either; I found someone on the web with the same problem who was using a map created by that tool.

PS2: The path type is XPath 1.0. I gave string-join a try and got this error:

ERROR: invalid character in Xpath expression
ERROR: Xpath construct string-join(/xfdf/annots/freetext/contents-richtext/body/p/span, '')
for column var2 is an invalid, unrecognized, or unsupported form

EDIT: Added the HTML tag, since <p> and <span> are HTML elements.

stat
  • Can you use `string-join` in SAS's implementation of XPATH? Like in [this question](http://stackoverflow.com/questions/21996965/concatenate-multiple-node-values-in-xpath)? – Joe Aug 05 '15 at 20:32
  • Many thanks @Joe, but I don't think so, or at least I don't know how to implement it. In the question you kindly linked, he had to concatenate two elements with different names (ELEMENT4, ELEMENT5). Like him, I have to concatenate two strings inside the same parent, but my strings sit in two elements with the same name (SPAN, SPAN). Any idea? Thanks again. – stat Aug 06 '15 at 07:13

1 Answer


I'm answering my own question: I found a fairly good solution, but if anyone has a better version, please post it.

I found out that SAS XML maps support only XPath 1.0, not XPath 2.0. In XPath 1.0 this step can be performed within a single expression, provided you know the number of repeated nodes in advance, using concat(/xxx/xxx[1], ' ', /xxx/xxx[2]).

Sadly this function does not work in a SAS XML map either; if you try it you get the error ERROR: invalid character in Xpath expression.

But I don't need a perfect format, since I can post-process the data I retrieve. So in the map I created one variable for each possible position of the repeated <span>, in this way:

<COLUMN name='vars1'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[1]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>

<COLUMN name='vars2'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[2]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>

<COLUMN name='vars3'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[3]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>

I wrote 6 of these blocks, even though I encountered only 2 <span> elements, to make the code as general as possible. Then I concatenated those string variables in a DATA step.
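For anyone who wants to see the concatenation step, here is a minimal sketch of what that DATA step could look like. The dataset names comments_raw and comments and the output variable comment are hypothetical, and the first step only fakes the table that the XML map would return (in real use it would come from the XML libname engine with the map above), assuming vars1 to vars6 are the six mapped columns and that unused positions come back blank.

```sas
/* Stand-in for the table read through the XML map (assumption):
   vars1-vars6 are the columns mapped to span[1]..span[6]. */
data comments_raw;
    infile datalines dsd truncover;
    length vars1-vars6 $60;
    input vars1-vars6;
    datalines;
THIS_IS_THE_FIRST_PART,THIS_IS_THE_SECOND_PART,,,,
THE_COMMENT_TO_EXPORT_IS_THIS_STRING,,,,,
;
run;

/* Join the pieces into one variable and drop the intermediate columns. */
data comments;
    set comments_raw;
    length comment $365;   /* 6 x 60 characters plus 5 separating blanks */
    /* CATX trims each piece and skips the blank ones */
    comment = catx(' ', of vars1-vars6);
    drop vars1-vars6;
run;
```

CATX puts a single blank between the pieces; if Acrobat ever splits a comment in the middle of a word, CATS (no separator) might be the better choice, so check this against your own data.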

stat