
Can anyone briefly explain the HtmlAnnotator, HtmlConverter and TEIViewWriter, with some examples? I want to create annotations in the initial view.

Awaiting an answer.

Main Script:

 PACKAGE uima.ruta.example;
 SCRIPT uima.ruta.example.Html;
 Document{-> EXEC(Html)};
 WORDLIST JOURNALNAMELIST='JournalName.txt';
 WORDLIST CITYPUBLIST='CITYPUB.txt';
 DECLARE JOURNALNAME;
 DECLARE CITYPUB;
 Document{ -> MARKFAST(JOURNALNAME, JOURNALNAMELIST)};
 Document{ -> MARKFAST(CITYPUB, CITYPUBLIST)};
 DECLARE Reference;
 "<a name=para(.+?)>(.+?)</a>"-> 2=Reference;
 DECLARE FirstToken, LastToken;

 BLOCK(InRef) Reference{}
 {
 ANY{POSITION(Reference,1) -> MARK(FirstToken)};
 Document{-> MARKLAST(LastToken)};
 }
 DECLARE FIRSTWORD;
 FirstToken PERIOD CW {->MARK(FIRSTWORD)};

Html Script:

 PACKAGE uima.ruta.example;
 ENGINE utils.HtmlAnnotator;
 ENGINE utils.HtmlConverter;
 ENGINE utils.HtmlViewWriter;
 TYPESYSTEM utils.HtmlTypeSystem;
 TYPESYSTEM utils.SourceDocumentInformation;
 Document{-> EXEC(HtmlAnnotator)};
 Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView","outputView" = "plain"),
 EXEC(HtmlConverter)};
 Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain","outputView" = "_InitialView", "output" = "E:/ruta-2.4.0-source-release/ruta-2.4.0/example-projects/TextRulerExample/output"),
 EXEC(HtmlViewWriter)};

Sample HTML input file (manually converted to HTML by changing the file extension):

<html>
<head>
 <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
 <meta name=Generator content="Microsoft Word 14 (filtered)">
 <style>
 <!--
/* Font Definitions */
 @font-face
 {font-family:Calibri;
 panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
 {margin-top:0in;
 margin-right:0in;
 margin-bottom:10.0pt;
 margin-left:0in;
 line-height:115%;
 font-size:11.0pt;
 font-family:"Calibri","sans-serif";}
span.DAZZLEFN
 {mso-style-name:DAZZLEFN;}
span.DAZZLELN
 {mso-style-name:DAZZLELN;
 color:#92D050;}
.MsoChpDefault
 {font-family:"Calibri","sans-serif";}
.MsoPapDefault
 {margin-bottom:10.0pt;
 line-height:115%;}
@page WordSection1
 {size:8.5in 11.0in;
 margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
 {page:WordSection1;}
-->
</style>

</head>

<body lang=EN-US>

<div class=WordSection1>

<p class=MsoNormal><a name=para0>REFERENCES</a></p>

 <p class=MsoNormal><a name=para1>1.����������� Lawrence RA. A        review of the
 medical benefits and contraindications to breastfeeding in the United    States
 [Internet] . Arlington (VA): National Center for Education in Maternal and
 Child Health; 1997 Oct [cited 2000 Apr 24]. p. 40. Available from:
 www.ncemch.org/pubs/PDFs/Welcometojungle.pdf.</a></p>

 <p class=MsoNormal><a name=para2>2.����������� Shishido A.  Retraction notice:
 Effect of platinum compounds on murine lymphocyte mitogenesis [Retraction of
 Alsabti EA, Ghalib ON, Salem MH. In: Jpn J Med Biol 1979 Apr; 32(2):53-65].      Jpn
 J Med Sci Biol 1980 Aug;33(4):235-237.</a></p>

 <p class=MsoNormal><a name=para3>3.����������� Leist TP,  Zinkernagel RM.
 Effects of treatment with IL-2 receptor specific monoclonal antibody in mice
 [letter] [Retraction of Leist TP, Kohler M, Eppler M, Zinkernagel RM. In: J
 Immunol 1989 Jul 15; 143(2): 628-32]. J Immunol 1990 Apr 1;144(7):2847.</a>  </p>

 <p class=MsoNormal><a name=para4>4.����������� Alsabti EA, Ghalib     ON, Salem MH.
 Effect of platinum compounds on murine lymphocyte mitogenesis [Retracted by
 Shishido A. In: Jpn J Med Sci Biol 1980 Aug; 33(4):235-7]. Jpn J Med Sci  Biol
 1979 Apr;32(2):53-65.</a></p>

 <p class=MsoNormal><a name=para5>5.����������� Tidy JA, Parry GC, Ward P,
 Coleman DV, Peto J, Malcolm AD, Farrell PJ. High rate of papillomavirus type 16
 infection in cytologically normal cervices [letter] [Retracted by Tidy J,
 Farrell PJ. In: Lancet 1989 Dec 23-30:2(8678-8679):1535]. Lancet 1989 Feb   25;1(8635):434.</a></p>

 <p class=MsoNormal><a name=para6>6.����������� Magni F, Rossoni G,  Berti F.
 BN-52021 protects guinea-pig from heard anaphylaxis. Pharm Res Commun 1988
 Dec;20 Suppl 5:75-78.</a></p>

 <p class=MsoNormal><a name=para7>7.����������� Garvia EE, DeHaven ED. An
 experimental analysis of response acquisition and elimination with positive
 reinforcers. Behav Neuropsychiatry 1975 a April-1976 May;7(1-12):71-78.</a>  </p>

 <p class=MsoNormal><a name=para8>8.����������� Mueller FO,   Schindler RD. Annual
 survey of football injury research 1931-1985. [place unknown]: American
 Football Coaches Assn; 1986. 24 p.</a></p>

 <p class=MsoNormal><a name=para9>9.����������� Stern, Michael P.   National
 Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases.   Diabetes
 in America: diabetes data compiled 1984.. [Bethesda (MD)]: The Institute; 1985
 Aug. Diabetes in Hispanic Americans. Chapter 9. (NIH publication; no. 86- 1468).</a></p>

 <p class=MsoNormal><a name=para10>10.��������� Vivian, Valerie L,      editor. Child
 abuse and neglect: a medical community response. 1st AMA National   Conference on
 Child Abuse and Neglect; 1984 March 30-June 31; Chicago. Chicago: American
 Medical Association; 1985. 256 p.</a></p>

 <p class=MsoNormal><a name=para11>11.��������� Popper, Hans, et al.,   editors.
 Structural carbohydrates in the liver: proceedings of the 34th Falk   Symposium;
 1982 oct 12-19; Basil, Switzerland.Boston: MTB Press; 1983. 701 p.</a></p>

 <p class=MsoNormal><a name=para12></a>&nbsp;</p>

 </div>

 </body>

 </html>

1 Answer


Note that your example script does not contain the mentioned TEIViewWriter. The problem is the same, however.

Unfortunately, the example script has an error:

The line

Document{ -> CONFIGURE(ViewWriter, "inputView" = "plain",...

should read

Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain",

... then the NPE is gone. There could be another exception if the input text is not parseable by the HtmlParser, resulting in a missing Sofa in the XMI file. Wrapping the text in could help here.

The files HtmlConverter.ruta and TEIConverter.ruta here are indeed good examples for these components. The HtmlAnnotator creates annotations for HTML and XML tags/elements. The HtmlConverter removes all HTML/XML tags, stores the resulting text in a new view, and recalculates the offsets of the annotations. The TEIViewWriter is just a ViewWriter with a specific type system, which copies a specific view to a new CAS and stores it. Together, these components can convert a TEI/HTML/XML text to plain text with annotations for the XML markup.
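To make "removes the tags and recalculates the offsets" concrete, here is a minimal Python sketch of the idea. It is not the Ruta implementation: the names `strip_tags_with_mapping` and `map_span` are hypothetical, and the tag pattern is a deliberate simplification that ignores comments, scripts, and `>` inside attribute values.

```python
import re

# Simplified tag pattern (assumption): anything between '<' and the next '>'.
TAG = re.compile(r"<[^>]+>")

def strip_tags_with_mapping(html):
    """Return (plain_text, offset_map).

    offset_map[i] is the offset in the plain text that corresponds to
    offset i in the original markup (one extra entry for the end offset).
    """
    plain = []
    offset_map = [0] * (len(html) + 1)
    pos = 0  # current offset in the markup
    out = 0  # current offset in the plain text
    for match in TAG.finditer(html):
        # Copy the text in front of the tag character by character.
        for i in range(pos, match.start()):
            offset_map[i] = out
            plain.append(html[i])
            out += 1
        # Every offset inside the tag collapses to the same plain offset.
        for i in range(match.start(), match.end()):
            offset_map[i] = out
        pos = match.end()
    # Copy whatever text follows the last tag.
    for i in range(pos, len(html)):
        offset_map[i] = out
        plain.append(html[i])
        out += 1
    offset_map[len(html)] = out
    return "".join(plain), offset_map

def map_span(span, offset_map):
    """Translate a (begin, end) span from markup offsets to plain offsets."""
    begin, end = span
    return offset_map[begin], offset_map[end]

html = "<p><a name=para1>1. Lawrence RA.</a></p><a name=para12></a>"
plain, omap = strip_tags_with_mapping(html)

# Span of the first <a> element, including its tags, in the markup.
first = (html.index("<a name=para1>"), html.index("</a>") + len("</a>"))
# Span of the empty <a name=para12></a> element.
empty = (html.index("<a name=para12>"), len(html))

print(repr(plain))            # '1. Lawrence RA.'
print(map_span(first, omap))  # (0, 15)
print(map_span(empty, omap))  # (15, 15)
```

The element annotation for the first anchor ends up covering exactly the reference text in the plain view. The empty anchor maps to a zero-length span, which is the kind of case the "illegal annotation offset mapping" warning discussed in the comments below refers to.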

The documentation contains more information, e.g., about the configuration parameters.

DISCLAIMER: I am a developer of UIMA Ruta

Peter Kluegl
  • May 23, 2016 4:03:57 PM org.apache.uima.ruta.engine.HtmlConverter mapAnnotations(454) WARNING: illegal annotation offset mapping May 23, 2016 4:03:57 PM org.apache.uima.ruta.engine.HtmlConverter mapAnnotations(454) WARNING: illegal annotation offset mapping. I'm receiving this message in the console. Will it affect my output? – Sugunalakshmi Pagemajik May 23 '16 at 11:45
  • I'm receiving two XMI files in the output: Reference.html.xmi and output.xmi. – Sugunalakshmi Pagemajik May 23 '16 at 11:45
  • Do I need to call the main Ruta script from the Html script, or is it enough to add the type system of the main script file? I ask because the type (e.g., DZC_CITYPUB) is not annotated in Reference.html.xmi. – Sugunalakshmi Pagemajik May 24 '16 at 06:58
  • The warning is logged if the converter tries to produce an annotation with illegal offsets, e.g., length 0 because all text was removed. In my use cases this never caused a problem. – Peter Kluegl May 24 '16 at 13:24
  • Can you elaborate on the setup with the two files? The main script imports and calls the html script as a script (not an AE) and executes it with CALL? – Peter Kluegl May 24 '16 at 13:26
  • When I import and call the Html script from my main script, I find that the number of occurrences is reduced and some tags are missing. – Sugunalakshmi Pagemajik May 25 '16 at 10:04
  • Which tags are missing? I did not find any problems yet when reproducing it. – Peter Kluegl May 26 '16 at 15:09
  • CITYPUB and FIRSTWORD are missing. FIRSTWORD is not tagged due to the space (���������) after the FirstToken. – Sugunalakshmi Pagemajik May 30 '16 at 06:07
  • Ah ok, I assume you want to create the annotations like CITYPUB in the plain view and not in the initial view? – Peter Kluegl May 30 '16 at 10:06
  • The first thing is to remove the line `Document{-> RETAINTYPE(SPACE,BREAK)};` in the Html script. EXEC is not sensitive to the filtering settings anyway. The annotations are added to the _InitialView. With EXEC you can apply an analysis engine to a different view, but not with CALL. If you want to apply the rules on the plain view, you need some sofa mapping. – Peter Kluegl May 30 '16 at 10:11
  • The space (���������) is not considered a space. It is tagged as a Special. Why? – Sugunalakshmi Pagemajik Jun 01 '16 at 02:53
  • What do I need to do if I want to create the annotations in the plain text view? – Sugunalakshmi Pagemajik Jun 21 '16 at 09:25
  • Is it an NBSP, or which kind of space is it? NBSP should be SPACE, and NBSP is defined as: `\u00A0|\u202F|\uFEFF|\u2007|\u180E| |&NBSP;` – Peter Kluegl Jun 23 '16 at 09:25
  • That issue was solved. Now, what do I need to do if I want to create the annotations in the plain text view? – Sugunalakshmi Pagemajik Jun 24 '16 at 09:36
  • You said that I need some sofa mapping to apply the rules on the plain view. Can you explain it briefly? – Sugunalakshmi Pagemajik Jun 28 '16 at 03:47
  • Sofa mappings are described in the UIMA documentation, e.g., [here](https://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.mvs.sofa_name_mapping). If there are problems, you should rather create a new question. – Peter Kluegl Jul 07 '16 at 07:41
  • OK, I will ask in a new question. – Sugunalakshmi Pagemajik Jul 11 '16 at 03:40
  • The script works perfectly in a Ruta project, but if I use it in a Maven project I receive some errors. – Sugunalakshmi Pagemajik Jul 19 '16 at 03:30
  • Which errors? Does this refer to the new question? If yes, you could add the errors there. – Peter Kluegl Jul 19 '16 at 06:59
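
On the question in the comments of why the characters after the reference number were not treated as a space, one possible explanation can be checked with a short Python sketch. It rests on assumptions: the character class below copies only the code points quoted in the NBSP definition above (it is not Ruta's lexer), `classify` is a hypothetical helper, and the idea that the windows-1252 file was read as UTF-8 is a guess that the thread does not confirm.

```python
import re

# Code points copied from the NBSP definition quoted in the comments.
# The pattern is an illustration, not Ruta's actual lexer rule.
NBSP_CHARS = re.compile("[\u00A0\u202F\uFEFF\u2007\u180E]")

def classify(ch):
    """Rough classification of a single character (hypothetical helper)."""
    if NBSP_CHARS.fullmatch(ch):
        return "NBSP-like"
    if ch.isspace():
        return "other whitespace"
    return "not whitespace"

# Bytes as they would appear in a windows-1252 file: "1." plus two NBSPs.
raw = b"1.\xa0\xa0"

as_cp1252 = raw.decode("windows-1252")           # the declared charset
as_utf8 = raw.decode("utf-8", errors="replace")  # wrong charset (assumed)

print([classify(c) for c in as_cp1252[2:]])  # ['NBSP-like', 'NBSP-like']
print([classify(c) for c in as_utf8[2:]])    # ['not whitespace', 'not whitespace']
```

Decoded with the declared charset, the bytes become U+00A0 and fall under the quoted definition. Decoded as UTF-8 with replacement, they become U+FFFD, which is not whitespace under either check. If that is what happened here, reading the input with the encoding named in the `meta` tag would be the thing to try.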