I used some city and publisher names in a wordlist. As I understand it, a wordlist annotates all occurrences of any list item in a document. But I found a problem: the number of occurrences increased or decreased when I changed the order of the entries in the list.

For Example:

Script:

 WORDLIST CITYPUBLIST='CITYPUB.txt';
 DECLARE CITYPUB;
 Document{ -> MARKFAST(CITYPUB, CITYPUBLIST)};
 WORDLIST JournalNameLIST='JournalName.txt';
 DECLARE JournalName;
 Document{ -> MARKFAST(JournalName, JournalNameLIST)};

Wordlist(CITYPUB.txt):

Arlington (VA): National Center for Education in Maternal and Child Health
[place unknown]: American Football Coaches Assn
[Bethesda (MD)]: The Institute
Chicago. Chicago: American Medical Association
Basil, Switzerland.Boston: MTB Press
St. Louis, MO. Washington: The Society
Chicago: University of Chicago Press

JournalName.txt:

Jpn J Med Sci Biol
J Immunol
Lancet
Pharm Res Commun
Behav Neuropsychiatry
J Pharm Pharm Sci
Cochrane Database Syst Rev 

Sample Input:

1.Lawrence RA. A review of the medical benefits and contraindications to breastfeeding in the United States [Internet] . Arlington (VA): National Center for Education in Maternal and Child Health; 1997 Oct [cited 2000 Apr 24]. p. 40. Available from: www.ncemch.org/pubs/PDFs/Welcometojungle.pdf.
2.Shishido A. Retraction notice: Effect of platinum compounds on murine lymphocyte mitogenesis [Retraction of Alsabti EA, Ghalib ON, Salem MH. In: Jpn J Med Biol 1979 Apr; 32(2):53-65]. Jpn J Med Sci Biol 1980 Aug;33(4):235-237.
3.Leist TP, Zinkernagel RM. Effects of treatment with IL-2 receptor specific monoclonal antibody in mice [letter] [Retraction of Leist TP, Kohler M, Eppler M, Zinkernagel RM. In: J Immunol 1989 Jul 15; 143(2): 628-32]. J Immunol 1990 Apr 1;144(7):2847.
4.Alsabti EA, Ghalib ON, Salem MH. Effect of platinum compounds on murine lymphocyte mitogenesis [Retracted by Shishido A. In: Jpn J Med Sci Biol 1980 Aug; 33(4):235-7]. Jpn J Med Sci Biol 1979 Apr;32(2):53-65.
5.Meyer, Beat; Hermanns, Karl. Formaldehyde release from pressed wood products. In: Turoski, Victor, editor. Formaldehyde: analytical chemistry and toxicology. Proceedings of the symposium at the 187th meeting of the American Chemical Society; 1984 Apr 8-13; St. Louis, MO. Washington: The Society; 1985. p. 101-116.
6.Magni F, Rossoni G, Berti F. BN-52021 protects guinea-pig from heard anaphylaxis. Pharm Res Commun 1988 Dec;20 Suppl 5:75-78.
7.Garvia EE, DeHaven ED. An experimental analysis of response acquisition and elimination with positive reinforcers. Behav Neuropsychiatry 1975 April-1976 Mar;7(1-12):71-78.
8.Mueller FO, Schindler RD. Annual survey of football injury research 1931-1985. [place unknown]: American Football Coaches Assn; 1986. 24 p.
9.Stern, Michael P. National Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases. Diabetes in America: diabetes data compiled 1984. [Bethesda (MD)]: The Institute; 1985 Aug. Diabetes in Hispanic Americans. Chapter 9. (NIH publication; no. 86-1468).
10.Vivian, Valerie L, editor. Child abuse and neglect: a medical community response. 1st AMA National Conference on Child Abuse and Neglect; 1984 March 30-June 31; Chicago. Chicago: American Medical Association; 1985. 256 p.
11.Popper, Hans, et al., editors. Structural carbohydrates in the liver: proceedings of the 34th Falk Symposium; 1982 oct 12-19; Basil, Switzerland.Boston: MTB Press; 1983. 701 p.
12.Tidy JA, Parry GC, Ward P, Coleman DV, Peto J, Malcolm AD, Farrell PJ. High rate of papillomavirus type 16 infection in cytologically normal cervices [letter] [Retracted by Tidy J, Farrell PJ. In: Lancet 1989 Dec 23-30:2(8678-8679):1535]. Lancet 1989 Feb 25;1(8635):434.
13.Thomas Bernard, A Party for Boris, in Histrionics: Three Plays, trans. Peter K. Jansen and Kenneth Northcott (Chicago: University of Chicago Press, 1990).

When I tested it, I got CITYPUB(4). If I add an empty line before the first list item, I get CITYPUB(5).

Thanks in advance.

1 Answer

Most likely this file starts with a byte order mark (BOM). Can you check whether there is a BOM, e.g., with Notepad++? There is an open issue in UIMA Ruta: files with BOMs are not supported right now (UIMA Ruta 2.4.0). Either remove the BOM or add a dummy line (an empty line) at the beginning of the file.
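If Notepad++ is not at hand, the check can also be scripted. The following is a minimal sketch in Python, not part of UIMA Ruta; it assumes the wordlist is saved as UTF-8, and the helper names `has_utf8_bom` and `strip_utf8_bom` are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the raw bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM; other input is unchanged."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    # Assumption: CITYPUB.txt is in the current directory and is UTF-8 encoded.
    from pathlib import Path

    path = Path("CITYPUB.txt")
    if path.exists():
        raw = path.read_bytes()
        if has_utf8_bom(raw):
            path.write_bytes(strip_utf8_bom(raw))
            print("BOM removed")
        else:
            print("no BOM found")
```

With a BOM present, the invisible mark becomes part of the first line of the file, which would explain why the first entry stops matching and why a dummy first line restores it.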

(I am a developer of UIMA Ruta)

Peter Kluegl
  • Yes Peter, I accept your answer, but "Chicago. Chicago: American Medical Association" and "Chicago: University of Chicago Press" are still not covered. Can I know why? – Sugunalakshmi Pagemajik May 21 '16 at 05:12
  • Similarly, for JournalName it is not capturing Jpn J Med Biol. – Sugunalakshmi Pagemajik May 21 '16 at 05:24
  • If I use J Med Biol instead of Jpn J Med Biol, it covers J Med Biol. Is there any specific reason for this? – Sugunalakshmi Pagemajik May 21 '16 at 06:40
  • First quick guess without testing: maybe it is caused by the whitespaces. Try to remove all of them in the wordlist, or activate the configuration parameter dictRemoveWS. I will try to reproduce your problem in the next few days. – Peter Kluegl May 22 '16 at 21:10
  • I tried to reproduce it. I get 7 CITYPUB annotations and 6 JournalName annotations. `Jpn J Med Biol` is not found because of the whitespaces. If I remove them in JournalName.txt, I get 9 JournalName annotations. Let me know if you need an explanation of the whitespace issue. – Peter Kluegl May 24 '16 at 13:20
  • Yes Peter, I need an explanation of the whitespace issue. – Sugunalakshmi Pagemajik May 25 '16 at 03:17
  • The entries are converted into a tree structure (trie) with one node for each character. When a text passage (token combination) is looked up in the wordlist, whitespace nodes are skipped, because whitespace is normally not visible in Ruta. Otherwise, the entries would not be robust against different whitespace combinations, e.g., two spaces in the document versus one space in the wordlist. Skipping is optional, but it still causes problems because the lookup in the tree has no backtracking. – Peter Kluegl May 25 '16 at 15:48
  • If a subtree is entered because of a skipped whitespace, then a different subtree that might be the correct one may be missed. I did not find a robust solution for this without making the lookup slower. The best solution was to programmatically remove the whitespace when loading the wordlist, using the config parameter dictRemoveWS. – Peter Kluegl May 25 '16 at 15:49
  • Is "dictRemoveWS" related only to Wordlist and Wordtable? Thanks for your prompt response, Peter. – Sugunalakshmi Pagemajik May 26 '16 at 07:15
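
The whitespace problem described in the comments above can be illustrated with a toy model. The sketch below is written in Python and is not Ruta's actual implementation: the functions `build_trie` and `greedy_lookup`, the `remove_ws` flag (standing in for the effect of dictRemoveWS), and the two made-up entries are all assumptions chosen to show how a lookup without backtracking can commit to the wrong subtree.

```python
END = object()  # sentinel key marking the end of a wordlist entry


def build_trie(entries, remove_ws=False):
    """Build a character trie of nested dicts, one node per character.

    With remove_ws=True, whitespace is stripped from each entry while
    loading (a stand-in for what dictRemoveWS is described as doing).
    """
    root = {}
    for entry in entries:
        if remove_ws:
            entry = "".join(entry.split())
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def greedy_lookup(trie, text):
    """Look up whitespace-free text without backtracking.

    A direct child is always preferred. A space node is skipped only
    when there is no direct child, and earlier choices are never revisited.
    """
    node = trie
    for ch in text:
        if ch in node:
            node = node[ch]
        elif " " in node and ch in node[" "]:
            node = node[" "][ch]
        else:
            return False
    return END in node


entries = ["a bc", "ab d"]  # hypothetical entries, not from the question

with_ws = build_trie(entries)
without_ws = build_trie(entries, remove_ws=True)

print(greedy_lookup(with_ws, "abc"))     # False: "a bc" is missed
print(greedy_lookup(without_ws, "abc"))  # True
```

In the first lookup, the walk follows `a` and then the direct child `b` (from "ab d"), so the subtree belonging to "a bc" is never tried. Once whitespace is removed at load time, both entries share the path `a`, `b` and the ambiguity disappears.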