
I've been using the gate.ac.uk GUI to text-mine data and am now attempting to use its machine learning module. To do so, I've created several XML schemas to load into GATE. Here is one example:

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- XSchema definition for Condition -->
  <element name="Condition">
    <complexType>
      <attribute name="attrb_ConditionStatus" use="optional" value="other">
        <simpleType>
          <restriction base="string">
            <enumeration value="value_condition"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>

I've created a similar schema for each attribute that I want to annotate (one more example is shown after the steps below). These are the steps I carry out after creating the schemas:

1. I load the 'Schema Annotation Editor' for these purposes and then load the customized schemas through the 'Language Resources' menu item.
2. I also load the documents and the corpus.
3. I then run ANNIE.
4. I can see the customized schemas in the Annotations tab of the document.
5. I annotate terms with my custom annotations.
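For reference, the additional schemas mirror the Condition example above. The Manufacturer one is essentially the following (I have simplified the attribute name and enumeration value here, but the structure is the same):

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- XSchema definition for Manufacturer -->
  <element name="Manufacturer">
    <complexType>
      <attribute name="attrb_ManufacturerStatus" use="optional" value="other">
        <simpleType>
          <restriction base="string">
            <enumeration value="value_manufacturer"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>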

Now I want to run machine learning via the 'Learning - Batch Learning PR' plugin. I've added the processing resource to my application pipeline. My issue concerns the creation of the machine learning configuration file/schema: I have searched the internet but could not get a good idea of how to create it correctly. I have looked at various examples; here is my attempt:

<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="true"/>
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.4"/>
  <multiClassification2Binary method="one-vs-others"/>
  <EVALUATION method="holdout" ratio="0.66"/>
  <ENGINE nickname="PAUM" implementationName="PAUM"
        options="-p 50 -n 5 -optB 0.3"/>
  <DATASET>
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <ATTRIBUTELIST>
       <NAME>ManType</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Manufacturer</TYPE>
       <FEATURE>category</FEATURE>
       <RANGE from="-2" to="2"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
       <NAME>ModelType</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Model</TYPE>
       <FEATURE>orth</FEATURE>
       <RANGE from="-2" to="2"/>
    </ATTRIBUTELIST>
    <ATTRIBUTE>
       <NAME>Class1</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Manufacturer</TYPE>
       <FEATURE>majorType</FEATURE>
       <POSITION>0</POSITION>
    </ATTRIBUTE>
    <ATTRIBUTE>
       <NAME>Class2</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Model</TYPE>
       <FEATURE>type</FEATURE>
       <POSITION>0</POSITION>
       <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>

I want the machine learning algorithm to learn to annotate the Manufacturer and Model types, which are also custom annotations that I created via schemas. My first question is whether the ML config structure looks correct. I add a new Corpus Pipeline, add the Batch Learning PR, select 'Evaluation' mode, and then run the application on my training document. This is the output:

The number of threads used is 1
** Evaluation mode started:
Hold-out test: runs=1, ratio of training docs is 0.66
Split, k=1, trainingNum=0.
HOLDOUT Fold 0:   (correct, partialCorrect, spurious, missing)= (0.0, 0.0, 0.0, 0.0);  (precision, recall, F1)= (0.0, 0.0, 0.0);  Lenient: (0.0, 0.0, 0.0)

  *** Averaged results for each label over 1 runs as:

Results of single label:

Overall results as:
  (correct, partialCorrect, spurious, missing)= (0.0, 0.0, 0.0, 0.0);  (precision, recall, F1)= (0.0, 0.0, 0.0);  Lenient: (0.0, 0.0, 0.0)

This learning session finished!

The output suggests that something is not configured correctly: either the ML configuration file or the pipeline I've created for these purposes. If anyone can share some insight on this matter I would be grateful. Again, I have searched the internet high and low and read several manuals and presentations on machine learning from gate.ac.uk, but it still seems quite ambiguous to me.
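For comparison, my understanding from the GATE user guide examples is that the DATASET normally has a single <CLASS/> attribute pointing at one annotation type (often called Mention) whose 'class' feature holds the label, with the other attributes drawn from Token features. Sketched from memory, and possibly misread on my part, that looks something like this:

  <DATASET>
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <ATTRIBUTELIST>
       <NAME>Form</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Token</TYPE>
       <FEATURE>string</FEATURE>
       <RANGE from="-2" to="2"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
       <NAME>Orth</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Token</TYPE>
       <FEATURE>orth</FEATURE>
       <RANGE from="-2" to="2"/>
    </ATTRIBUTELIST>
    <ATTRIBUTE>
       <NAME>Class</NAME>
       <SEMTYPE>NOMINAL</SEMTYPE>
       <TYPE>Mention</TYPE>
       <FEATURE>class</FEATURE>
       <POSITION>0</POSITION>
       <CLASS/>
    </ATTRIBUTE>
  </DATASET>

I could not work out how this maps onto two separate annotation types such as Manufacturer and Model, which is partly why I am unsure about my DATASET section above.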

Regards Ofer
