
I have a Chinese product dataset with around fifty thousand items and 1,240 classes, and I use thirty-five thousand of the items to fine-tune BERT-Base, Chinese. But I get very low accuracy on the dataset (accuracy 0.4%, global_step = 32728). I don't know where I went wrong. Could you help me?

I have modified the DataProcessor and created my own data processor:


class CustProcessor(DataProcessor):
  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    # Return the actual class labels for your classification task here
    return ['图书杂志--工业技术--一般工业技术', ...]
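For reference, here is a minimal sketch of what `_create_examples` and `_read_tsv` typically look like in `run_classifier.py`-style processors. The column order (label first, text second) is an assumption; it must match how your TSV files are actually laid out, since a mismatch here silently trains on the wrong labels:

```python
import csv

class InputExample:
    """Lightweight stand-in for the InputExample class in run_classifier.py."""
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

def read_tsv(path):
    # Read a tab-separated file into a list of rows.
    with open(path, encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))

def create_examples(lines, set_type):
    # Assumed layout: column 0 = label, column 1 = raw text.
    examples = []
    for i, line in enumerate(lines):
        guid = "%s-%d" % (set_type, i)
        examples.append(InputExample(guid=guid, text_a=line[1], label=line[0]))
    return examples
```

A quick check worth doing: print a few examples from `create_examples` and verify that `label` really holds the category string and `text_a` the product text, not the other way around.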

and I use the hyperparameters below to train the model.

export DATA_DIR=data
export BERT_BASE_DIR=vocab_file/chinese

python3 run_classifier.py \
  --task_name=CUST \
  --do_train=true \
  --do_eval=true \
  --data_dir=$DATA_DIR/ \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=output/
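One sanity check on these numbers (using the figures stated above; the exact training-set size may differ slightly): with 35,000 examples, batch size 32, and 3 epochs, the run should take roughly 3,281 optimizer steps, yet the reported global_step is 32,728, about ten times that. It may be worth confirming which epoch count and dataset size actually went into the run:

```python
# Expected optimizer steps for the run described in the question.
num_examples = 35000   # training items (from the question)
batch_size = 32        # --train_batch_size
epochs = 3.0           # --num_train_epochs

steps = int(num_examples / batch_size * epochs)
print(steps)  # -> 3281
```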

This is the BERT config file. Do I need to change anything in it?

{
  "attention_probs_dropout_prob": 0.1, 
  "directionality": "bidi", 
  "hidden_act": "gelu", 
  "hidden_dropout_prob": 0.1, 
  "hidden_size": 768, 
  "initializer_range": 0.02, 
  "intermediate_size": 3072, 
  "max_position_embeddings": 512, 
  "num_attention_heads": 12, 
  "num_hidden_layers": 12, 
  "pooler_fc_size": 768, 
  "pooler_num_attention_heads": 12, 
  "pooler_num_fc_layers": 3, 
  "pooler_size_per_head": 128, 
  "pooler_type": "first_token_transform", 
  "type_vocab_size": 2, 
  "vocab_size": 21128
}

When I use other models, such as an SVM, the accuracy is around 85%, but the accuracy for BERT is far lower.

  • Do I need to do Word Segmentation? Or the module will do it automatically? – Zhu Yun Jun 09 '19 at 11:59
  • The segmentation is done by the BERT. – Ashwin Geet D'Sa Jul 03 '19 at 15:57
  • Do check if you have the test data in the right format, its slightly different than train and dev data. – Ashwin Geet D'Sa Jul 03 '19 at 15:57
  • @Ashwin Geet D'Sa The test data and the train/dev data are in the same format. What is the correct format for the test data? – Zhu Yun Jul 09 '19 at 03:29
  • I am not sure about the CUST task, but with respect to CoLA, the train and dev files have no header (column names); they start directly with the data and have 4 columns: ID, class, segment, text. The test TSV file, however, must have a header and only 2 columns: ID and the text. – Ashwin Geet D'Sa Jul 09 '19 at 09:03
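Following the CoLA-style convention described in the last comment (an assumption for a custom task; verify it against your processor's `_create_examples`), the two layouts can be sketched like this, with placeholder labels and text for illustration:

```python
import csv

# train.tsv / dev.tsv: no header, four columns (ID, class, segment, text).
with open("train.tsv", "w", encoding="utf-8", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["1", "图书杂志--工业技术--一般工业技术", "a", "placeholder product text"])

# test.tsv: header row required, only two columns (ID, text).
with open("test.tsv", "w", encoding="utf-8", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["index", "sentence"])
    w.writerow(["1", "placeholder product text"])
```

If test.tsv is written in the four-column train format instead, the processor may read the wrong column as text, which alone can produce near-zero accuracy.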

0 Answers