I am using an Arch Linux system with KDE Plasma. I have an XML file of approximately 50 MB that I need to parse. The file uses custom tags.

Example XML:

<JMdict>
   <entry>
      <ent_seq>1000000</ent_seq>
      <r_ele>
         <reb>ヽ</reb>
      </r_ele>
      <sense>
         <pos>&unc;</pos>
         <gloss g_type="expl">repetition mark in katakana</gloss>
      </sense>
   </entry>
</JMdict>

I have tried many solutions suggested on Stack Overflow, but none of them worked, and some packages, such as xml-stream and xml2json, could not be installed on my system. I decided to use xml2js, since most answers recommend it, but got the same result. How can I use it correctly? I am using this code, but it always prints undefined:

const fs = require('fs-extra');
const xml2js = require('xml2js');
const parser = new xml2js.Parser();

const path = "test.xml";

fs.readFile(path, {encoding: 'utf-8'}, function(error, data) {
     parser.parseString(data, function(err, res) {
         console.log(res);
     });
});

Result: undefined

Is there any way to handle an XML file by hand (without a package)?

jonrsharpe
Kaan Taha Köken
    Your "XML" file is not well-formed: it contains an undefined entity reference `&unc;`. So parsing *should* fail. – Michael Kay Jan 01 '19 at 18:58
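To see which references trip the parser, a small check like the following can help. This is only a sketch: findUndefinedEntities is a hypothetical helper, not part of xml2js, and it assumes the file has no DTD that declares extra entities, so only the five entities predefined by XML count as known.

```javascript
// Hypothetical helper: list named entity references that XML does not predefine.
// Assumption: the document carries no DTD declaring additional entities.
const PREDEFINED = new Set(['amp', 'lt', 'gt', 'apos', 'quot']);

function findUndefinedEntities(xml) {
    const found = new Set();
    // Matches named references such as &unc;
    // Numeric references such as &#12541; start with "#" and are not matched.
    const pattern = /&([A-Za-z_][\w.-]*);/g;
    let match;
    while ((match = pattern.exec(xml)) !== null) {
        if (!PREDEFINED.has(match[1])) {
            found.add(match[1]);
        }
    }
    return [...found];
}

const sample = '<sense><pos>&unc;</pos><gloss>a &amp; b</gloss></sense>';
console.log(findUndefinedEntities(sample)); // [ 'unc' ]
```

Running it over the file contents before parsing shows every entity name that would need to be escaped or declared.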

3 Answers

The answer is below, with a working example.

const fs = require('fs');
const slash = require('slash');
const xml2js = require('xml2js');

const parser = new xml2js.Parser();

// Path to the XML file next to this script (slash converts Windows backslashes).
const filename = slash(__dirname + '/foo.xml');

fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        console.log('Error reading file:');
        console.log(err);
        return;
    }

    // Escape every "&" that does not start a predefined XML entity
    // (&apos; &quot; &gt; &lt; &amp;) or a character reference (&#...;),
    // so that undefined entities such as &unc; become literal text.
    const escaped = data.replace(/&(?!(?:apos|quot|[gl]t|amp);|#)/g, '&amp;');

    parser.parseString(escaped, function (err, result) {
        if (err) {
            console.log('Error parsing XML:');
            console.log(err);
        } else {
            console.log(JSON.stringify(result));
            console.log('Done');
        }
    });
});

The key step is this replacement:

data.replace(/&(?!(?:apos|quot|[gl]t|amp);|#)/g, '&amp;')

The only problem in your file is the undefined entity &unc; in this tag:

<pos>&unc;</pos>

Reference and thanks to @tim.
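To make the effect of that regular expression concrete, here is a minimal sketch that applies it to a few sample strings, using '&amp;' as the replacement string as in the full listing. The wrapper name escapeStrayAmpersands is hypothetical; only the regex comes from the answer.

```javascript
// Hypothetical wrapper around the regex from the answer.
// It escapes any "&" that does not begin a predefined XML entity
// or a numeric character reference.
function escapeStrayAmpersands(xml) {
    return xml.replace(/&(?!(?:apos|quot|[gl]t|amp);|#)/g, '&amp;');
}

// The undefined entity is escaped, so the parser sees plain text.
console.log(escapeStrayAmpersands('<pos>&unc;</pos>'));    // <pos>&amp;unc;</pos>

// Predefined entities are left untouched.
console.log(escapeStrayAmpersands('a &amp; b &lt; c'));    // a &amp; b &lt; c

// Numeric character references are left untouched too.
console.log(escapeStrayAmpersands('<reb>&#12541;</reb>')); // <reb>&#12541;</reb>
```

Note the trade-off: after parsing, the pos element contains the literal text "&unc;" rather than whatever the entity was meant to expand to.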

R.G.Krish

I think your problem is unescaped characters in your XML data.

I was able to get your example to work by using this:

XML data:

<JMdict>
    <entry>
        <ent_seq>1000000</ent_seq>
        <r_ele>
            <reb>ヽ</reb>
        </r_ele>
        <sense>
            <pos>YOUR PROBLEM WAS HERE</pos>
            <gloss g_type="expl">repetition mark in katakana</gloss>
        </sense>
    </entry>
</JMdict>

Node.js code:

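const fs = require('fs-extra');
const xml2js = require('xml2js');
const parser = new xml2js.Parser();

const path = "test.xml";

fs.readFile(path, {encoding: 'utf-8'}, function(error, data) {
     // Stop early if the file could not be read.
     if (error) return console.error(error);

     parser.parseString(data, function(err, res) {
         // res is undefined when parsing fails, so check err first.
         if (err) return console.error(err);

         console.log(JSON.stringify(res.JMdict.entry, null, 4));
     });
});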

In situations like this, when I know the code should work, I always check the input data for possible issues.

tamak

The way you use the xml2js package is fine. However, your XML is not well-formed.

If you log the error in the callback, you can see what is causing the problem:

fs.readFile(path, {encoding: 'utf-8'}, function(error, data) {
     parser.parseString(data, function(err, res) {
         if (err) console.log(err);

         console.log(res);
     });
});

You'll see that the line <pos>&unc;</pos> causes the problem. If you fix the undefined entity reference, the parser should work fine.
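One possible way to fix the reference, if you want the expanded text rather than an escaped "&unc;", is to substitute known entity names before handing the string to the parser. This is just a sketch: expandKnownEntities and the ENTITY_TEXT table are hypothetical, and the text given for "unc" is an assumption for illustration; the real values would have to come from wherever your file's entities are defined.

```javascript
// Hypothetical lookup table. The value for "unc" is an assumption for
// illustration; take the real text from your file's entity definitions.
const ENTITY_TEXT = { unc: 'unclassified' };

// Hypothetical helper: replace named references found in the table with
// their text and leave everything else (e.g. &amp;) exactly as it is.
function expandKnownEntities(xml, table) {
    return xml.replace(/&([A-Za-z_][\w.-]*);/g, function (whole, name) {
        return Object.prototype.hasOwnProperty.call(table, name) ? table[name] : whole;
    });
}

console.log(expandKnownEntities('<pos>&unc;</pos>', ENTITY_TEXT));
// <pos>unclassified</pos>
```

The result can then be passed to parser.parseString in place of the raw file contents. The replacement values are inserted verbatim, so they should not themselves contain "<" or "&".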

Ray Chan