
I want to replace rare words with _RARE_ in a JSON tree using Java.

My rareWords list contains:

late
populate
convicts

So for the JSON below

["S", ["PP", ["ADP", "In"], ["NP", ["DET", "the"], ["NP", ["ADJ", "late"], ["NOUN", "1700<s"]]]], ["S", ["NP", ["ADJ", "British"], ["NOUN", "convicts"]], ["S", ["VP", ["VERB", "were"], ["VP", ["VERB", "used"], ["S+VP", ["PRT", "to"], ["VP", ["VERB", "populate"], ["WHNP", ["DET", "which"], ["NOUN", "colony"]]]]]], [".", "?"]]]]

I should get

["S", ["PP", ["ADP", "In"], ["NP", ["DET", "the"], ["NP", ["ADJ", "_RARE_"], ["NOUN", "1700<s"]]]], ["S", ["NP", ["ADJ", "British"], ["NOUN", "_RARE_"]], ["S", ["VP", ["VERB", "were"], ["VP", ["VERB", "used"], ["S+VP", ["PRT", "to"], ["VP", ["VERB", "_RARE_"], ["WHNP", ["DET", "which"], ["NOUN", "colony"]]]]]], [".", "?"]]]]

Notice how

["ADJ","late"]

was replaced by

["ADJ","_RARE_"]

My code so far is like below:

I recursively iterate over the tree, and as soon as a rare word is found I create a new JSON array and try to replace the existing tree node with it. See the `// this Doesn't work` comment in the code below; that is where I got stuck. The tree remains unchanged outside of this function.

public static void traverseTreeAndReplaceWithRare(JsonArray tree) {
    for (int x = 0; x < tree.size(); x++) {
        if (!tree.get(x).isJsonArray()) {
            if (tree.size() == 2) {
                // Beware: this branch is reached twice for the same word
                // (once for the tag at index 0, once for the word at index 1).
                String word = tree.get(1).toString();
                word = word.replaceAll("\"", ""); // removing double quotes

                if (rareWords.contains(word)) {
                    JsonParser parser = new JsonParser();

                    // This works perfectly
                    System.out.println("Orig:" + tree);
                    JsonElement jsonElement = parser.parse("[" + tree.get(0) + "," + "_RARE_" + "]");
                    JsonArray newRareArray = jsonElement.getAsJsonArray();

                    // This works perfectly
                    System.out.println("New:" + newRareArray);

                    tree = newRareArray; // this Doesn't work
                }
            }
            continue;
        }
        traverseTreeAndReplaceWithRare(tree.get(x).getAsJsonArray());
    }
}

Code for calling the above (I use Google's Gson):

JsonParser parser = new JsonParser();
JsonElement jsonElement = parser.parse(strJSON);
JsonArray tree = jsonElement.getAsJsonArray();  
Watt
  • Why don't you just do a `strJSON.replaceAll("(late|populate|convicts)", "_RARE_")` – Leonard Brünings Apr 18 '13 at 22:13
  • +1 Sure, I am going to try that and it might work for most cases. But main motivation for asking this question was to understand/learn how to manipulate such tree. – Watt Apr 18 '13 at 22:16
  • Sorry, replaceAll() doesn't work for me because my rareWords list is 3435 entries long, and it also ends up replacing "SQ" with "_RARE_" in instances like ["SQ", "late"] – Watt Apr 18 '13 at 23:38
  • The above is happening because there is an "S." in my rare-word list. I just found it by going through all 3435 rare words. – Watt Apr 18 '13 at 23:51
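
The regex pitfall described in these comments can be reproduced in a few lines. This is only a sketch: the class name `RegexPitfall` and its two helper methods are made up for illustration. `String.replaceAll` treats its first argument as a regular expression, so the rare word `S.` matches `S` followed by any character, including `SQ`. Wrapping the word in `Pattern.quote` makes it match literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RegexPitfall {
    // Treats the rare word as a regex: "S." means 'S' followed by any character.
    static String naive(String json, String rareWord) {
        return json.replaceAll(rareWord, "_RARE_");
    }

    // Treats the rare word as literal text.
    static String quoted(String json, String rareWord) {
        return json.replaceAll(Pattern.quote(rareWord),
                               Matcher.quoteReplacement("_RARE_"));
    }

    public static void main(String[] args) {
        String json = "[\"SQ\", \"late\"]";
        System.out.println(naive(json, "S."));   // the tag SQ gets clobbered
        System.out.println(quoted(json, "S."));  // the input comes back unchanged
    }
}
```

Quoting only fixes the metacharacter problem. A plain text replacement still cannot tell a tag (first element) from a word (second element), and it still matches inside longer strings, so walking the tree remains the more reliable route.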

1 Answer


Here's a straightforward approach in C++:

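#include <fstream>
#include <iostream>   // std::cout
#include <string>     // std::wstring, std::getline
#include <utility>    // std::move
#include <vector>
#include "JSON.hpp"
#include <boost/algorithm/string/regex.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/phoenix.hpp>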

static std::vector<std::wstring> readRareWordList()
{
    std::vector<std::wstring> result;

    std::wifstream ifs("testcases/rarewords.txt");
    std::wstring line;
    while (std::getline(ifs, line))
        result.push_back(std::move(line));

    return result;
}

struct RareWords : boost::static_visitor<> {

    /////////////////////////////////////
    // do nothing by default
    template <typename T> void operator()(T&&) const { /* leave all other things unchanged */ }

    /////////////////////////////////////
    // recurse arrays and objects
    void operator()(JSON::Object& obj) const { 
        for(auto& v : obj.values) {
            //RareWords::operator()(v.first); /* to replace in field names (?!) */
            boost::apply_visitor(*this, v.second);
        }
    }

    void operator()(JSON::Array& arr) const {
        int i = 0;
        for(auto& v : arr.values) {
            if (i++) // skip the first element in all arrays
                boost::apply_visitor(*this, v);
        }
    }

    /////////////////////////////////////
    // do replacements on strings
    void operator()(JSON::String& s) const {
        using namespace boost;

        const static std::vector<std::wstring> rareWords = readRareWordList();
        const static std::wstring replacement = L"_RARE_"; // marker requested in the question

        for (auto&& word : rareWords) {
            if (word == s.value) {
                s.value = replacement;
                break; // exact match found; no need to scan further
            }
        }
    }
};

int main()
{
    auto document = JSON::readFrom(std::ifstream("testcases/test3.json"));

    boost::apply_visitor(RareWords(), document);

    std::cout << document;
}

This replaces a string value only when the whole string exactly matches an entry in the rare-word list, and it skips the first element of every array (the tag). You could make it case-insensitive, or match words inside strings, by changing the comparison in the JSON::String overload. Adapted slightly in response to the comments.

The full code including JSON.hpp/cpp is here: https://github.com/sehe/spirit-v2-json/tree/16093940
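
Since the question asks for Java, the same idea (recurse, skip element 0, mutate in place) might be sketched as follows. This is not the asker's eventual solution, which was never posted. To stay runnable without Gson it uses nested `List<Object>` values as a stand-in for `JsonArray`, and the names `RareWordReplacer`, `replaceRare`, and `node` are made up for illustration. The point it demonstrates is why `tree = newRareArray` has no effect: Java passes object references by value, so assigning to the parameter only rebinds the local variable, whereas `node.set(i, ...)` changes the list the caller also holds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a Gson tree: a node is a String or a List<Object>.
class RareWordReplacer {
    static final String RARE = "_RARE_";

    @SuppressWarnings("unchecked")
    static void replaceRare(List<Object> node, Set<String> rareWords) {
        // Index 0 is the tag and is never touched, so start at 1.
        for (int i = 1; i < node.size(); i++) {
            Object child = node.get(i);
            if (child instanceof List) {
                replaceRare((List<Object>) child, rareWords);
            } else if (child instanceof String && rareWords.contains(child)) {
                node.set(i, RARE); // in-place mutation, visible to the caller
            }
        }
    }

    // Small helper that builds a mutable node.
    static List<Object> node(Object... children) {
        return new ArrayList<>(Arrays.asList(children));
    }

    public static void main(String[] args) {
        Set<String> rareWords =
            new HashSet<>(Arrays.asList("late", "populate", "convicts"));
        List<Object> tree =
            node("NP", node("DET", "the"),
                 node("NP", node("ADJ", "late"), node("NOUN", "1700<s")));
        replaceRare(tree, rareWords);
        System.out.println(tree); // [NP, [DET, the], [NP, [ADJ, _RARE_], [NOUN, 1700<s]]]
    }
}
```

The `HashSet` gives constant-time exact-match lookups, which matters with a list of several thousand words. With Gson itself, the analogous in-place step would presumably be `JsonArray.set(int, JsonElement)` with a `JsonPrimitive`, in versions of the library that provide it.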

sehe
  • +1 for the code, thanks! Since I don't know much C++, would it be possible for you to modify your code so that I can pass rareWords through a file? Actually, I shortened my question for readability; in reality my rare-word list contains 3435 words, and some of them contain . or * (for example S. U.S.A. A*), which was messing up String.replaceAll regex matching. I will accept this answer after trying the updated code. – Watt Apr 19 '13 at 02:53
  • Yeah, reading the words from a file is pretty trivial. However, I'd really want exact examples of what to match (do you always want _exact matches_ of _whole strings_?) – sehe Apr 19 '13 at 02:55
  • Yes, exact matches of the whole string. For example: if "xyz." is in the rare-word list, then only "xyz." should be replaced with "_RARE_", not even "xyz". And if some array is like ["xyz.","xyz." ] it should become ["xyz.", "_RARE_"]. Notice that only the second string in a branch array gets replaced; we never touch the first one. Another drawback of the replaceAll method was that it could potentially replace the first string. I am going to modify the question to draw the tree for more clarity. I can share the whole input and rareWord file if you need. – Watt Apr 19 '13 at 03:06
  • @Watt updated with `readRareWordList()` and showing how to do exact matches only. **EDIT** also skipping the first element in every array now (see comment) **and** removed the regex matching that I picked from your code, but wasn't what you wanted after all. – sehe Apr 19 '13 at 03:11
  • Thank you! Going to try this now.. will be back in 10-15 min. – Watt Apr 19 '13 at 03:12
  • For some reason, it is complaining on apply_visitor "function couldn't be resolved". Fixing it. – Watt Apr 19 '13 at 04:27
  • I gave up setting up C++ env. I figured out Java way of doing it. But, I will still accept your answer to give you credit for your hard work. – Watt Apr 19 '13 at 05:25
  • Okay, glad to hear it works. I just noticed that the problem changed from 'string matching' to 'wordlist set membership testing', so it could be much more efficient using a (hash) set: https://github.com/sehe/spirit-v2-json/blob/16093940/test.cpp – sehe Apr 19 '13 at 06:29