1

Suppose I have a JavaScript array of elements that looks something very similar to:

var oui = new Array({
    "pfx": "000000",
    "mask": 24,
    "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"
},{
    "pfx": "000001",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000002",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000003",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000004",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000004",
    "mask": 24,
    "desc": "Let's pretend this is a repeat"
   });

Imagine now that the file is very large, and some of the "pfx" values are repeated throughout the data set. Obviously manual de-duping is out of the question, so I'm trying to figure out the best way to approach it programmatically. How can I write a python script to read in the .JS file containing this data set to de-dupe and remove any duplicates? In other words, I would like to read in the JS file, parse the array, and produce another JavaScript file with a similar array, but only unique values for the pfx variable.

I've gone through a couple of other Stack Overflow questions that are similar in nature, but nothing seems to quite fit this case. In my python testing, I can rarely just get the pfx variables by themselves to remove the duplicates, or Python struggles to read it in as a proper JSON object (even without the "var" and "new Array" portion). I should also note, that the reason that I'm doing the de-duping in Python over another JavaScript function within the JS file (which I tried following examples like this) is that it just inflates the size of the JavaScript that has to be loaded onto the page.

In the future, the array is likely to continue grow - thus to avoid unnecessary loading of JavaScript to keep page response times quick, I figured this was a step that could, and should, be performed offline, and added to the page.

For clarification, here is a model of the website I'm trying to mock up: https://www.wireshark.org/tools/oui-lookup.html. It is very simple in nature.

Research:

Convert Javascript array to python list?

Remove duplicate values from JS array

halfer
  • 19,824
  • 17
  • 99
  • 186
Billy Thorton
  • 213
  • 1
  • 4
  • 12
  • If the structure is not nested, I'd match the first `{` to the last `}` with a regex, then use `re.sub`, parse the JSON, de-dupe in Python, and stringify and return it, de-duped – CertainPerformance Sep 24 '19 at 00:32
  • what should happen if pfx is a duplicate? should its whole parent set be omitted? – Robert Kearns Sep 24 '19 at 00:32
  • I thought of using regex, but I thought there would be a much more simple way to do it. Additionally, I figured it would take more logic than perhaps other means – Billy Thorton Sep 24 '19 at 00:49
  • If the pfx is a duplicate, the whole set should be emitted. So in the example above:
     {
        "pfx": "000004",
        "mask": 24,
        "desc": "Let's pretend this is a repeat"
       } 
    
    
    should be omitted
    – Billy Thorton Sep 24 '19 at 00:50
  • To clarify, you're serving these .js files from a Python backend? Is there a reason this data isn't being sent via AJAX as raw JSON and is hardcoded into the file, yet dynamic enough that a script needs to dedupe it? Seems like a possible [x-y problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – ggorlen Sep 24 '19 at 00:51
  • No, it's a pretty simple website. HTML calls a Javascript function to perform a quick lookup in the JS file containing the array. The Python portion is to prevent removing duplicates from within the script itself. I assumed that would overall be slower by adding to the page, than performing the de-duping in Python outside of the JavaScript – Billy Thorton Sep 24 '19 at 00:53
  • I don't really follow--is this a static website with no backend? How can Python have anything to do to work on a frontend environment like you describe? As an aside, there's no reason to use `new Array`, just `[]` is sufficient. – ggorlen Sep 24 '19 at 00:54
  • Yes it is. Backend could be as simple as an nginx server. Here is an example of what I'm trying to develop: https://www.wireshark.org/tools/oui-lookup.html – Billy Thorton Sep 24 '19 at 00:57
  • OK, so Python is going to be used to dedupe the structure on your machine before you deploy the static page to something like github pages? If this is the case and you can run the .js file, you can setify it, dump the data with `JSON.stringify`, dedupe and paste it back in. Or use a build script to do this for you, probably no need even for Python especially. – ggorlen Sep 24 '19 at 00:58
  • Bingo! That's exactly what I'm thinking. I could be wrong in my way of going about it, but I figured that in the long run, it's code that's unnecessary for the JavaScript page, and would just lead to slower load times. I would like to be a part of a build script as you suggest, but I don't know how to do that programmatically, or as you are suggesting. – Billy Thorton Sep 24 '19 at 00:59

1 Answers1

1

Since the structure is not nested, you can match the array with a regular expression, then parse it with JSON, remove duplicate objects with filter in Python, and then replace with the deduplicated JSON string.

Use array literal syntax ([ and ]) rather than new Array to keep things cleaner (best never to use new Array):

import re
import json
str = '''
var oui = [{
    "pfx": "000000",
    "mask": 24,
    "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"
},{
    "pfx": "000001",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000002",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000003",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000004",
    "mask": 24,
    "desc": "Xerox  Xerox Corporation"
},{
    "pfx": "000004",
    "mask": 24,
    "desc": "Let's pretend this is a repeat"
   }];
'''

def dedupe(match):
   jsonStr = match.group()
   list = json.loads(jsonStr)
   seenPfxs = set()
   def notDupe(obj):
        thisPfx = obj['pfx']
        if thisPfx in seenPfxs:
            return False
        seenPfxs.add(thisPfx)
        return True
   return json.dumps([obj for obj in list if notDupe(obj)])

dedupedStr = re.sub(r'(?s)\[[^\]]+\](?=;)', dedupe, str)
print(dedupedStr)

Output:

var oui = [{"pfx": "000000", "mask": 24, "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"}, {"pfx": "000001", "mask": 24, "desc": "Xerox  Xerox Corporation"}, {"pfx": "000002", "mask": 24, "desc": "Xerox  Xerox Corporation"}, {"pfx": "000003", "mask": 24, "desc": "Xerox  Xerox Corporation"}, {"pfx": "000004", "mask": 24, "desc": "Xerox  Xerox Corporation"}];

If possible, you might consider storing the data in a separate tag, rather than being in inline Javascript - it'll be more maintainable. Eg, in your HTML, instead of

var oui = [{
    "pfx": "000000",
    "mask": 24,
    "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"
},{

consider something like

var oui = JSON.parse(document.querySelector('[data-oui').textContent);
console.log(oui);
<script data-oui type="application/json">[{
    "pfx": "000000",
    "mask": 24,
    "desc": "00:00:00   Officially Xerox, but 0:0:0:0:0:0 is more common"
}]</script>

Then you don't have to dynamically change the Javascript, but only the <script data-oui type="application/json"> tag.

CertainPerformance
  • 356,069
  • 52
  • 309
  • 320
  • Awesome answer! Thank you very much! So originally I had the `New Array` piece in a separate JS file instead of inline of the HTML with a script tag. Should I put it inline? Additionally, would you mind explaining why I should never use the `New Array` syntax? – Billy Thorton Sep 24 '19 at 23:03
  • The issue is getting dynamic data from the server to the client. It's up to you. Inline Javascript can be a bit harder to work with compared to the other main methods (a network request, or a separate Javascript file) *but* inline Javascript means (1) no caching problems (client won't receive outdated data, though this is fixable) (2) no extra network ping needed, making the page able to process the data a bit quicker, which is a definite slight plus. – CertainPerformance Sep 24 '19 at 23:40
  • As for avoiding `new Array`, see https://stackoverflow.com/a/1800594/ and https://www.google.com/search?q=never+use+new+array – CertainPerformance Sep 24 '19 at 23:41