-1

I have a dictionary that contains a list of expanded classnames as the keys with each key pointing to an list containing the number of times the class appears in different jars and what jars it appears in.

For example:

classToJars = {
  'com.sun.xml.ws.policy.PolicyMapKey.class' :  [ 1, 'policy-2.3.1.jar'],
  'com.sun.xml.ws.policy.PolicyMerger.class' :  [ 1, 'policy-2.3.1.jar'],
  'com.sun.xml.ws.policy.PolicyAssertion.class' : [ 1,  'policy-2.3.1.jar' ],

  'com.sun.xml.bind.AccessorFactory.class'     : [1, 'jaxb-impl-2.2.6.jar'],
  'com.sun.xml.bind.AccessorFactoryImpl.class' : [1, 'jaxb-impl-2.2.6.jar'],
  'com.sun.xml.bind.AnyTypeAdapter.class'      : [1, 'jaxb-impl-2.2.6.jar' ],

  'org.apache.mina.integration.jmx.IoSessionManager.class' :      [1, 'mina-integration-jmx-1.1.7.jar'],
  'org.apache.mina.integration.jmx.IoServiceManager.class' :      [1, 'mina-integration-jmx-1.1.7.jar'],

  'org.apache.log4j.Appender.class' : [2, 'log4j-1.2.14.jar', 'log4j-1.2.15.jar'],
  'org.apache.log4j.AppenderSkeleton.class' : [2, 'log4j-1.2.14.jar', 'log4j-1.2.15.jar'],
  
  'com.sun.activation.registries.LineTokenizer.class' : [1, 'activation-1.1.jar'],
  'com.sun.activation.registries.LogSupport.class' : [1, 'activation-1.1.jar'],

  'com.sun.istack.Builder.class' : [2, 'jaxb-impl-2.2.6.jar istack-commons-runtime-2.4.jar'],
  'com.sun.istack.ByteArrayDataSource.class' : [2, 'jaxb-impl-2.2.6.jar istack-commons-runtime-2.4.jar'],
  
  'com.reuters.rfa.ansipage.Page.class' : [1, 'rfa-7.2.0.E2.jar'],
  'com.reuters.rfa.ansipage.PageUpdate.class' : [1, 'rfa-7.2.0.E2.jar'],

  'org.apache.http.impl.io.AbstractMessageWriter.class' : [1, 'rfa-7.2.0.E2.jar'],
  'org.apache.http.impl.io.ChunkedOutputStream.class' : [1, 'rfa-7.2.0.E2.jar']
}

The is a large dict with thousands of keys and values looped over a large set of jars. The idea is to to be able to fold the dict where if the values are the same, then fold it to the largest common substring.

For example: when I run the folding function, the above hash should be reduced to 4 lines as follows:

'com.sun.xml.ws.policy'    :  [ 1, 'policy-2.3.1.jar'],
'com.sun.xml.bind'         :  [1, 'jaxb-impl-2.2.6.jar'],
'org.apache.mina.integration.jmx'  :  [1, 'mina-integration-jmx-1.1.7.jar'],
'org.apache.log4j'      :  [2, 'log4j-1.2.14.jar', 'log4j-1.2.15.jar'],
'com.sun.activation.registries'  :  [1, 'activation-1.1.jar'],
'com.sun.istack'      :  [2, 'jaxb-impl-2.2.6.jar istack-commons-runtime-2.4.jar'],
'com.reuters.rfa.ansipage'    :  [1, 'rfa-7.2.0.E2.jar'],
'org.apache.http.impl.io'    :  [1, 'rfa-7.2.0.E2.jar'],

and so on.

since there is nothing common between com.reuters.rfa and org.apache.http, it will come back with an empty key if you go for largest common substring.

In a case like that, it should simply paste com.reuters.rfa and org.apache.http separately.

Any ideas on how to achieve this?

soothsayer
  • 333
  • 2
  • 6
  • 14

1 Answers1

1

Is this what you want?

import os

classToJars = {
  'com.sun.xml.ws.policy.PolicyMapKey.class' :  [ 1, 'policy-2.3.1.jar'],
  'com.sun.xml.ws.policy.PolicyMerger.class' :  [ 1, 'policy-2.3.1.jar'],
  'com.sun.xml.ws.policy.PolicyAssertion.class' : [ 1,  'policy-2.3.1.jar' ],

  'com.sun.xml.bind.AccessorFactory.class'     : [1, 'jaxb-impl-2.2.6.jar'],
  'com.sun.xml.bind.AccessorFactoryImpl.class' : [1, 'jaxb-impl-2.2.6.jar'],
  'com.sun.xml.bind.AnyTypeAdapter.class'      : [1, 'jaxb-impl-2.2.6.jar' ],

  'org.apache.mina.integration.jmx.IoSessionManager.class' :
      [1, 'mina-integration-jmx-1.1.7.jar'],
  'org.apache.mina.integration.jmx.IoServiceManager.class' :
      [1, 'mina-integration-jmx-1.1.7.jar'],
  'org.apache.mina.integration.jmx.IoSessionManagerMBean.class' :
      [1, 'mina-integration-jmx-1.1.7.jar' ],

  'org.apache.log4j.Appender.class' : [2, 'log4j-1.2.14.jar', 'log4j-1.2.15.jar'],
  'org.apache.log4j.AppenderSkeleton.class' : [2, 'log4j-1.2.14.jar', 'log4j-1.2.15.jar'],
  'org.apache.log4j.AsyncAppender.class' : [2, 'log4j-1.2.14.jar log4j-1.2.15.jar'],
  # ...
  }

#
# from http://stackoverflow.com/a/21419164/866915
#
def common_prefix(names):
  prefix = os.path.commonprefix( [ n.split('.') for n in names ] )
  return '.'.join(prefix)

# return the first 3 components of a class name
def min_prefix(name):
  return '.'.join( name.split('.')[0:3] )

jarsForKey = {}
keyForClass = {}

for c in classToJars:
  jars = classToJars[c]
  s = '|'.join(jars[1:])
  jarsForKey[s] = classToJars[c]
  keyForClass[c] = s

# group together classes based on their key

sameKey = {}
for c in classToJars:
  s = keyForClass[c]
  sameKey.setdefault(s,[]).append(c)

# for each group of classes with the same key, find the largest common substring

for s in sameKey:
  cls = sameKey[s]  # all of the classes with the same key
  jars = jarsForKey[s]

  # partition cls into groups having at least 3 components in common

  group = {}
  for c in cls:
    m = min_prefix(c)
    group.setdefault(m, []).append(c)

  # find the common prefix for each group

  for m in group:
    cls = group[m]
    prefix = common_prefix(cls)
    print prefix, "==>", jars
ErikR
  • 51,541
  • 9
  • 73
  • 124
  • I am so glad somebody else thought of this logic. Although I have to say there are issues with this approach. And I genuinely apologize as I should have given more input data. Let me add more input and once you run the program you will understand better. – soothsayer Dec 17 '14 at 21:57
  • When you add the above entries to your dict, since there is nothing common between com.reuters.rfa and org.apache.http, it will come back with an empty key – soothsayer Dec 17 '14 at 22:00
  • Just put it in your original question - it's too hard to read in a comment box. – ErikR Dec 17 '14 at 22:00
  • my bad. Updated the original post with some more data. – soothsayer Dec 17 '14 at 22:32
  • I added a heuristic so two class names must have at least three name components in common to be considered part of the same group. – ErikR Dec 18 '14 at 00:16