How to parse/unzip/unpack Maven repository indexes generated by Nexus

Question

I have downloaded the indexes generated for Maven Central from http://mirrors.ibiblio.org/pub/mirrors/maven2/dot-index/nexus-maven-repository-index.gz

I would like to list artifact information from these index files (groupId, artifactId, and version, for example). I have read that there is a high-level API for this, and it seems that I need the following Maven dependency. However, I don't know which class is the entry point or how to use it to read these files:

<dependency>
    <groupId>org.sonatype.nexus</groupId>
    <artifactId>nexus-indexer</artifactId>
    <version>3.0.4</version>
</dependency>

score 9 · Accepted Answer · edited May 10 '16 at 19:14

Take a peek at https://github.com/cstamas/maven-indexer-examples project.

In short: you don't need to download the GZ/ZIP (new/legacy format) manually; the indexer will take care of that for you (moreover, it will handle incremental updates too, where possible).

GZ is the "new" format. It is independent of the Lucene index format (and hence of the Lucene version) and contains data only. ZIP is the "old" format, which is simply a plain Lucene 2.4.x index zipped up. No change to the data content is happening currently, but one is planned for the future.

As I said, there is no difference in data content between the two, but some fields (as you noticed) are indexed but not stored in the index. Hence, if you consume the ZIP format, those fields will be searchable but not retrievable.

Appears the examples project has "moved" see http://stackoverflow.com/a/28087134/32453 — rogerdpack, May 10 '16 at 19:10

Ondra Žižka · Answer 2 · 2016-05-10T21:17:58.090

The project at https://github.com/cstamas/maven-indexer-examples is obsolete, and its build fails (the tests do not pass).

The Nexus Indexer has since moved to Apache, and the examples are now included with it: https://github.com/apache/maven-indexer/tree/master/indexer-examples

That builds, and the code works.

Here is a simplified version if you want to roll your own:

Maven:

<dependencies>
    <dependency>
        <groupId>org.apache.maven.indexer</groupId>
        <artifactId>indexer-core</artifactId>
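        <!-- 6.0-SNAPSHOT is not published to Maven Central; it is built from https://github.com/windup/maven-indexer -->
        <version>6.0-SNAPSHOT</version>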
        <scope>compile</scope>
    </dependency>

    <!-- For ResourceFetcher implementation, if used -->
    <dependency>
        <groupId>org.apache.maven.wagon</groupId>
        <artifactId>wagon-http-lightweight</artifactId>
        <version>2.3</version>
        <scope>compile</scope>
    </dependency>

    <!-- Runtime: DI, but using Plexus Shim as we use Wagon -->
    <dependency>
        <groupId>org.eclipse.sisu</groupId>
        <artifactId>org.eclipse.sisu.plexus</artifactId>
        <version>0.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.sonatype.sisu</groupId>
        <artifactId>sisu-guice</artifactId>
        <version>3.2.4</version>
    </dependency>
</dependencies>

Java:

public IndexToGavMappingConverter(File dataDir, String id, String url)
    throws PlexusContainerException, ComponentLookupException, IOException
{
    this.dataDir = dataDir;

    // Create Plexus container, the Maven default IoC container.
    final DefaultContainerConfiguration config = new DefaultContainerConfiguration();
    config.setClassPathScanning( PlexusConstants.SCANNING_INDEX );
    this.plexusContainer = new DefaultPlexusContainer(config);

    // Lookup the indexer components from plexus.
    this.indexer = plexusContainer.lookup( Indexer.class );
    this.indexUpdater = plexusContainer.lookup( IndexUpdater.class );
    // Lookup wagon used to remotely fetch index.
    this.httpWagon = plexusContainer.lookup( Wagon.class, "http" );

    // Files where local cache is (if any) and Lucene Index should be located
    this.centralLocalCache = new File( this.dataDir, id + "-cache" );
    this.centralIndexDir = new File( this.dataDir,   id + "-index" );

    // Creators we want to use (search for fields it defines).
    // See https://maven.apache.org/maven-indexer/indexer-core/apidocs/index.html?constant-values.html
    List<IndexCreator> indexers = new ArrayList<>();
    // https://maven.apache.org/maven-indexer/apidocs/org/apache/maven/index/creator/MinimalArtifactInfoIndexCreator.html
    indexers.add( plexusContainer.lookup( IndexCreator.class, "min" ) );
    // https://maven.apache.org/maven-indexer/apidocs/org/apache/maven/index/creator/JarFileContentsIndexCreator.html
    //indexers.add( plexusContainer.lookup( IndexCreator.class, "jarContent" ) );
    // https://maven.apache.org/maven-indexer/apidocs/org/apache/maven/index/creator/MavenPluginArtifactInfoIndexCreator.html
    //indexers.add( plexusContainer.lookup( IndexCreator.class, "maven-plugin" ) );

    // Create context for central repository index.
    this.centralContext = this.indexer.createIndexingContext(
            id + "Context", id, this.centralLocalCache, this.centralIndexDir,
            url, null, true, true, indexers );
}


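    // Elsewhere in the same class: walk every live document in the index
    // and write one "sha1 groupId:artifactId:version:classifier" line per artifact to `out`.
    final IndexSearcher searcher = this.centralContext.acquireIndexSearcher();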
    try
    {
        final IndexReader ir = searcher.getIndexReader();
        Bits liveDocs = MultiFields.getLiveDocs(ir);
        for ( int i = 0; i < ir.maxDoc(); i++ )
        {
            if ( liveDocs == null || liveDocs.get( i ) )
            {
                final Document doc = ir.document( i );
                final ArtifactInfo ai = IndexUtils.constructArtifactInfo( doc, this.centralContext );

                if (ai == null)
                    continue;
                if (ai.getSha1() == null)
                    continue;
                if (ai.getSha1().length() != 40)
                    continue;
                if ("javadoc".equals(ai.getClassifier()))
                    continue;
                if ("sources".equals(ai.getClassifier()))
                    continue;

                out.append(StringUtils.lowerCase(ai.getSha1())).append(' ');
                out.append(ai.getGroupId()).append(":");
                out.append(ai.getArtifactId()).append(":");
                out.append(ai.getVersion()).append(":");
                out.append(StringUtils.defaultString(ai.getClassifier()));
                out.append('\n');
            }
        }
    }
    finally
    {
        this.centralContext.releaseIndexSearcher( searcher );
    }

We use this in the Windup project - JBoss migration tool.
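
The loop above writes one line per artifact in the form `sha1 groupId:artifactId:version:classifier`. As a rough sketch of how such a file could be read back for SHA-1 lookups, the snippet below parses those lines into a map. The class name `Sha1ToGavIndex` and its method are hypothetical and are not part of maven-indexer or Windup; the only thing assumed about the input is the line format produced by the loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps a lowercase SHA-1 to the coordinates written after it. */
final class Sha1ToGavIndex {

    private Sha1ToGavIndex() {
    }

    /** Parses lines of the form "sha1 groupId:artifactId:version:classifier". */
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> bySha1 = new LinkedHashMap<>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            // The writer only emits 40-character SHA-1 values, so the first
            // space must sit at index 40; anything else is skipped as malformed.
            if (space != 40) {
                continue;
            }
            bySha1.put(line.substring(0, space), line.substring(space + 1));
        }
        return bySha1;
    }
}
```

The lines can come from `java.nio.file.Files.readAllLines(path)`. Note that the coordinates end with a trailing `:` when the artifact has no classifier, because the writer appends an empty string in that case.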

And where exactly is version 6.0-SNAPSHOT of the indexer-core deployed? I'm getting compile errors because Maven is not finding that jar file. – Robert Reiz, Jun 02 '15 at 10:26
Oh, right. That's our own. We needed version 6 before it was released. See https://github.com/windup/maven-indexer . – Ondra Žižka, Apr 22 '16 at 03:47
Even the new link to the examples needs the 6.0-SNAPSHOT dependency, see https://github.com/cstamas/maven-indexer-examples/issues/4 (but it does run once you have that). – rogerdpack, May 10 '16 at 19:25
I know, we simply roll our own version. You can get it [here](https://github.com/windup/maven-indexer) and build it. Sorry for the inconvenience, but we have no time to make it a normal fork with releases to Central. – Ondra Žižka, May 10 '16 at 20:55
BTW, I've done some refactoring in our [nexus-repository-indexer](https://github.com/windup/nexus-repository-indexer). Check the /nexus-indexer submodule; there are a few classes which consume the Nexus index. I haven't found a way to consume the "Jar" data, only the "Min" data. The Jar data supposedly contains metadata about the contained classes. – Ondra Žižka, May 10 '16 at 21:15

score 2 · Answer 3 · edited May 10 '16 at 19:11


The legacy ZIP index is a plain Lucene index. I was able to open it with Luke and write some simple Lucene code to dump the fields of interest ("u" in this case):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;

public class Dumper {
    public static void main(String[] args) throws Exception {
        // IndexSearcher(String path) is the old Lucene 2.x constructor, matching the 2.4.x index format.
        IndexSearcher searcher = new IndexSearcher("c:/PROJECTS/Test/index");
        for (int i = 0; i < searcher.maxDoc(); i++) {
            Document doc = searcher.doc(i);
            String metadata = doc.get("u");
            if (metadata != null) {
                System.out.println(metadata);
            }
        }
    }
}

Sample output ...

org.ioke|ioke-lang-lib|P-0.4.0-p11|NA
org.jboss.weld.archetypes|jboss-javaee6-webapp|1.0.1.CR2|sources|jar
org.jboss.weld.archetypes|jboss-javaee6-webapp|1.0.1.CR2|NA
org.nutz|nutz|1.b.37|javadoc|jar
org.nutz|nutz|1.b.37|sources|jar
org.nutz|nutz|1.b.37|NA
org.openengsb.wrapped|com.google.gdata|1.41.5.w1|NA
org.openengsb.wrapped|openengsb-wrapped-parent|6|NA

There may be better ways to achieve this though ...
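
To work with the dumped values rather than just print them, each "u" string can be split on `|`. The sketch below is a minimal, hypothetical helper (the class `Uinfo` is not part of nexus-indexer or Lucene). It assumes, based only on the sample output above, that the layout is `groupId|artifactId|version|classifier` with an optional fifth `extension` part, and that `NA` in the classifier position means the artifact has no classifier.

```java
import java.util.Optional;

/** Hypothetical value class for one "u" field value from the index. */
final class Uinfo {
    final String groupId;
    final String artifactId;
    final String version;
    final Optional<String> classifier;
    final Optional<String> extension;

    private Uinfo(String groupId, String artifactId, String version,
                  Optional<String> classifier, Optional<String> extension) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
        this.classifier = classifier;
        this.extension = extension;
    }

    static Uinfo parse(String u) {
        // Limit -1 keeps trailing empty parts instead of silently dropping them.
        String[] parts = u.split("\\|", -1);
        if (parts.length < 4 || parts.length > 5) {
            throw new IllegalArgumentException("Unexpected 'u' value: " + u);
        }
        // Assumption: "NA" marks an absent classifier, as the sample output suggests.
        Optional<String> classifier =
                "NA".equals(parts[3]) ? Optional.empty() : Optional.of(parts[3]);
        Optional<String> extension =
                parts.length == 5 ? Optional.of(parts[4]) : Optional.empty();
        return new Uinfo(parts[0], parts[1], parts[2], classifier, extension);
    }

    /** Renders groupId:artifactId:version, plus :classifier when one is present. */
    @Override
    public String toString() {
        return groupId + ":" + artifactId + ":" + version
                + classifier.map(c -> ":" + c).orElse("");
    }
}
```

In the dump loop, `Uinfo.parse(metadata)` could replace the `println` of the raw string; for `org.nutz|nutz|1.b.37|sources|jar` it yields the classifier `sources` and the extension `jar`.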

answered Apr 25 '11 at 10:46 by qwerty · edited May 10 '16 at 19:11 by rogerdpack

Thanks for your help. Do you know what the difference is between the zip and the gz one? I know that the gz uses a proprietary binary format, but does it contain more information? Do you know how to iterate over it? – Laurent Apr 25 '11 at 11:03
Not really sure of the internal differences. Googling says that the gz is optimized for speed. If you'd like to use it as input, there seems to be a sample available at https://github.com/sonatype/nexus/tree/master/sandbox/nexus-indexer-sample – qwerty Apr 25 '11 at 11:16
I found that the gz contains much more information, such as the JAR file size, the description, the classes contained in the JAR, and so on. I tried to parse the gz version as in your example, but that is not possible because the gz archive does not contain Lucene segments. Also, before posting my question here I had already seen the GitHub link you pointed out. However, from that example I cannot work out which class to use as the entry point, given that I have the unzipped gz file. – Laurent Apr 25 '11 at 13:58
Hmm, I believed this alternate method would solve your problem. You can retrieve the jar from the index using the Lucene API and get its size and the classes it contains. Best of luck finding a nexus-indexer implementation. – qwerty Apr 26 '11 at 04:47
the link to the .zip file is dead, I guess we can declare the "legacy zip index" as basically also dead, nowadays? – rogerdpack May 10 '16 at 19:10
What does `NA` mean in this context? – Alex Reinking May 18 '17 at 22:57

Boris Baldassari · Answer 4 · 2021-12-09T08:06:24.537

score 0

For the record, there is now a tool to extract and export Maven indexes as text files: the Maven index exporter. It is available as a Docker image, and no code is required.

It downloads all .gz index files, extracts the indexes using the maven-indexer CLI, and exports them to a text file with clue. It has been tested on Maven Central and works on many other Maven repositories.

answered Dec 09 '21 at 07:59 · edited Dec 09 '21 at 08:06
