Read-mostly data structure to compress and search source code

Question

Suppose you have a lot of source code (like 50GB+) in popular languages (Java, C, C++, etc).

The project needs are:

compressing source code to reduce disk use and disk I/O
indexing it in such way that particular source file can be extracted from the compressed source without decompressing the whole thing
compression time for the whole codebase is not important
search and retrieval time (and memory use when searching and retrieving) are important

This SO answer contains potential answers: What are the lesser known but useful data structures?

However, this is just a list of potentials - I do not know how those structures actually evaluate against requirements listed above.

Question: what are data structures (and their implementations) that would perform well according to the aforementioned requirements?

@JoachimPileborg Can I search contents of LZ-compressed text for say strings without decompressing entire ZIP? — LetMeSOThat4U, Sep 11 '14 at 07:48
I'd check what Open Source _Source Control_ programs (like SVN et simili) do to handle that, I suppose they have same requirements and many people worked to solve them... — Adriano Repetti, Sep 11 '14 at 07:48
Yes, each file in a ZIP archive is compressed (or not) and stored individually, so you can quickly and easily extract a single file from the archive. [Read more about ZIP archives](http://en.wikipedia.org/wiki/Zip_%28file_format%29). There exists tools and libraries for all major operating systems to handle ZIP archives. — Some programmer dude, Sep 11 '14 at 08:01

score 1 · Answer 1 · answered Sep 11 '14 at 12:13

The main data-structure used for searching is the inverted list. Fortunately, you don't need to implement it yourself. Lucene is a widely used search tool which works with inverted lists internally.

Using Lucene you can create a document with multiple fields. The idea is that some of these fields will be searchable with standard keyword-type queries.

I've implemented a source code search utility which I'll now describe briefly in the following paragraphs. The entire source code itself is stored as a non-indexable field named "code" (you can modify the source to store a compressed version).

For the retrieval part, note that the keywords that you are going to use for the search could be names of function, classes, packages or variables. They could also be words from the comments and so on. In my implementation, I extracted these information from using a Java annotated syntax tree (AST). You could do the same for other languages as well by making use of an appropriate parser to construct an AST.

Another possibility is the query-by-example (QBE) paradigm, where you could use a small snippet of code to search approximately similar snippets from your indexed code-base. This is particularly helpful for detecting source-code reuse and plagiarism, the main purpose for which I developed the tool.

The project page is here. I call it the YASOCS (Yet Another SOurce Code Searcher).

The search is very fast since it uses the inverted list. You could also use Luke (an open source Lucene index visualizer) to "see" the index yourself and execute test queries using the interface.

Thanks I'm looking at it now, but you dont' have a license in there - well that makes you potentially liable because there's not even disclaimer. ;-) — LetMeSOThat4U, Sep 11 '14 at 12:45
Didn't have a chance to construct a license agreement :) the software is intended to be free to use... — Debasis, Sep 11 '14 at 15:26

Read-mostly data structure to compress and search source code

1 Answers1