47

I really like data.frames in R: you can store different types of data in one data structure, there are many methods for modifying the data (adding a column, combining data.frames, ...), and it is easy to extract a subset of the data.

Is there a Java library that offers the same functionality? I'm mostly interested in storing different types of data in a matrix-like fashion and being able to extract a subset of the data.

A two-dimensional array in Java provides a similar structure, but it is much more difficult to add a column and then extract the top k records.

Michael
  • In a previous job, I wrote a similar library, using `Object[][]` (in row-major mode) to store the data. You can easily write methods for all the operations you need (write them one at a time, as the need arises): adding a column, extracting a column, adding a row, extracting a row, `cbind`, `rbind`, `merge`, `ddply`, etc. Since they were inspired by R, most of those methods took a function as an argument: for instance, to add a new column, I would provide a function to compute the value from the rest of the row; to extract rows, I would provide a predicate to indicate which rows to keep. – Vincent Zoonekynd Dec 12 '13 at 13:31
  • That was my idea, too. But I thought maybe there is already a library that supports all the functionality so that I don't have to re-implement it :) – Michael Dec 12 '13 at 17:40
  • I would also love to have a Java class that implements a data.frame. – stackoverflowuser2010 Feb 22 '14 at 20:35
  • I am looking for the same thing but could not find anything so far. The best I could find was [this Stack Overflow thread](http://stackoverflow.com/questions/7451716/java-r-integration) on calling R from Java. – Zhubarb Mar 05 '14 at 08:45
  • If Scala is an option, Saddle looks very attractive. – David Maust May 25 '14 at 20:38
  • I have just published v0.7 of my paleo library, which offers memory efficient, type-safe Java data frames (see answer below) – Rahel Lüthy Feb 04 '16 at 19:47
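
The `Object[][]` approach described in the first comment above can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: the class `MiniFrame` and its methods `addColumn`, `filter`, `sortBy`, and `head` are hypothetical names chosen for this sketch, not the API of any library mentioned in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical row-major mini data frame; not taken from any existing library. */
final class MiniFrame {
    private final String[] names;   // column names
    private final Object[][] rows;  // rows[i][j] holds the value of column j in row i

    MiniFrame(String[] names, Object[][] rows) {
        this.names = names;
        this.rows = rows;
    }

    int rowCount() { return rows.length; }
    int columnCount() { return names.length; }
    Object get(int row, int col) { return rows[row][col]; }

    /** Returns a new frame with one extra column computed from each existing row. */
    MiniFrame addColumn(String name, Function<Object[], Object> compute) {
        String[] newNames = Arrays.copyOf(names, names.length + 1);
        newNames[names.length] = name;
        Object[][] newRows = new Object[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            newRows[i] = Arrays.copyOf(rows[i], names.length + 1);
            newRows[i][names.length] = compute.apply(rows[i]);
        }
        return new MiniFrame(newNames, newRows);
    }

    /** Returns a new frame containing only the rows accepted by the predicate. */
    MiniFrame filter(Predicate<Object[]> keep) {
        List<Object[]> kept = new ArrayList<>();
        for (Object[] row : rows) {
            if (keep.test(row)) {
                kept.add(row);
            }
        }
        return new MiniFrame(names, kept.toArray(new Object[0][]));
    }

    /** Returns a new frame with the rows sorted by the given ordering. */
    MiniFrame sortBy(Comparator<Object[]> order) {
        Object[][] sorted = rows.clone();
        Arrays.sort(sorted, order);
        return new MiniFrame(names, sorted);
    }

    /** Returns the first k rows (or all rows if there are fewer than k). */
    MiniFrame head(int k) {
        return new MiniFrame(names, Arrays.copyOf(rows, Math.min(k, rows.length)));
    }

    public static void main(String[] args) {
        MiniFrame people = new MiniFrame(
                new String[] {"name", "age"},
                new Object[][] {{"Ann", 31}, {"Bob", 17}, {"Cy", 45}});
        MiniFrame adults = people
                .addColumn("adult", r -> ((Integer) r[1]) >= 18)
                .filter(r -> (Boolean) r[2]);
        System.out.println(adults.rowCount() + " adult rows");
    }
}
```

Each operation returns a new frame instead of mutating the old one, which keeps the sketch simple at the cost of copying. Extracting the top k records, as asked in the question, then becomes `sortBy(...)` followed by `head(k)`. The casts show the main drawback of storing everything as `Object`: type errors only surface at run time.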

6 Answers

18

Tablesaw (https://github.com/jtablesaw/tablesaw) is a Java dataframe library begun in 2015 and under active development (as of 2018). It's designed to be as scalable as possible without sacrificing ease of use. Features include filtering by rows and columns, descriptive stats, map/reduce functions, cross-tabs, plots, and machine learning. It is released under the Apache license.

In one query test, it returned 500+ records from a half-billion-record table in 2 ms.

Contributions, feature requests, and feedback are welcome.

L. Blanc
  • I gave it a major chance, but coming from use of both R and Pandas it is a very frustrating dataframe framework to use. The documentation is sparse and what is obvious in Pandas may not even be possible in Tablesaw. I am still hunting for the elusive JVM Dataframe framework that makes me happy like Pandas did. – horcle_buzz Mar 28 '18 at 02:01
  • @horcle_buzz Sorry you found it frustrating. It's true that it's not well documented, and there are inconsistencies in the API (I'm removing them). But I haven't seen any pandas examples that couldn't be done with comparable effort in Java with Tablesaw. It would be very helpful to have examples to learn from. If you know of any, open an issue on GitHub with pandas code and a dataset. – L. Blanc Mar 28 '18 at 10:33
  • It definitely is one of the more mature dataframe frameworks available in Java, I will give it that. That being said, the state of such tools in Java is kind of sad. I actually started playing around with ND4j last night, since it seems the closest to any of the Python scientific computing/data science tools. One suggestion to help elevate Tablesaw would be to integrate it with ND4j, similar to how Pandas works on top of NumPy. I do have some issues I can submit when I have more time. Thanks for taking the criticism the right way. – horcle_buzz Mar 28 '18 at 14:10
  • @horcle_buzz I also ended up looking into ND4j for the same reason. As cool as tablesaw seems, the published documentation is for v0.20, which is not compatible with the latest published maven version. Also, I am realizing that I may want to deploy code on a GPU, and it seems that nd4j will make that process a little easier. I also agree that the Pandas on Numpy analogy could make sense here with Tablesaw on ND4j. – TKH Jun 21 '18 at 22:53
  • @TKH Tablesaw v0.20 is now on Maven Central (or will be when MC updates; it can take a while). I looked at ND4J, but I'm not sure what it would do for Tablesaw. AFAICT it doesn't support non-numeric types, and its implementation as a single contiguous off-heap memory block makes it (seem) difficult to update efficiently (a row append would need to move all the data in the table after the first column). Tablesaw is limited to 2 dimensions and 2 billion rows per table, but otherwise how would ND4J make it better? For ML, Tablesaw integrates easily with Smile. – L. Blanc Jul 01 '18 at 14:58
15

I have just open-sourced a first draft version of Paleo, a Java 8 library which offers data frames based on typed columns (including support for primitive values). Columns can be created programmatically (through a simple builder API) or imported from a text file.
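
To illustrate what a typed column with primitive storage and a builder might look like in general, here is a minimal sketch in plain Java. It is not Paleo's API: the class `IntColumn` and its nested `Builder` are hypothetical names made up for this example; refer to the project's README for the real interface.

```java
import java.util.Arrays;

/** Hypothetical typed column backed by a primitive int array (no boxing). */
final class IntColumn {
    private final String name;
    private final int[] values;

    private IntColumn(String name, int[] values) {
        this.name = name;
        this.values = values;
    }

    /** Starts building a column with the given name. */
    static Builder builder(String name) {
        return new Builder(name);
    }

    String name() { return name; }
    int size() { return values.length; }

    /** Returns the value at index i as a primitive int, so no cast is needed. */
    int get(int i) { return values[i]; }

    /** Sums all values; the result is a long to reduce the risk of overflow. */
    long sum() {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Collects values one at a time and produces an immutable column. */
    static final class Builder {
        private final String name;
        private int[] buffer = new int[8];
        private int count = 0;

        private Builder(String name) {
            this.name = name;
        }

        Builder add(int value) {
            if (count == buffer.length) {
                buffer = Arrays.copyOf(buffer, count * 2); // grow when full
            }
            buffer[count++] = value;
            return this;
        }

        IntColumn build() {
            return new IntColumn(name, Arrays.copyOf(buffer, count));
        }
    }

    public static void main(String[] args) {
        IntColumn ages = IntColumn.builder("age").add(31).add(17).add(45).build();
        System.out.println(ages.name() + " has " + ages.size() + " values");
    }
}
```

Compared with storing everything as `Object`, a column of this kind lets the compiler check value types and avoids boxing each number.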

Please refer to the README for further details.

The project is still wet from birth – I am very interested in feedback / PRs, tia!

Rahel Lüthy
  • Nice, that is very much what I was looking for. Do you have any plans to extend the Paleo library with functions to e.g. sort the data records or keep only the ones with a certain property etc.? – Michael Feb 23 '16 at 07:56
  • I am glad you like it! Sorting and slicing are definitely on my list. Which one is more urgent for you? – Rahel Lüthy Feb 23 '16 at 15:51
  • I am interested in resampling a dataframe based on a Date column. Looking at Paleo if it offers this possibility... if not and if feasible for me I would add it! – Christophe Nov 11 '16 at 18:35
  • @netzwerg, how do I apply filters to a data frame based on column criteria using Paleo? – Steve Harrison Jul 28 '17 at 06:21
12

I also found myself in need of a data frame structure while working in Java recently. Fortunately, after writing a very basic implementation I was able to get approval to release it as open source. You can find my implementation here: Joinery -- Data frames for Java. Contributions and feature requests are welcome.

Bryan Cardillo
6

I am not very proficient with R, but you should have a look at Guava, specifically its Tables. They do not provide the exact functionality you want, but you could either extend them, or use their specification as a guide for writing your own Collection.

Ondrej Skopek
  • I haven't played with Guava's Tables before, but they seem very similar to R's dataframes. In particular, it *is* possible to extract a specific row or a specific column. On the other hand, it does not seem like there is an easy way to *add* a given row or column to a table. – raptortech97 Jul 15 '14 at 21:31
3

Morpheus (http://www.zavtech.com/morpheus/docs/) provides a DataFrame analogous to that of R. It is a high-performance column store data structure that enables data to be sorted, sliced, grouped, and aggregated in either the row or column dimension. It also supports parallel processing for many of these operations, using the Fork & Join framework internally.

You can easily read and write data to CSV files, databases, and a proprietary JSON format. Adapters to load data from Quandl, Google Finance, and others are also available.

It has built-in support for various styles of linear regression, principal component analysis, linear algebra, and other types of analytics. The feature set is still growing, but it is already a very capable framework.

0

R has the data.frame and Python has pandas; in Java, there is the Schema from deeplearning4j.

There is also an example that analyzes the ubiquitous iris data set, if you just want to get started, here

There are also other custom objects (from Weka, from TensorFlow) that are more or less the same.

moldovean