Questions tagged [filehash]

The filehash package implements a simple key-value style database in which character string keys are associated with data values stored on disk. A simple interface is provided for inserting, retrieving, and deleting data from the database. Utilities are also provided that let filehash databases be treated much as environments and lists already are in R. These utilities permit interactive and exploratory analysis on large datasets.

Working with large datasets in R can be cumbersome because objects must be kept in physical memory. While many might see that as a feature of the system, the need to keep whole objects in memory creates challenges for those who want to work interactively with large datasets. Here we take a simple definition of “large dataset” to be any dataset that cannot be loaded into R as a single R object because of memory limitations. For example, a data frame might be too large for all of its columns and rows to be loaded at once. In such a situation, one might load only a subset of the rows or columns, if that is possible.

The filehash package provides a full read-write implementation of a key-value database for R. The package does not depend on any external packages (beyond those provided in a standard R installation) or software systems and is written entirely in R, making it readily usable on most platforms. The filehash package represents a database as an instance of an S4 class and operates directly on the S4 object via various methods.
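
The insert, retrieve, and delete interface described above might look roughly like the following in R. This is a minimal sketch, assuming the filehash package is installed; the database path and the keys "a" and "b" are hypothetical names chosen for illustration.

```r
library(filehash)

# Hypothetical database location; a temporary path keeps the sketch self-contained.
db_path <- tempfile("example_db")

dbCreate(db_path)       # create the on-disk database
db <- dbInit(db_path)   # obtain a handle to it (an S4 object)

dbInsert(db, "a", rnorm(100))   # store a value on disk under key "a"
dbInsert(db, "b", 1:10)         # store a second value under key "b"

x <- dbFetch(db, "a")           # read the value for "a" back from disk
has_b <- dbExists(db, "b")      # check whether key "b" is present

dbDelete(db, "b")               # remove key "b" and its value
keys <- dbList(db)              # remaining keys in the database

# Environment-style access to the same stored value.
y <- db$a
```

Only the handle `db` stays in memory; each value is read from disk when it is fetched, which is what makes this style workable for objects that do not fit in memory together.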

Text adapted from: Peng, Roger, "INTERACTING WITH DATA USING THE FILEHASH PACKAGE FOR R" (June 2006). Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 108. http://biostats.bepress.com/jhubiostat/paper108 & http://cran.r-project.org/web/packages/filehash/vignettes/filehash.pdf

21 questions
8
votes
1 answer

Interactively work with list objects that take up massive memory

I have recently discovered the wonders of the packages bigmemory, ff and filehash to handle very large matrices. How can I handle very large (300MB++) lists? In my work I work with these lists all day every day. I can do a band-aid solution with…
Jase
  • 1,025
  • 1
  • 9
  • 34
7
votes
0 answers

difference between ff and filehash package in R

I have a data frame composed of 25 columns and ~1M rows, split into 12 files. I need to import them and then use a reshape package to do some data management. Each file is so large that I have to look for some "non-RAM" solution for importing and…
lokheart
  • 23,743
  • 39
  • 98
  • 169
4
votes
1 answer

working with large lists that become too big for RAM when operated on

Short of working on a machine with more RAM, how can I work with large lists in R, for example put them on disk and then work on sections of it? Here's some code to generate the type of lists I'm using n = 50; i = 100 WORD <- vector(mode =…
Ben
  • 41,615
  • 18
  • 132
  • 227
2
votes
1 answer

What is the fastest way to get unique file hash using Java?

I want to write a program for personal use that walks the file tree of all of my volumes for the purpose of finding duplicate files. I know there are programs out there that do this, but none do it the way I want to do it, and few seem to ever…
Michael Sims
  • 2,360
  • 1
  • 16
  • 29
2
votes
1 answer

How do I check the filehash of a file thats online in PowerShell?

So well, I am making a pull request to Chris Titus Tech's Ultimate Windows Toolkit, and I wanna make something that checks if it's updated. But when I try running: Get-FileHash -Algorithm SHA256…
fg_
  • 59
  • 1
  • 8
2
votes
1 answer

Calculate the hash of a file longer than 256 characters?

I am using Boe Prox's script to print a list of all the files and folders in a directory. I need Prox's script (as opposed to other windows print directory commands) because it uses robocopy to print filepaths longer than 260 characters. My problem…
oymonk
  • 427
  • 9
  • 27
2
votes
2 answers

What is a python "cksum" equivalent for very large files and how does it work?

I need to validate huge compressed files after download (usually more than 10-20 GB per file) against reference checksums that have apparently been generated using cksum (To be more precise: My Python script needs to download…
jov14
  • 139
  • 9
2
votes
1 answer

Constructing model.matrix in R cannot fit in memory (tried all memory-mapping packages)

I am trying to estimate an lm() fitment in R for a large sales dataset. The data itself is not so large that R cannot handle it; about 250MB in memory. The problem is when lm() is invoked to include all variables and cross-terms, the construction of…
Ryan Price
  • 131
  • 8
2
votes
1 answer

How can I save results in a list in a memory efficient way?

In my current project I have a calculation function that runs on one element of a vector A and returns a list element that I insert into list B. The return element contains a number of large arbitrarily sized matrices that relate to the first…
Jon M
  • 1,157
  • 1
  • 10
  • 16
1
vote
2 answers

Find Duplicate Files with hash and length, but use other algorithm

I'm trying to find duplicate files on my computer, using length and hash to speed up the process. Someone told me I can make my code faster by changing the hashing algorithm to MD5, but I don't know where I have to write that. I copied…
1
vote
1 answer

Powershell - Implementing looping to access elements in a hash algorithm

The function I wrote here accepts three mandatory parameters: an input file, a list containing at least one hash algorithm(s), and an output file that saves the hash values of that input file. This function attempts to accept three needed…
sierra117
  • 39
  • 7
1
vote
0 answers

Typescript Library : Incremental Hashing of Huge file at client side

I need a TypeScript/JavaScript library that supports hashing large files. It should hash in chunks rather than loading the whole file into memory at once. I was previously using MD5 hashing, but it…
Usman
  • 2,742
  • 4
  • 44
  • 82
1
vote
1 answer

Compare File Hash in PowerShell

I'm very new to Powershell, but am attempting to write a simple function to compare two files using their hashes. I'm getting some unexpected results using the following : $hash1 = Get-FileHash $source | Select-Object Hash Write-Host(" hash1 : "…
Phil S
  • 123
  • 8
1
vote
1 answer

How can I simply check whether two Excel files are the same, or not

I don't want to know WHAT are the differences, I just want to know "Y/N Are these sheets identical?" Unfortunately, superficially Hashing the file doesn't answer that :( Specifically ... I took an .XLSX file, and file-copied it. Compared hashes ...…
Brondahl
  • 7,402
  • 5
  • 45
  • 74
1
vote
1 answer

How can I find out what sort of Hash is being returned by a CKAN resource record?

Example record: "resources": [ { "cache_last_updated": null, "cache_url": null, "mimetype_inner": "", "hash": "9d599bcf3b8db2b5c6aea528bc37d728c856b09c", "description": "CSV file extracted and cleaned…
Frames Catherine White
  • 27,368
  • 21
  • 87
  • 137