Questions tagged [blaze]

Blaze is a NumPy/Pandas-like interface to data analytics, developed by Continuum Analytics.

Blaze is intended to provide an expressive, compact set of foundational abstractions for composing computations over large amounts of semi-structured data.

81 questions
164 votes, 8 answers

How to read a Parquet file into Pandas DataFrame?

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple…
Daniel Mahler
17 votes, 5 answers

Python particles simulator: out-of-core processing

Problem description: I am writing a Monte Carlo particle simulator (Brownian motion and photon emission) in Python/NumPy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and…
user2304916
10 votes, 4 answers

Blaze with Scikit Learn K-Means

I am trying to fit a Blaze data object to the scikit-learn KMeans function. from blaze import * from sklearn.cluster import KMeans data_numeric = Data('data.csv') data_cluster = KMeans(n_clusters=5) data_cluster.fit(data_numeric) Data Sample: A B C 1 32…
sachin saxena
8 votes, 1 answer

pydata blaze: does it allow parallel processing or not?

I am looking to parallelise numpy or pandas operations. For this I have been looking into pydata's blaze. My understanding was that seamless parallelisation was its major selling point. Unfortunately I have been unable to find an operation that runs…
ARF
8 votes, 0 answers

What are the most robust and interactive-friendly ways to structure general 2D/3D/ND datasets in Python?

I am a scientist recently converted from MATLAB to Python. I am looking for ways to structure my (mainly 2D and 3D) datasets. I have searched the net quite a bit, and it seems to me that robust and general-purpose data structuring in Python is still…
cmeeren
7 votes, 2 answers

Where is the pydata BLAZE project heading?

I find the blaze ecosystem* amazing because it covers most of the data engineering use cases. There was definitely a lot of interest in these projects during the period 2015-2016, but of late it has been ignored. I say this looking at the commits on…
human
7 votes, 0 answers

Streaming results with Blaze and SqlAlchemy

I am trying to use Blaze/Odo to read a large (~70M rows) result set from Redshift. By default SqlAlchemy will try to read the whole result into memory before starting to process it. This can be prevented by either…
Daniel Mahler
7 votes, 1 answer

Choosing a framework for larger than memory data analysis with python

I'm solving a problem with a dataset that is larger than memory. The original dataset is a .csv file. One of the columns is for track IDs from the musicbrainz service. What I already did: I read the .csv file with dask and converted it to castra…
Nagasaki45
6 votes, 3 answers

calling SQL functions from Blaze

In particular, I would like to call the Postgres levenshtein function. I would like to write a Blaze query that returns words similar to the word 'similar', i.e. the equivalent of: select word from wordtable where levenshtein(word, 'similar') < 3; In…
Daniel Mahler
5 votes, 4 answers

Delete column(s) from very large CSV file using pandas or blaze

I have a very large csv file (5 GB), so I do not want to load the whole thing into memory, and I want to delete one or more of its columns. I tried using the following code in blaze, but all it did was append the resulting columns to the existing…
Alex
5 votes, 1 answer

How to provide user defined function for python blaze with sqlite backend?

I connect to a SQLite database in Blaze using df = bz.Data("sqlite:///) and everything works fine, but I do not know how to provide user-defined functions in my interaction with df. I have a column called IP in df which is text containing IP…
Kshadi
5 votes, 1 answer

Using odo to migrate data to SQL

I have a large 3 GB CSV file, and I'd like to use Blaze to investigate the data and select down to the data I'm interested in analyzing, with the eventual goal of migrating that data into a suitable computational backend such as SQLite, PostgreSQL etc.…
Joseph
5 votes, 1 answer

What are "synthetic dimensions" in Blaze?

The Blaze readme (here https://github.com/ContinuumIO/blaze) describes a number of improvements over NumPy including "Synthetic Dimensions". I have searched around but have been unable to find out what they are. Could someone enlighten me? Thanks.
4 votes, 1 answer

access data in sharded JSON files on S3 from Blaze

I am trying to access line-delimited JSON data on S3. From my understanding of the docs, I should be able to do something like print data(S3(Chunks(JSONLines))('s3://KEY:SECRET@bucket/dir/part-*.json').peek() which throws BotoClientError:…
Daniel Mahler
4 votes, 2 answers

Index million-row square matrix for fast access

I have some very large matrices (say, on the order of a million rows) that I cannot keep in memory, and I need to access subsamples of these matrices in decent time (less than a minute...). I started looking at hdf5 and blaze in…
fransua