Questions tagged [cublas]

The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library for use with CUDA-capable GPUs.

The cuBLAS library is an implementation of the standard BLAS (Basic Linear Algebra Subprograms) API on top of the NVIDIA CUDA runtime.

Since CUDA 4.0, the library has contained implementations of all 152 standard BLAS routines, supporting single-precision real and complex arithmetic on all CUDA-capable devices, and double-precision real and complex arithmetic on those devices with double-precision support. The library includes host API bindings for C and Fortran, and CUDA 5.0 introduced a device API for use from within CUDA kernels.

The library is shipped in every version of the CUDA toolkit and has a dedicated homepage at http://developer.nvidia.com/cuda/cublas.
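
For readers new to the library, the host API follows a create-handle / copy-in / compute / copy-out pattern. A minimal sketch (error checking omitted for brevity; assumes the cublas_v2.h and cuda_runtime.h headers shipped with the toolkit):

```c
// saxpy_cublas.c - minimal cuBLAS host-API sketch: y = alpha*x + y on the GPU
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 4;
    const float alpha = 2.0f;
    float h_x[] = {1, 2, 3, 4}, h_y[] = {10, 20, 30, 40};

    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);                                 // every cuBLAS call needs a handle

    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);     // host -> device copies
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);        // y = alpha*x + y

    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);     // device -> host copy
    for (int i = 0; i < n; ++i) printf("%f\n", h_y[i]);    // expect 12 24 36 48

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```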

330 questions
53 votes · 10 answers

Tensorflow crashes with CUBLAS_STATUS_ALLOC_FAILED

I'm running tensorflow-gpu on Windows 10 with a simple MNIST neural network program. When it tries to run, it encounters a CUBLAS_STATUS_ALLOC_FAILED error. A Google search doesn't turn up anything. I…
Axiverse • 1,589 • 3 • 14 • 30
24 votes · 2 answers

Clarification of the leading dimension in CUBLAS when transposing

For a matrix A, the documentation only states that the corresponding leading dimension parameter lda refers to the "leading dimension of two-dimensional array used to store the matrix A". Thus I presume this is just the number of rows of A, given…
mchen • 9,808 • 17 • 72 • 125
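
For background on what answers to this kind of question typically point out: lda describes how the array is stored (the stride in elements between consecutive columns in cuBLAS's column-major layout), and it does not change when CUBLAS_OP_T is requested. A small hypothetical helper to illustrate:

```c
#include <cublas_v2.h>

/* Hypothetical helper: computes C = A^T * A for a rows x cols column-major
 * matrix A already resident on the device. lda is the number of rows A was
 * stored with; it stays the same under CUBLAS_OP_T, because the transpose is
 * a property of the operation, not of the storage layout. */
void gram_matrix(cublasHandle_t handle, const float *d_A, int rows, int cols, float *d_C)
{
    const float alpha = 1.0f, beta = 0.0f;
    const int lda = rows;                      /* leading dimension of the stored array */
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                cols, cols, rows,              /* dimensions of op(A) * op(B) */
                &alpha, d_A, lda,              /* lda stays = rows under OP_T */
                d_A, lda,
                &beta, d_C, cols);             /* C is cols x cols, so ldc = cols */
}
```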
17 votes · 1 answer

non-square C-order matrices in cuBLAS (numba)

I'm trying to use the cuBLAS functions in Anaconda's Numba package and am having an issue. I need the input matrices to be in C order. The output can be in Fortran order. I can run the example script provided with the package, here. The script has two…
user1554752 • 707 • 2 • 10 • 24
17 votes · 7 answers

tensorflow running error with cublas

After successfully installing TensorFlow on the cluster, I immediately ran the MNIST demo to check that everything was working, but I ran into a problem. I don't know what this is all about, but it looks like the error is coming from CUDA: python3 -m…
Pengqi Lu • 231 • 1 • 3 • 5
17 votes · 3 answers

Could a CUDA kernel call a cublas function?

I know it sounds weird, but here is my scenario: I need to do a matrix-matrix multiplication (A(n*k)*B(k*n)), but I only need the diagonal elements of the output matrix to be evaluated. I searched the cuBLAS library and didn't find any level 2 or 3…
Hailiang Zhang • 17,604 • 23 • 71 • 117
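
Since only the diagonal of A(n×k)·B(k×n) is needed, a common suggestion is to skip GEMM entirely and compute the n dot products with a plain CUDA kernel. A hypothetical sketch, assuming row-major A and B already resident on the device:

```c
// Hypothetical sketch: compute only diag(A*B) where A is n x k and B is k x n,
// both stored row-major on the device. One thread per diagonal element.
__global__ void diag_of_product(const float *A, const float *B,
                                float *diag, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.0f;
        for (int j = 0; j < k; ++j)
            sum += A[i * k + j] * B[j * n + i];   // row i of A dot column i of B
        diag[i] = sum;
    }
}

// Launch example (assumes d_A, d_B, d_diag already allocated and filled):
// diag_of_product<<<(n + 255) / 256, 256>>>(d_A, d_B, d_diag, n, k);
```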
16 votes · 2 answers

Simple CUBLAS Matrix Multiplication Example?

I'm looking for a very bare bones matrix multiplication example for CUBLAS that can multiply M times N and place the results in P for the following code, using high-performance GPU operations: float M[500][500], N[500][500], P[500][500]; for(int i =…
Chris Redford • 16,982 • 21 • 89 • 109
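
A minimal sketch of what such an example typically looks like. One caveat: cuBLAS assumes column-major storage, so interpreting the row-major C arrays as column-major transposes them; computing N·M in that interpretation leaves M·N in row-major order in P. The helper name gpu_matmul and the omission of error checking are assumptions made for brevity:

```c
// Hypothetical sketch: P = M * N for square row-major float arrays via cublasSgemm.
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define DIM 500

void gpu_matmul(const float *h_M, const float *h_N, float *h_P)
{
    size_t bytes = (size_t)DIM * DIM * sizeof(float);
    float *d_M, *d_N, *d_P;
    cudaMalloc((void**)&d_M, bytes);
    cudaMalloc((void**)&d_N, bytes);
    cudaMalloc((void**)&d_P, bytes);
    cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Operand order is N first, then M: in the column-major interpretation this
    // computes (M*N)^T, which is exactly M*N when read back as row-major.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, DIM, DIM, DIM,
                &alpha, d_N, DIM, d_M, DIM, &beta, d_P, DIM);

    cudaMemcpy(h_P, d_P, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);
}

// Usage with the arrays from the question: gpu_matmul(&M[0][0], &N[0][0], &P[0][0]);
```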
15 votes · 1 answer

First tf.session.run() performs dramatically different from later runs. Why?

Here's an example to clarify what I mean: First session.run(): First run of a TensorFlow session Later session.run(): Later runs of a TensorFlow session I understand TensorFlow is doing some initialization here, but I'd like to know where in the…
12 votes · 3 answers

ValueError: libcublas.so.*[0-9] not found in the system path

I'm trying to import and use the ultralytics library in my Django REST Framework project. I use poetry as my dependency manager; I installed ultralytics using poetry add ultralytics, and on trying to import the library in my code I receive this…
12 votes · 2 answers

Matrix-vector multiplication in CUDA: benchmarking & performance

I'm updating my question with some new benchmarking results (I also reformulated the question to be more specific and I updated the code)... I implemented a kernel for matrix-vector multiplication in CUDA C following the CUDA C Programming Guide…
Pantelis Sopasakis • 1,902 • 5 • 26 • 45
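
For reference, the library routine such hand-written kernels are usually benchmarked against is cublasSgemv. A hypothetical wrapper (column-major A assumed; the helper name is illustrative only):

```c
#include <cublas_v2.h>

/* Hypothetical baseline for the benchmark: y = A*x with cublasSgemv,
 * where d_A is an m x n matrix stored column-major on the device. */
void gemv_baseline(cublasHandle_t handle, const float *d_A, const float *d_x,
                   float *d_y, int m, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, d_A, m,            /* lda = m for column-major storage */
                d_x, 1, &beta, d_y, 1);
}
```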
11 votes · 1 answer

Asynchronous cuBLAS calls

I want to make calls to cuBLAS routines asynchronously. Is it possible? If yes, how can I achieve that?
user1439690 • 659 • 1 • 11 • 26
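
The short version of the usual answers: kernel launches made by cuBLAS are already asynchronous with respect to the host, and cublasSetStream controls which stream subsequent calls are issued into, so independent calls can overlap. A hypothetical sketch (the helper name and the SAXPY workload are chosen only for illustration):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Issue two independent SAXPYs on separate streams so they can overlap. */
void async_saxpys(cublasHandle_t handle, int n, const float *alpha,
                  float *d_x1, float *d_y1, float *d_x2, float *d_y2)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cublasSetStream(handle, s1);
    cublasSaxpy(handle, n, alpha, d_x1, 1, d_y1, 1);   /* queued on s1 */

    cublasSetStream(handle, s2);
    cublasSaxpy(handle, n, alpha, d_x2, 1, d_y2, 1);   /* queued on s2 */

    /* ... host is free to do other work here ... */
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```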
10 votes · 1 answer

How to transpose a matrix in an optimal way using blas?

I'm doing some calculations and analyzing the strengths and weaknesses of different BLAS implementations. However, I have come across a problem. I'm testing cuBLAS; doing linear algebra on the GPU would seem like a good idea, but there is one…
Martin Kristiansen • 9,875 • 10 • 51 • 83
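
The usual suggestion for an out-of-place transpose on the GPU is cublasSgeam, a cuBLAS extension rather than standard BLAS. A hypothetical sketch, assuming a column-major m×n input already on the device:

```c
#include <cublas_v2.h>

/* Hypothetical sketch: out-of-place transpose B = A^T with cublasSgeam.
 * A is m x n, column-major; d_B receives the n x m transpose. */
void transpose(cublasHandle_t handle, const float *d_A, float *d_B, int m, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, m,                   /* dimensions of the result */
                &alpha, d_A, m,         /* op(A) = A^T, A stored with lda = m */
                &beta,  d_B, n,         /* B term is not read since beta = 0 */
                d_B, n);
}
```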
10 votes · 1 answer

cublasSetVector() vs cudaMemcpy()

I am wondering if there is a difference between: // cumalloc.c - Create a device on the device HOST float * cudamath_vector(const float * h_vector, const int m) { float *d_vector = NULL; cudaError_t cudaStatus; cublasStatus_t cublasStatus; …
Stefan Falk • 23,898 • 50 • 191 • 378
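
For contiguous (unit-stride) data the two calls move the same bytes; cublasSetVector mainly adds element-size and stride bookkeeping and reports failures as a cublasStatus_t rather than a cudaError_t. A small sketch showing both side by side (the helper name is illustrative only):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Both calls below copy n contiguous floats host -> device. */
void copy_both_ways(const float *h_vec, float *d_vec, int n)
{
    /* cuBLAS helper: n elements of sizeof(float), stride 1 on both sides */
    cublasStatus_t bstat = cublasSetVector(n, sizeof(float), h_vec, 1, d_vec, 1);

    /* Plain CUDA runtime equivalent for contiguous data */
    cudaError_t cstat = cudaMemcpy(d_vec, h_vec, n * sizeof(float),
                                   cudaMemcpyHostToDevice);

    (void)bstat; (void)cstat;   /* real code would check both status values */
}
```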
10 votes · 5 answers

Equivalent of cudaGetErrorString for cuBLAS?

CUDA runtime has a convenience function cudaGetErrorString(cudaError_t error) that translates an error enum into a readable string. cudaGetErrorString is used in the CUDA_SAFE_CALL(someCudaFunction()) macro that many people use for CUDA error…
solvingPuzzles • 8,541 • 16 • 69 • 112
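
Historically cuBLAS shipped no direct counterpart, so the common workaround is a hand-rolled switch over the cublasStatus_t enum (recent toolkits also document a cublasGetStatusString(), so check your cuBLAS version first). A sketch of the fallback:

```c
#include <cublas_v2.h>

/* Hand-rolled fallback mapping cublasStatus_t values to readable strings,
 * in the spirit of cudaGetErrorString. */
static const char *cublas_error_string(cublasStatus_t status)
{
    switch (status) {
        case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
        case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
        case CUBLAS_STATUS_ALLOC_FAILED:     return "CUBLAS_STATUS_ALLOC_FAILED";
        case CUBLAS_STATUS_INVALID_VALUE:    return "CUBLAS_STATUS_INVALID_VALUE";
        case CUBLAS_STATUS_ARCH_MISMATCH:    return "CUBLAS_STATUS_ARCH_MISMATCH";
        case CUBLAS_STATUS_MAPPING_ERROR:    return "CUBLAS_STATUS_MAPPING_ERROR";
        case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
        case CUBLAS_STATUS_INTERNAL_ERROR:   return "CUBLAS_STATUS_INTERNAL_ERROR";
        default:                             return "unknown cuBLAS status";
    }
}
```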
8 votes · 1 answer

Why cuSparse is much slower than cuBlas for sparse matrix multiplication

Recently, when I used cuSPARSE and cuBLAS in CUDA Toolkit 6.5 to do sparse matrix multiplication, I found cuSPARSE to be much slower than cuBLAS in all cases! In all my experiments, I used cusparseScsrmm in cuSPARSE and cublasSgemm in cuBLAS. In the…
ROBOT AI • 1,217 • 3 • 16 • 27
8 votes · 2 answers

cuBLAS synchronization best practices

I read two posts on Stack Overflow, namely Will the cublas kernel functions automatically be synchronized with the host? and CUDA Dynamic Parallelizm; stream synchronization from device and they recommend the use of some synchronization API, e.g.,…
Pantelis Sopasakis • 1,902 • 5 • 26 • 45
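
The gist of the usual answers: a cuBLAS call returns control to the host before its kernel finishes, so synchronize the stream the handle is bound to before timing the call or reading its result. A hypothetical sketch using a device-side dot product (the helper name is an assumption for illustration):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Launch a dot product asynchronously, then wait for it before reading back. */
void dot_then_sync(cublasHandle_t handle, cudaStream_t stream,
                   const float *d_x, const float *d_y, int n, float *h_result)
{
    float *d_result;
    cudaMalloc((void**)&d_result, sizeof(float));

    cublasSetStream(handle, stream);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE); /* result stays on the GPU */
    cublasSdot(handle, n, d_x, 1, d_y, 1, d_result);          /* returns before completion */

    cudaStreamSynchronize(stream);                            /* wait for the kernel */
    cudaMemcpy(h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
}
```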