Questions tagged [huggingface-datasets]

Use this tag for questions related to the datasets project from Hugging Face. [Project on GitHub][1]
[1]: https://github.com/huggingface/datasets

221 questions
10 votes · 2 answers
How do I save a Huggingface dataset?
How do I write a HuggingFace dataset to disk?
I have made my own HuggingFace dataset using a JSONL file:
Dataset({
    features: ['id', 'text'],
    num_rows: 18
})
I would like to persist the dataset to disk.
Is there a preferred way to do this? Or, is…

Campbell Hutcheson · 549

9 votes · 1 answer
Convert pandas DataFrame to DatasetDict
I cannot find anywhere how to convert a pandas DataFrame to type datasets.dataset_dict.DatasetDict, for optimal use in a BERT workflow with a Huggingface model. Take these simple dataframes, for example.
train_df = pd.DataFrame({
"label" : [1,…

ADF · 522

6 votes · 1 answer
StableDiffusion Colab - How to "make sure you're logged in with `huggingface-cli login`?"
I'm trying to run the Colab example of the Huggingface StableDiffusion generative text-to-image…
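The usual flow (sketched below; the token must be created by you at the URL shown, with at least read scope) is to install `huggingface_hub` and run the login command in a cell:

```shell
# in a Colab cell (prefix each line with "!" to run it as a shell command);
# create an access token at https://huggingface.co/settings/tokens
pip install --quiet huggingface_hub
huggingface-cli login
```

`from huggingface_hub import notebook_login; notebook_login()` is a notebook-friendly alternative that shows a token prompt inline.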

Twenkid · 825

6 votes · 1 answer
How do I convert Pandas DataFrame to a Huggingface Dataset object?
I have the following df:
import pandas as pd
df = pd.DataFrame({"foo": ["bar", "baz"]})
How do I convert to a Huggingface Dataset?

Vincent Claes · 3,960

5 votes · 3 answers
Add new column to a HuggingFace dataset
My dataset has 5000000 rows, and I would like to add a column called 'embeddings' to it.
dataset = dataset.add_column('embeddings', embeddings)
The variable embeddings is a numpy memmap array of shape (5000000, 512).
But I get this…

albero · 169

5 votes · 1 answer
How to convert tokenized words back to the original ones after inference?
I'm writing an inference script for an already trained NER model, but I'm having trouble converting encoded tokens (their IDs) back into the original words.
# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks…
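The usual trick is that fast tokenizers can return `offset_mapping` (per-token character spans into the original string), and slicing the raw text with those spans recovers the surface form. A sketch with hand-written offsets standing in for real tokenizer output:

```python
# offsets as a fast tokenizer would return them with return_offsets_mapping=True;
# the (start, end) pairs below are illustrative, not real tokenizer output
text = "Amazon and Tesla are currently the best picks"
offsets = [(0, 6), (7, 10), (11, 16)]

# slicing the original text with each span recovers the original words
recovered = [text[start:end] for start, end in offsets]
```

With a real fast tokenizer, `encoding.word_ids()` similarly maps each subword token back to its source word index.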

deonardo_licaprio · 308

4 votes · 1 answer
Labeling model with Huggingface Dataset
I have the following code
from scipy.spatial.distance import dice, directed_hausdorff
from sklearn.metrics import f1_score
from segments import SegmentsClient
from segments import SegmentsDataset
from datasets import load_dataset
from…

Norhther · 545

4 votes · 1 answer
How to drop sentences that are too long in Huggingface?
I'm going through the Huggingface tutorial, and it appears that the library has automatic truncation to cut sentences that are too long, based on a max value or other criteria.
How can I remove sentences for the same reasoning (sentences are too long,…

Penguin · 1,923

4 votes · 0 answers
max_steps and generative dataset huggingface
I am fine-tuning a model on my domain using both MLM and NSP. I am using TextDatasetForNextSentencePrediction for NSP and DataCollatorForLanguageModeling for MLM.
The problem is with TextDatasetForNextSentencePrediction as it loads everything in…

Prasanna · 4,125

3 votes · 1 answer
How to use sample_by="document" argument with load_dataset in Huggingface Dataset?
Problem
Hello. I am trying to use Huggingface to do some malware classification. I have 5738 malware binaries in a directory. The paths to these malware binaries are stored in a list called files. I am trying to load these binaries into a…

Luke Kurlandski · 81

3 votes · 1 answer
How to create a dataset object for multiple text inputs to the SetFit model?
The SetFit library accepts two inputs: "text" and "label", https://huggingface.co/blog/setfit
My goal is to train SetFit using two similarity inputs with a binary label (similar or not similar). ("text1","text2","similar/not")
The example of dataset…

wenz · 61

3 votes · 1 answer
Using huggingface load_dataset in Google Colab notebook
I am trying to load a training dataset in my Google Colab notebook but keep getting an error. This happens exclusively in Colab; when I run the same notebook in VS Code, the dataset loads without a problem.
Here is the code snippet which returns the…

Luiz Felipe Bromfman · 31

3 votes · 1 answer
Cast features to ClassLabel
I have a dictionary which I converted to a Dataset:
ds = datasets.Dataset.from_dict(bio_dict)
The shape now is:
Dataset({
    features: ['id', 'text', 'ner_tags', 'input_ids', 'attention_mask', 'label'],
    num_rows: 8805
})
When I…

Yana · 785

3 votes · 0 answers
Huggingface datasets storing and loading image data
I have a huggingface dataset with an image column
ds["image"][0]
When I save it to disk and load it later, I get the image column as…

Vincent Claes · 3,960

3 votes · 1 answer
Predict over a whole dataset using Transformers
I'm trying to do zero-shot classification over a dataset with 5000 records. Right now I'm using a normal Python loop, but it is going painfully slow. Is there a way to speed up the process using Transformers or Datasets structures? This is how my code…

ignacioct · 325