Questions tagged [huggingface-datasets]

Use this tag for questions related to the datasets project from Hugging Face. Project on GitHub: https://github.com/huggingface/datasets

221 questions
10
votes
2 answers

How do I save a Huggingface dataset?

How do I write a HuggingFace dataset to disk? I have made my own HuggingFace dataset using a JSONL file: Dataset({ features: ['id', 'text'], num_rows: 18 }) I would like to persist the dataset to disk. Is there a preferred way to do this? Or, is…
Campbell Hutcheson
  • 549
  • 2
  • 4
  • 12
9
votes
1 answer

Convert pandas dataframe to datasetDict

I cannot find anywhere how to convert a pandas dataframe to type datasets.dataset_dict.DatasetDict, for optimal use in a BERT workflow with a huggingface model. Take these simple dataframes, for example. train_df = pd.DataFrame({ "label" : [1,…
ADF
  • 522
  • 6
  • 14
6
votes
1 answer

StableDiffusion Colab - How to "make sure you're logged in with `huggingface-cli login`?"

I'm trying to run the Colab example of the Huggingface StableDiffusion generative text-to-image…
Twenkid
  • 825
  • 7
  • 15
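A sketch of the usual fix, assuming the `huggingface_hub` CLI is installed in the Colab environment: generate an access token on the website, then authenticate once so later library calls find the cached token:

```shell
# Create a token at https://huggingface.co/settings/tokens, then run:
huggingface-cli login
# Paste the token when prompted; it is cached for subsequent calls.
```

In a notebook cell, `from huggingface_hub import notebook_login; notebook_login()` provides the same prompt interactively.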
6
votes
1 answer

How do I convert Pandas DataFrame to a Huggingface Dataset object?

I have the following df: import pandas as pd df = pd.DataFrame({"foo": ["bar", "baz"]}) How do I convert to a Huggingface Dataset?
Vincent Claes
  • 3,960
  • 3
  • 44
  • 62
5
votes
3 answers

Add new column to a HuggingFace dataset

In the dataset I have 5000000 rows, I would like to add a column called 'embeddings' to my dataset. dataset = dataset.add_column('embeddings', embeddings) The variable embeddings is a numpy memmap array of size (5000000, 512). But I get this…
albero
  • 169
  • 2
  • 9
5
votes
1 answer

How to convert tokenized words back to the original ones after inference?

I'm writing an inference script for an already trained NER model, but I have trouble converting encoded tokens (their ids) back into the original words. # example input df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks…
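A library-free sketch of the idea, assuming a BERT-style WordPiece tokenizer where continuation pieces carry a leading `##` (the token strings here are illustrative):

```python
def merge_wordpieces(tokens):
    """Re-join BERT-style WordPiece tokens into whole words.

    Continuation pieces are marked with a leading '##'; gluing each one
    onto the previous token reconstructs the original words.
    """
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

print(merge_wordpieces(["Ama", "##zon", "and", "Tes", "##la"]))
# ['Amazon', 'and', 'Tesla']
```

With a fast tokenizer from `transformers`, the robust route is `encoding.word_ids()` or `return_offsets_mapping=True`, which map each token back to its source word or character span directly.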
4
votes
1 answer

Labeling model with huggingface Dataset

I have the following code from scipy.spatial.distance import dice, directed_hausdorff from sklearn.metrics import f1_score from segments import SegmentsClient from segments import SegmentsDataset from datasets import load_dataset from…
Norhther
  • 545
  • 3
  • 15
  • 35
4
votes
1 answer

How to drop sentences that are too long in Huggingface?

I'm going through the Huggingface tutorial and it appears the library has automatic truncation, to cut sentences that are too long, based on a max value or other criteria. How can I remove sentences for the same reason (sentences are too long,…
4
votes
0 answers

max_steps and generative dataset huggingface

I am fine-tuning a model on my domain using both MLM and NSP. I am using the TextDatasetForNextSentencePrediction for NSP and DataCollatorForLanguageModeling for MLM. The problem is with TextDatasetForNextSentencePrediction as it loads everything in…
3
votes
1 answer

How to use sample_by="document" argument with load_dataset in Huggingface Dataset?

Problem Hello. I am trying to use huggingface to do some malware classification. I have 5738 malware binaries in a directory. The paths to these malware binaries are stored in a list called files. I am trying to load these binaries into a…
3
votes
1 answer

How to create a dataset object with multiple text inputs for the SetFit model?

The SetFit library accepts two inputs: "text" and "label", https://huggingface.co/blog/setfit My goal is to train SetFit using two similarity inputs with a binary label (similar or not similar). ("text1","text2","similar/not") The example of dataset…
3
votes
1 answer

Using huggingface load_dataset in Google Colab notebook

I am trying to load a training dataset in my Google Colab notebook but keep getting an error. This happens exclusively in Colab, since when I run the same notebook in VS Code there is no problem in loading. Here is the code snippet which returns the…
3
votes
1 answer

Cast features to ClassLabel

I have a dataset with type dictionary which I converted to Dataset: ds = datasets.Dataset.from_dict(bio_dict) The shape now is: Dataset({ features: ['id', 'text', 'ner_tags', 'input_ids', 'attention_mask', 'label'], num_rows: 8805 }) When I…
Yana
  • 785
  • 8
  • 23
3
votes
0 answers

Huggingface datasets storing and loading image data

I have a huggingface dataset with an image column ds["image"][0] When I save it to disk and load it later, I get the image column as…
Vincent Claes
  • 3,960
  • 3
  • 44
  • 62
3
votes
1 answer

Predict over a whole dataset using Transformers

I'm trying to do zero-shot classification over a dataset with 5000 records. Right now I'm using a normal Python loop, but it is going painfully slow. Is there a way to speed up the process using Transformers or Datasets structures? This is how my code…