Questions tagged [huggingface-datasets]

Use this tag for questions related to the datasets project from Hugging Face. [Project on GitHub][1]
[1]: https://github.com/huggingface/datasets

221 questions
10 votes · 2 answers
How do I save a Huggingface dataset?
How do I write a HuggingFace dataset to disk?
I have made my own HuggingFace dataset using a JSONL file:
Dataset({
    features: ['id', 'text'],
    num_rows: 18
})
I would like to persist the dataset to disk.
Is there a preferred way to do this? Or, is…

Campbell Hutcheson · 549

9 votes · 1 answer
Convert pandas DataFrame to DatasetDict
I cannot find anywhere how to convert a pandas DataFrame to type datasets.dataset_dict.DatasetDict, for optimal use in a BERT workflow with a Huggingface model. Take these simple dataframes, for example.
train_df = pd.DataFrame({
"label" : [1,…

ADF · 522

6 votes · 1 answer
StableDiffusion Colab - How to "make sure you're logged in with `huggingface-cli login`?"
I'm trying to run the Colab example of the Huggingface StableDiffusion generative text-to-image…
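The usual flow (sketched below; the token must be created by you at the URL shown, with at least read scope) is to install `huggingface_hub` and run the login command in a cell:

```shell
# in a Colab cell (prefix each line with "!" to run it as a shell command);
# create an access token at https://huggingface.co/settings/tokens
pip install --quiet huggingface_hub
huggingface-cli login
```

`from huggingface_hub import notebook_login; notebook_login()` is a notebook-friendly alternative that shows a token prompt inline.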

Twenkid · 825

6 votes · 1 answer
How do I convert Pandas DataFrame to a Huggingface Dataset object?
I have the following df:
import pandas as pd
df = pd.DataFrame({"foo": ["bar", "baz"]})
How do I convert to a Huggingface Dataset?

Vincent Claes · 3,960

5 votes · 3 answers
Add new column to a HuggingFace dataset
My dataset has 5000000 rows, and I would like to add a column called 'embeddings' to it.
dataset = dataset.add_column('embeddings', embeddings)
The variable embeddings is a numpy memmap array of shape (5000000, 512).
But I get this…

albero · 169

5 votes · 1 answer
How to convert tokenized words back to the original ones after inference?
I'm writing an inference script for an already trained NER model, but I'm having trouble converting encoded tokens (their IDs) back into the original words.
# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks…
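The usual trick is that fast tokenizers can return `offset_mapping` (per-token character spans into the original string), and slicing the raw text with those spans recovers the surface form. A sketch with hand-written offsets standing in for real tokenizer output:

```python
# offsets as a fast tokenizer would return them with return_offsets_mapping=True;
# the (start, end) pairs below are illustrative, not real tokenizer output
text = "Amazon and Tesla are currently the best picks"
offsets = [(0, 6), (7, 10), (11, 16)]

# slicing the original text with each span recovers the original words
recovered = [text[start:end] for start, end in offsets]
```

With a real fast tokenizer, `encoding.word_ids()` similarly maps each subword token back to its source word index.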

deonardo_licaprio · 308

4 votes · 1 answer
Labeling model with Huggingface Dataset
I have the following code
from scipy.spatial.distance import dice, directed_hausdorff
from sklearn.metrics import f1_score
from segments import SegmentsClient
from segments import SegmentsDataset
from datasets import load_dataset
from…

Norhther · 545

4 votes · 1 answer
How to drop sentences that are too long in Huggingface?
I'm going through the Huggingface tutorial, and it appears that the library has automatic truncation to cut sentences that are too long, based on a max value or other criteria.
How can I remove sentences for the same reasoning (sentences are too long,…

Penguin · 1,923

4 votes · 0 answers
max_steps and generative dataset huggingface
I am fine-tuning a model on my domain using both MLM and NSP. I am using TextDatasetForNextSentencePrediction for NSP and DataCollatorForLanguageModeling for MLM.
The problem is with TextDatasetForNextSentencePrediction as it loads everything in…

Prasanna · 4,125

3 votes · 1 answer
How to use sample_by="document" argument with load_dataset in Huggingface Dataset?
Problem
Hello. I am trying to use Huggingface to do some malware classification. I have 5738 malware binaries in a directory. The paths to these malware binaries are stored in a list called files. I am trying to load these binaries into a…

Luke Kurlandski · 81

3 votes · 1 answer
How to create a dataset object for multiple text inputs to the SetFit model?
The SetFit library accepts two inputs: "text" and "label", https://huggingface.co/blog/setfit
My goal is to train SetFit using two similarity inputs with a binary label (similar or not similar). ("text1","text2","similar/not")
The example of dataset…

wenz · 61

3 votes · 1 answer
Using huggingface load_dataset in Google Colab notebook
I am trying to load a training dataset in my Google Colab notebook but keep getting an error. This happens exclusively in Colab; when I run the same notebook in VS Code, the dataset loads without a problem.
Here is the code snippet which returns the…

Luiz Felipe Bromfman · 31

3 votes · 1 answer
Cast features to ClassLabel
I have a dictionary which I converted to a Dataset:
ds = datasets.Dataset.from_dict(bio_dict)
The shape now is:
Dataset({
    features: ['id', 'text', 'ner_tags', 'input_ids', 'attention_mask', 'label'],
    num_rows: 8805
})
When I…

Yana · 785

3 votes · 0 answers
Huggingface datasets storing and loading image data
I have a huggingface dataset with an image column
ds["image"][0]
When I save it to disk and load it later, I get the image column as…

Vincent Claes · 3,960

3 votes · 1 answer
Predict over a whole dataset using Transformers
I'm trying to do zero-shot classification over a dataset with 5000 records. Right now I'm using a normal Python loop, but it is going painfully slow. Is there a way to speed up the process using Transformers or Datasets structures? This is how my code…

ignacioct · 325