Problem
Hello. I am trying to use Hugging Face to do some malware classification. I have 5738 malware binaries in a directory. The paths to these binaries are stored in a list called files. I am trying to load these binaries into a Hugging Face datasets.Dataset object.
I created the Dataset like this:
dataset = datasets.Dataset.from_text(
    files,
    sample_by="document",
    encoding="latin1",
)
Since each file is supposed to represent a single instance, I used sample_by="document", which to my knowledge (confirmed by reading the source code) should treat each document in files as an individual example.
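For reference, here is a minimal plain-Python sketch of the behavior I expected: each path is read whole, decoded as latin1, and turned into exactly one record. The helper name read_documents and the tiny demo files are hypothetical stand-ins I made up for illustration; this is not how datasets implements it, only what I assumed sample_by="document" would amount to.

```python
import tempfile
from pathlib import Path


def read_documents(paths, encoding="latin1"):
    """Hypothetical helper: one record per path, whole file as a single string."""
    records = []
    for path in paths:
        # latin1 maps every byte value 0-255 to a code point, so decoding never fails
        text = Path(path).read_bytes().decode(encoding)
        records.append({"text": text})
    return records


# Tiny stand-in for the real list of malware paths (contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    demo_files = []
    for i, payload in enumerate([b"MZ\x90\x00", b"\x7fELF\n\n\x01", b"PK\x03\x04"]):
        p = Path(tmp) / f"sample_{i}.bin"
        p.write_bytes(payload)
        demo_files.append(str(p))
    demo_records = read_documents(demo_files)

# One row per file, regardless of newlines or blank lines inside a file
assert len(demo_records) == len(demo_files)
```

With this reading, the number of records always equals the number of paths, which is the invariant I expected the Dataset to satisfy.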
Strangely, the length of files and the length of the resulting dataset do not appear to be the same:
>>> dataset.num_rows, len(files)
(27967, 5738)
The expected behavior was that each file in files would get mapped to exactly one row in dataset, but apparently this did not happen. Any idea what's up with this? Thanks!
Software
- datasets 2.12.0
- Python 3.10.6
- CentOS 9