
Problem

Hello. I am trying to use Hugging Face to do some malware classification. I have 5738 malware binaries in a directory. The paths to these binaries are stored in a list called files, and I am trying to load them into a Hugging Face datasets.Dataset object.

I created the Dataset like this:

dataset = datasets.Dataset.from_text(
    files,
    sample_by="document",
    encoding="latin1",
)

Since each file is supposed to represent a single instance, I used sample_by="document", which to my knowledge (confirmed by reading the source code) should treat each file in files as an individual example.

Strangely, the length of files and the number of rows in the resulting dataset do not match:

dataset.num_rows, len(files)
>>> (27967, 5738)

The expected behavior was that each file in files would be mapped to a single row in dataset, but apparently this did not happen. Any idea what's up with this? Thanks!
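To narrow down where the extra rows come from, here is a small stdlib-only sketch (the helper name newline_stats is mine) that decodes each file as latin-1 and counts line and paragraph boundaries. If the loader were sampling by line or paragraph rather than by document, the total row count would track these numbers instead of len(files):

```python
def newline_stats(paths):
    """Decode each file as latin-1 and count line/paragraph boundaries.

    If sample_by="document" were silently falling back to line or
    paragraph sampling, the dataset's row count would scale with these
    counts rather than with the number of files.
    """
    stats = {}
    for path in paths:
        with open(path, 'rb') as fin:
            text = fin.read().decode('latin1')
        stats[path] = {
            'lines': len(text.splitlines()),
            'paragraphs': text.count('\n\n') + 1,
        }
    return stats
```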

Software

  • datasets 2.12.0
  • Python 3.10.6
  • CentOS 9

Comments

  • What is in your files? Are you sure that the files contain one instance each? – Junuxx May 22 '23 at 21:49
  • @Junuxx `files` is a list of str files. Each file is supposed to be treated as a single instance, which I thought was the behavior of sample_by="document". – Luke Kurlandski May 25 '23 at 13:00

1 Answer


Using these files as an example, from https://github.com/jstrosch/malware-samples/blob/master/binaries/nanocore/2020/March/samples_pcap_artifacts.zip:

! wget https://github.com/jstrosch/malware-samples/raw/master/binaries/nanocore/2020/March/samples_pcap_artifacts.zip 
! unzip -P infected samples_pcap_artifacts.zip  
$ ls
1ef872652a143f17864063628cd4941d.bin  NanoCoreBase.bin*
ClientPlugin.bin*                     NanoCoreStressTester.bin*
CoreClientPlugin.bin*                 NetworkClientPlugin.bin*
FileBrowserClient.bin*                SecurityClientPlugin.bin*
ManagementClientPlugin.bin*           SurveillanceClientPlugin.bin*
MyClientPlugin2.bin*                  SurveillanceExClientPlugin.bin*
MyClientPluginNew.bin*                ToolsClientPlugin.bin*

Doing it the hard way

Then to read them into bytes and a Hugging Face Dataset:


import glob

from datasets import Dataset

filenames = glob.glob('./*.bin')

# Read each file's raw bytes; one file becomes one example.
filebytes = []
for fn in filenames:
    with open(fn, 'rb') as fin:
        filebytes.append(fin.read())

ds = Dataset.from_dict({'filename': filenames, 'content': filebytes})
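The same collection step can be written with pathlib, which keeps the two lists aligned in one place (a sketch; the helper name collect_binaries is mine):

```python
from pathlib import Path

def collect_binaries(directory):
    """Gather aligned lists of filenames and raw bytes, in a shape
    suitable for Dataset.from_dict."""
    filenames, filebytes = [], []
    for path in sorted(Path(directory).glob('*.bin')):
        filenames.append(str(path))
        filebytes.append(path.read_bytes())
    return {'filename': filenames, 'content': filebytes}
```

The returned dict can be passed straight to Dataset.from_dict, and sorting the paths makes the row order reproducible across runs.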

Using load_dataset

import glob

from datasets import load_dataset, Features, Value

filenames = glob.glob('./*.bin')

ds2 = load_dataset("text",
    data_files={"train": filenames},
    sample_by="document",
    features=Features({'text': Value(dtype='string', id=None)}),
    encoding='latin-1',
)

ds2['train']

[out]:

Dataset({
    features: ['text'],
    num_rows: 14
})
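A side note on the encoding='latin-1' choice: latin-1 assigns a code point to every byte value 0–255, so decoding arbitrary binary content with it is lossless and the original bytes can be recovered later, which is why it is a common trick for stuffing binaries into a string column. A quick check:

```python
# latin-1 maps every byte value 0..255 to exactly one code point,
# so decode/encode round-trips arbitrary binary data losslessly.
blob = bytes(range(256))
roundtrip = blob.decode('latin-1').encode('latin-1')
assert roundtrip == blob
```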
alvas
  • Thank you, I created a similar work-around constructing the dataset in a different manner. However, if you have any insight into the strange behavior above, I would be very curious to understand what is wrong with my interpretation of the documentation, or whether some kind of bug might exist in the API. – Luke Kurlandski May 30 '23 at 11:58
  • Use the `features=...` argument and see if it works for your files. – alvas May 30 '23 at 19:47