How can I add a custom method to a plugin so that it returns the data source in several custom formats, not only in dask format?

Question

I am working on an intake plugin that reads specific JSON files from GitHub. These JSON files contain basic information about systems that we want to simulate with different simulation software, each with its own input format. We have converters from JSON to each of these formats available. I would now like to add a `to_format` method to my plugin, similar to the `to_dask` method, but I keep getting `RemoteSequenceSource object has no attribute 'to_format'`. Is there a way to do this?


```python
# Imports below were not shown in the original question and are assumed.
from intake.container.base import RemoteSource
from intake.source import base
from latticejson.convert import to_elegant, to_madx

# read_remote_file is the asker's own helper (not shown in the question).


class RemoteLatticejson(RemoteSource):
    """
    A lattice json source on the server
    """

    name = 'remote-latticejson'
    container = 'python'
    partition_access = False

    def __init__(self, org, repo, filename, parameters=None, metadata=None, **kwargs):
        # super().__init__(org, repo, filename, parameters, metadata=metadata, **kwargs)
        self._schema = None
        self.org = org
        self.repo = repo
        self.filename = filename
        self.metadata = metadata

        self._dict = None

    def _load(self):
        # Fetch the JSON file and cache it as a dict.
        self._dict = read_remote_file(self.org, self.repo, self.filename)

    def _get_schema(self):
        if self._dict is None:
            self._load()

        self._dtypes = {
            'version': 'str',
            'title': 'str',
            'root': 'str',
            'elements': 'dict',
            'lattice': 'dict'
        }
        return base.Schema(
            datashape=None,
            dtype=self._dtypes,
            shape=(None, len(self._dtypes)),
            npartitions=1,
            extra_metadata={}
        )

    def _get_partition(self, i):
        if self._dict is None:
            self._load_metadata()
        # Single partition: the whole document, wrapped in a list.
        return [self.read()]

    def read(self):
        if self._dict is None:
            self._load()

        self.metadata = {
            'version': self._dict.get('version'),
            'title': self._dict.get('title'),
            'root': self._dict.get('root')
        }

        return self._dict

    def to_madx(self):
        # Make sure the data is loaded, then convert it to MAD-X format.
        self._get_schema()
        return to_madx(self._dict)

    def _close(self):
        pass
```

score 0 · Answer 1 · answered Jun 30 '20 at 14:51

There are two concepts at play here:

- A new driver, which can freely add methods to its implementation (`to_X`) and expose them to the user. This is allowed, and there are cases implementing this, to pass out particular formats or to allow access to the base object (like here). Note that by adding methods, you make the already-long list of methods on the source even longer, so we lightly discourage this.
- A remote source, which is only used in the case that the client cannot access the data directly (because it doesn't have a route, permission, or the right driver locally). This case is more restricted, and the transfer of data is mediated by the "container" sources. If you wanted to have new, custom behaviour for your source when transferring data through the server, you would need to write your own container as well as the original driver (the driver would have `container = "mycustom"` and you would register the container with `intake.container.register_container`).
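
The first concept, a driver exposing extra `to_X` methods, can be sketched roughly as follows. This is only an illustration in plain Python, not Intake's API: `LatticeSource` is a hypothetical stand-in for a driver class (in Intake it would subclass `DataSource`), and the two `*_stub` functions are made-up placeholders for the real `latticejson` converters.

```python
import json
from typing import Callable, Dict


def to_madx_stub(lattice: dict) -> str:
    """Hypothetical converter; stands in for latticejson.convert.to_madx."""
    return f"TITLE, '{lattice['title']}';"


def to_elegant_stub(lattice: dict) -> str:
    """Hypothetical converter; stands in for latticejson.convert.to_elegant."""
    return f"! {lattice['title']}"


class LatticeSource:
    """Hypothetical stand-in for a driver; not an Intake class."""

    # Map of format name -> converter function taking the parsed dict.
    converters: Dict[str, Callable[[dict], str]] = {
        "madx": to_madx_stub,
        "elegant": to_elegant_stub,
    }

    def __init__(self, raw_json: str):
        self._raw = raw_json
        self._dict = None

    def read(self) -> dict:
        # Parse lazily and cache the result.
        if self._dict is None:
            self._dict = json.loads(self._raw)
        return self._dict

    def to_format(self, name: str) -> str:
        # Dispatch to the converter registered under the given name.
        try:
            convert = self.converters[name]
        except KeyError:
            known = sorted(self.converters)
            raise ValueError(f"unknown format {name!r}; choose from {known}") from None
        return convert(self.read())


source = LatticeSource('{"title": "demo ring", "elements": {}, "lattice": {}}')
madx_text = source.to_format("madx")
elegant_text = source.to_format("elegant")
```

A single `to_format(name)` dispatcher adds one method to the source instead of one per format. It would only be available when the driver itself is instantiated on the client; a source served through the Intake server arrives as the container's remote class, which is why the question sees `RemoteSequenceSource` without the method.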

You can see from this that Intake was not really designed for processing or writing data, but for bringing you datasets in recognised forms in the simplest way. By limiting scope, we hoped to keep the code simple and flexible.

Thanks for your comments. This is a little bit troublesome, as quite a lot of our data is not easy to put into a dataframe-like structure. For example, many of our files are simulation input files, basically strings with a certain internal structure, that we would like to make available to users. Furthermore, we want to offer them in different formats (strings with different internal structure) so that they can be used as inputs for the simulation software available to the user, hence my question about the formats. And, of course, we need to be able to control who can see what. — TMS, Jul 03 '20 at 14:43
You may wish to post an issue on the intake repo describing in as much detail as you can the type of data and set of outputs you would like. You might find that defaulting to the "python" (i.e., generic sequence) container is the best option, or perhaps indeed making your own container. — mdurant, Jul 03 '20 at 14:50
