Developers’ Package Tour
General Guidelines
Intake is an open source project, and all development happens on GitHub. Please open issues or discussions there to talk about problems with the code or to request features.
To contribute, you should:
- clone the repo locally
- fork the repo to your personal identity on GitHub using the "fork" button
- run pre-commit install in the repo
- make changes locally as you see fit, and commit to a new branch
- push the branch to your fork, and follow the prompt to create a Pull Request (PR)
You can expect comments on your PR within a couple of days.
To have a higher chance of having your changes accepted, a concise title and description are best, and ideally new code should be accompanied by tests.
Outline
For those interested in Intake Take2, here are the places to look for contributing. All of the implementation code lives under intake.readers, which was developed for a while in parallel with, and without touching, Intake's V1 code. The list below gives summaries of the modules, and the principal classes themselves are in the API Reference.
intake.config | Intake config manipulations and persistence
intake.readers.catalogs | Data readers which create Catalog objects
intake.readers.convert | Convert between python representations of data
intake.readers.datatypes | Enumerates all the sorts of data that Intake knows about
intake.readers.entry | Description of the ways to load a data set
intake.readers.importlist | Imports made by intake when it itself is imported
intake.readers.metadata | Some types and meanings of fields that can be expected in metadata dictionaries
intake.readers.mixins | Helpers for creating pipelines
intake.readers.namespaces | Add module accessors to pipelines, providing functions appropriate for its output
intake.readers.output | Serialise and output data into persistent formats
intake.readers.readers | Classes for reading data into Python objects
intake.readers.search | Find datasets meeting some complex criteria
intake.readers.transform | Manipulate data: functions that change the data but not the container type
intake.readers.user_parameters | Parametrization of data/reader entries, as they appear in Catalogs
Creating Datatypes and Readers
Here follows a minimal but complete set of classes to make a full pipeline. A typical data/reader implementation in Intake Take2 is very simple. Here is the CSV prototype:
from intake.readers import FileData

class CSV(FileData):
    filepattern = "(csv$|txt$|tsv$)"
    mimetypes = "(text/csv|application/csv|application/vnd.ms-excel)"
    structure = {"table"}
This specifies that CSVs live in files (via the superclass), which also implies that they may be local or remote. Further, the block specifies expected URL/filename patterns and MIME types, as well as an indicator that this filetype is typically used for tables. All of these attributes are optional: an instance just contains enough information to unambiguously identify the source of data. For a CSV dataset, this is just the URL(s) of the data plus any extra storage backend parameters. Other data types may have other necessary attributes; a SQL dataset, for example, is a combination of a server connection string and a query.
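For instance, creating an instance of this datatype might look like the following; the URL and storage options here are invented for illustration:

data = CSV(url="s3://mybucket/data.csv", storage_options={"anon": True})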
The pandas CSV reader counterpart looks like this:

from intake.readers import FileReader, datatypes

class PandasCSV(FileReader):
    imports = {"pandas"}
    output_instance = "pandas:DataFrame"
    storage_options = True
    implements = {datatypes.CSV}
    func = "pandas:read_csv"
    url_arg = "filepath_or_buffer"
This says that:

- the data type is made of files
- the reader requires "pandas" to be installed
- the result will be a DataFrame
- if the URL is remote, fsspec-style storage_options are acceptable
- it can be used on the CSV type from before (only)
- it uses the read_csv function from the pandas package
- the URL of the data source should be passed using the argument name "filepath_or_buffer" (this information can be found in the target function's signature and docstring).
Often a reader is this simple, or even simpler when you can group attributes in common subclasses. In other cases, it may be necessary to override the key method, ._read(), which is the one that does the work. In fact, PandasCSV does override .discover(), to add the nrows= argument, but adding such refinements is optional.
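As a sketch of what such an override can look like, here is a hypothetical reader with a custom ._read(); the datatype, names and body are illustrative only, and the ._read() signature is assumed from the FileReader pattern above:

from intake.readers import FileReader, datatypes

class PandasJSONLines(FileReader):
    # hypothetical reader for newline-delimited JSON
    imports = {"pandas"}
    output_instance = "pandas:DataFrame"
    implements = {datatypes.JSONFile}  # assumes the JSONFile datatype exists

    def _read(self, data, **kwargs):
        # "data" is the datatype instance holding the URL; custom logic goes here
        import pandas as pd
        return pd.read_json(data.url, lines=True, **kwargs)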
Doing the above is enough that a URL ending in "csv" will be recognised, and pandas offered as one of the potential readers; thus we can make a reader instance and store it in a Catalog.
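In practice, that flow might look something like this; the function and method names (recommend, to_reader) are given from memory and should be checked against the API Reference:

>>> from intake.readers import datatypes
>>> datatypes.recommend("blah.csv")          # should include the CSV class
>>> data = datatypes.CSV("blah.csv")
>>> reader = data.to_reader("pandas:DataFrame")  # e.g., PandasCSV
>>> df = reader.read()                       # a pandas DataFrame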
Next, let’s imagine we want to make a super simple converter:
from intake import BaseConverter

class PandasToStr(BaseConverter):
    instances = {"pandas:DataFrame": "builtins:str"}
    func = "builtins:str"
This just returns the string representation of the dataframe, turning DataFrame instances into str instances (actually, it would work for just about any python object). The inclusion of "DataFrame" in instances means that Intake will know that this is a transform that can be applied to readers that produce a DataFrame, and it will appear in tab completions and a reader instance's .transform attribute.
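For example, applying it to the reader from before might look like this, assuming (per the above) that converters appear as attributes of .transform:

>>> text_reader = reader.transform.PandasToStr()
>>> text_reader.output_instance
'builtins:str'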
To complete the pipeline, let's make an outputter which writes this string back to a file:
import fsspec
from intake.readers import datatypes

class StrToFile(BaseConverter):
    instances = {"builtins:str": datatypes.Text.qname()}

    def run(self, x, url, storage_options=None, metadata=None, **kwargs):
        # write the incoming string to the given URL via fsspec
        with fsspec.open(url, mode="wt", **(storage_options or {})) as f:
            f.write(x)
        return datatypes.Text(url=url, storage_options=storage_options, metadata=metadata)
Although we use fsspec (which is recommended, where possible), the code is again super-simple. It is conventional, but not necessary, for such "output" nodes to return a datatypes instance.
All of this now allows:
>>> import intake
>>> intake.auto_pipeline("blah.csv", "Text")
PipelineReader:
  0: intake.readers.readers:PandasCSV, () {} => pandas:DataFrame
  1: PandasToStr, () {} => builtins:str
  2: StrToFile, () {} => intake.readers.datatypes:Text
(where the output filename remains to be filled in)
Packaging
Having made a couple of new classes, how would we get these to potential users?
Assuming you are already familiar with how to create a python package in general, what you need to know is that Intake will find the new code so long as the classes are subclasses of BaseData, BaseReader (etc.), and the code is imported. That importing can be done by:

- importing explicitly (which is good form for ad-hoc/experimental use)
- including an entrypoint for the package in the group "intake.imports", where the value would be of the form "package.module" or "package:module" (the latter for import .. from style); this requires that the new package is installed via pip, conda, etc.
- adding the package/module to intake.conf["extra_imports"] and saving; this will take effect on the next import of Intake.
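As a sketch, the first and last of these options might look like the following; the package name is invented, and the config-saving call is an assumption to be checked against the config module:

# option 1: explicit import; merely importing the module registers the subclasses
import my_intake_plugin.readers  # hypothetical package containing the new classes

# option 3: persist the module in Intake's config for automatic import next time
import intake
intake.conf["extra_imports"].append("my_intake_plugin.readers")
intake.conf.save()  # assumed save method; see the config module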
Migration from V1
Section Relationship to V1 shows the principal differences from Intake before Take2. From a developer's viewpoint, if porting former plugins, here are some things to bear in mind.
- in v2 we generally separate out the definition of the data itself versus the specific reader, e.g., HDF5 is a file type, but xarray is a reader which can handle HDF5. It is totally possible to write a reader without a data type if appropriate. See Base Classes for an overview of the classes.
- the new readers only really have one method that matters, .read(), which will contain all of the previous logic. It should consistently produce only one particular output type. Other attributes of BaseReader (or FileReader) are one-line overrides and mostly provide information rather than functionality; for instance, Intake uses these for recommending readers for a given data instance.
- for catalog-producing readers, the output type will be intake.readers.entry:Catalog, and the .read() method will create the Catalog instance and assign readers into it. Module intake.readers.catalogs contains some patterns to copy; a minimal sketch follows this list.
- if using file patterns: the DaskCSVPattern reader will give an idea of how to implement that in the new framework.
- if using V1 plots: dataframe- and xarray-producing readers have the ToHvPlot converter, which can be used for similar functionality.
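Here is the promised sketch of a catalog-producing reader; everything in it (the service, the listing, and the assignment syntax) is illustrative rather than taken from a real implementation, so compare with intake.readers.catalogs before copying:

from intake.readers import BaseReader, datatypes
from intake.readers.entry import Catalog
from intake.readers.readers import PandasCSV

class TinyCSVCatalog(BaseReader):
    # hypothetical: wraps a fixed listing of CSV URLs as a catalog
    output_instance = "intake.readers.entry:Catalog"

    def _read(self, listing=None, **kwargs):
        cat = Catalog()
        for name, url in (listing or {}).items():
            data = datatypes.CSV(url=url)
            cat[name] = PandasCSV(data=data)  # assumes item assignment adds an entry
        return cat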