End User

These are reference class and function definitions likely to be useful to everyone.

intake.open_catalog([uri])

Create a Catalog object

intake.registry

Dict of driver: DataSource class

intake.register_driver(name, value[, ...])

Add a runtime driver definition to the list of registered drivers (drivers appear in the global scope with a corresponding intake.open_* function)

intake.unregister_driver(name)

Remove runtime registered driver

intake.upload(data, path, **kwargs)

Given a concrete data object, store it at the given location and return a DataSource

intake.source.csv.CSVSource(*args, **kwargs)

Read CSV files into dataframes

intake.source.textfiles.TextFilesSource(...)

Read textfiles as sequence of lines

intake.source.jsonfiles.JSONFileSource(...)

Read JSON files as a single dictionary or list

intake.source.jsonfiles.JSONLinesFileSource(...)

Read a JSONL (https://jsonlines.org/) file and return a list of objects, each being a valid JSON object (e.g. a dictionary or list)

intake.source.npy.NPySource(*args, **kwargs)

Read numpy binary files into an array

intake.source.zarr.ZarrArraySource(*args, ...)

Read Zarr format files into an array

intake.catalog.local.YAMLFileCatalog(*args, ...)

Catalog as described by a single YAML file

intake.catalog.local.YAMLFilesCatalog(*args, ...)

Catalog as described by multiple YAML files

intake.catalog.zarr.ZarrGroupCatalog(*args, ...)

A catalog of the members of a Zarr group.

intake.interface.gui.GUI([cats])

Top level GUI panel that contains controls and all visible sub-panels

intake.open_catalog(uri=None, **kwargs)

Create a Catalog object

Can load YAML catalog files, connect to an intake server, or create any arbitrary Catalog subclass instance. In the general case, the user should supply driver= with a value from the plugins registry which has a container type of catalog. File locations can generally be remote, if specifying a URL protocol.

The default behaviour if not specifying the driver is as follows:

  • if uri is a single string ending in “yml” or “yaml”, open it as a catalog file

  • if uri is a list of strings, a string containing a glob character (“*”) or a string not ending in “y(a)ml”, open as a set of catalog files. In the latter case, assume it is a directory.

  • if uri begins with protocol "intake:", connect to a remote Intake server

  • if uri is None or missing, create a base Catalog object without entries.

Parameters
uri: str or pathlib.Path

Designator for the location of the catalog.

kwargs:

passed to subclass instance, see documentation of the individual catalog classes. For example, yaml_files_cat (when specifying multiple uris or a glob string) takes the additional parameter flatten=True|False, specifying whether all data sources are merged in a single namespace, or each file becomes a sub-catalog.
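
For example, a minimal sketch of the behaviours listed above (the file paths and server address are hypothetical):

>>> import intake
>>> cat = intake.open_catalog("catalog.yml")            # single YAML catalog file
>>> cat = intake.open_catalog("catalogs/*.yml")          # glob: a set of catalog files
>>> cat = intake.open_catalog("intake://server:5000")    # remote Intake server
>>> cat = intake.open_catalog()                          # empty base Catalog, no entries
>>> list(cat)                                            # entry names in the catalog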

See also

intake.open_yaml_files_cat, intake.open_yaml_file_cat
intake.open_intake_remote

intake.registry

Mapping from plugin names to the DataSource classes that implement them. These are the names that should appear in the driver: key of each source definition in a catalog. See Plugin Directory for more details.

intake.open_*

Set of functions, one for each plugin, for direct opening of a data source. The names are derived from the names of the plugins in the registry at import time.
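
As a quick illustration (a sketch; the exact contents of the registry depend on which plugins are installed in the session):

>>> import intake
>>> sorted(intake.registry)[:3]              # driver names known to this session
>>> source = intake.open_csv("data/*.csv")   # uses the "csv" entry of the registry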

intake.upload(data, path, **kwargs)

Given a concrete data object, store it at the given location and return a DataSource

Use this function to publicly share data which you have created in your Python session. Intake will try each of the container types to see if one of them can handle the input data, and will write the data to the given path in the format most appropriate for the data type, e.g., Parquet for pandas or Dask data-frames.

With the DataSource instance you get back, you can add this to a catalog, or just get the YAML representation for editing (.yaml()) and sharing.
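
For instance, a sketch of sharing a small dataframe (the destination path is hypothetical):

>>> import intake
>>> import pandas as pd
>>> df = pd.DataFrame({"station": ["a", "b"], "rain_mm": [3.1, 0.0]})
>>> source = intake.upload(df, "s3://mybucket/shared/rainfall")  # dataframes are written as parquet
>>> print(source.yaml())   # YAML block ready to paste into a catalog file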

Parameters
data: instance

The object to upload and store. In many cases, the dask or in-memory variants are handled equivalently.

path: str

Location of the output files; can be, for instance, a network drive for sharing over a VPC, or a bucket on a cloud storage service

kwargs: passed to the writer for fine control

Returns
DataSource instance

class intake.interface.gui.GUI(cats=None)

Top level GUI panel that contains controls and all visible sub-panels

This class is responsible for coordinating the inputs and outputs of various sub-panels and their effects on each other.

Parameters
cats: list of catalogs

catalogs used to initialize the cat panel

Attributes
children: list of panel objects

children that will be used to populate the panel when visible

panel: panel layout object

instance of a panel layout (row or column) that contains children when visible

watchers: list of param watchers

watchers that are set on children - cleaned up when visible is set to false.

add(*args, **kwargs)

Add to list of cats

property cats

Cats that have been selected from the cat sub-panel

classmethod from_state(state)

Create a new object from a serialized existing object.

property item

Item that is selected

property source_instance

DataSource instance for the current selection and any parameters

property sources

Sources that have been selected from the source sub-panel
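
As a rough sketch of notebook usage (assumes the optional panel dependency is installed; the catalog file name is hypothetical):

>>> import intake
>>> from intake.interface.gui import GUI
>>> gui = GUI(cats=[intake.open_catalog("catalog.yml")])
>>> gui.panel          # display in a notebook cell to interact with the controls
>>> gui.sources        # DataSource entries selected in the source sub-panel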

Source classes

class intake.source.csv.CSVSource(*args, **kwargs)

Read CSV files into dataframes

Prototype of sources reading dataframe data

__init__(urlpath, csv_kwargs=None, metadata=None, storage_options=None, path_as_pattern=True)
Parameters
urlpath: str or iterable, location of data

May be a local path, or remote path if including a protocol specifier such as 's3://'. May include glob wildcards or format pattern strings. Some examples:

  • {{ CATALOG_DIR }}data/precipitation.csv

  • s3://data/*.csv

  • s3://data/precipitation_{state}_{zip}.csv

  • s3://data/{year}/{month}/{day}/precipitation.csv

  • {{ CATALOG_DIR }}data/precipitation_{date:%Y-%m-%d}.csv

csv_kwargs: dict

Any further arguments to pass to Dask’s read_csv (such as block size) or to the CSV parser in pandas (such as which columns to use, encoding, data-types)

storage_options: dict

Any parameters that need to be passed to the remote data backend, such as credentials.

path_as_pattern: bool or str, optional

Whether to treat the path as a pattern (e.g. data_{field}.csv) and create new columns in the output corresponding to pattern fields. If str, it is treated as the pattern to match on. Default is True.
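
For example, a sketch combining a pattern path with csv_kwargs (the bucket layout is hypothetical):

>>> import intake
>>> source = intake.open_csv(
...     "s3://data/precipitation_{state}_{zip}.csv",
...     csv_kwargs={"usecols": ["date", "amount"]},
... )
>>> df = source.read()   # pattern fields "state" and "zip" appear as extra columns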

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
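
Continuing the sketch above, for instance (the destination is hypothetical):

>>> new_source = source.export("s3://mybucket/precip-copy/")
>>> print(new_source.yaml())     # shareable description of the exported copy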

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

read()

Load entire dataset into a container and return it

read_partition(i)

Return a part of the data corresponding to i-th partition.

By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.

to_dask()

Return a dask container for this data source

class intake.source.zarr.ZarrArraySource(*args, **kwargs)

Read Zarr format files into an array

Zarr is a numerical array storage format which works particularly well with remote and parallel access. For specifics of the format, see https://zarr.readthedocs.io/en/stable/
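
A minimal sketch of reading an array from a Zarr store (the store URL and component path are hypothetical):

>>> from intake.source.zarr import ZarrArraySource
>>> source = ZarrArraySource("gcs://mybucket/store.zarr", component="group/temperature")
>>> source.discover()        # populate dtype, shape and partition information
>>> arr = source.to_dask()   # lazy dask array
>>> data = source.read()     # or load everything into memory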

__init__(urlpath, storage_options=None, component=None, metadata=None, **kwargs)

The parameters dtype and shape will be determined from the first file, if not given.

Parameters
urlpath: str

Location of data file(s), possibly including protocol information

storage_options: dict

Passed on to storage backend for remote files

component: str or None

If None, assume the URL points to an array. If given, assume the URL points to a group, and descend the group to find the array at this location in the hierarchy; components are separated by the “/” character.

kwargs: passed on to dask.array.from_zarr

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

read()

Load entire dataset into a container and return it

read_partition(i)

Return a part of the data corresponding to i-th partition.

By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.

to_dask()

Return a dask container for this data source

class intake.source.textfiles.TextFilesSource(*args, **kwargs)

Read textfiles as sequence of lines

Prototype of sources reading sequential data.

Takes a set of files, and returns an iterator over the text in each of them. The files can be local or remote. Extra parameters for encoding, etc., go into storage_options.

__init__(urlpath, text_mode=True, text_encoding='utf8', compression=None, decoder=None, read=True, metadata=None, storage_options=None)
Parameters
urlpath: str or list(str)

Target files. Can be a glob-path (with “*”) and include a protocol specifier (e.g., “s3://”). Can also be a list of absolute paths.

text_mode: bool

Whether to open the file in text mode, recoding binary characters on the fly

text_encoding: str

If text_mode is True, apply this encoding. UTF-8 is by far the most common

compression: str or None

If given, decompress the file with the given codec on load. Can be something like “gzip” or “bz2”, or ‘infer’ to try to guess from the filename

decoder: function, str or None

Use this to decode the contents of files. If None, you will get a list of lines of text/bytes. If a function, it must operate on an open file-like object or a bytes/str instance, and return a list

read: bool

If decoder is not None, this flag controls whether bytes/str get passed to the function indicated (True) or the open file-like object (False)

storage_options: dict

Options to pass to the file reader backend, including text-specific encoding arguments, and parameters specific to the remote file-system driver, if using.
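
For example, a sketch of reading compressed text logs (the paths are hypothetical):

>>> import intake
>>> source = intake.open_textfiles("s3://mybucket/logs/*.txt.gz", compression="gzip")
>>> lines = source.read()             # text lines from all matching files, as a list
>>> first = source.read_partition(0)  # lines from the first file only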

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

read()

Load entire dataset into a container and return it

read_partition(i)

Return a part of the data corresponding to i-th partition.

By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.

to_dask()

Return a dask container for this data source

class intake.source.jsonfiles.JSONFileSource(*args, **kwargs)

Read JSON files as a single dictionary or list

The files can be local or remote. Extra parameters for encoding, etc., go into storage_options.

__init__(urlpath: str, text_mode: bool = True, text_encoding: str = 'utf8', compression: Optional[str] = None, read: bool = True, metadata: Optional[dict] = None, storage_options: Optional[dict] = None)
Parameters
urlpath: str

Target file. Can include a protocol specifier (e.g., “s3://”).

text_mode: bool

Whether to open the file in text mode, recoding binary characters on the fly

text_encoding: str

If text_mode is True, apply this encoding. UTF-8 is by far the most common

compression: str or None

If given, decompress the file with the given codec on load. Can be something like “zip”, “gzip”, or “bz2”, or ‘infer’ to try to guess from the filename

storage_options: dict

Options to pass to the file reader backend, including text-specific encoding arguments, and parameters specific to the remote file-system driver, if using.
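
A minimal sketch, using the class directly (the file name is hypothetical):

>>> from intake.source.jsonfiles import JSONFileSource
>>> source = JSONFileSource("config.json")
>>> obj = source.read()     # the decoded dictionary or list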

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

read()

Load entire dataset into a container and return it

class intake.source.jsonfiles.JSONLinesFileSource(*args, **kwargs)

Read a JSONL (https://jsonlines.org/) file and return a list of objects, each being a valid JSON object (e.g. a dictionary or list)

__init__(urlpath: str, text_mode: bool = True, text_encoding: str = 'utf8', compression: Optional[str] = None, read: bool = True, metadata: Optional[dict] = None, storage_options: Optional[dict] = None)
Parameters
urlpath: str

Target file. Can include a protocol specifier (e.g., “s3://”).

text_mode: bool

Whether to open the file in text mode, recoding binary characters on the fly

text_encoding: str

If text_mode is True, apply this encoding. UTF-8 is by far the most common

compression: str or None

If given, decompress the file with the given codec on load. Can be something like “zip”, “gzip”, or “bz2”, or ‘infer’ to try to guess from the filename.

storage_options: dict

Options to pass to the file reader backend, including text-specific encoding arguments, and parameters specific to the remote file-system driver, if using.

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

head(nrows: int = 100)

Return the first nrows lines from the file
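
For example (the file name is hypothetical):

>>> from intake.source.jsonfiles import JSONLinesFileSource
>>> source = JSONLinesFileSource("events.jsonl")
>>> sample = source.head(10)    # first ten decoded objects
>>> everything = source.read()  # all objects in the file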

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

read()

Load entire dataset into a container and return it

class intake.source.npy.NPySource(*args, **kwargs)

Read numpy binary files into an array

Prototype source showing example of working with arrays

Each file becomes one or more partitions, but partitioning within a file is only along the largest dimension, to ensure contiguous data.

__init__(path, dtype=None, shape=None, chunks=None, storage_options=None, metadata=None)

The parameters dtype and shape will be determined from the first file, if not given.

Parameters
path: str or list of str

Location of data file(s), possibly including glob and protocol information

dtype: str dtype spec

If known, the dtype (e.g., “int64” or “f4”).

shape: tuple of int

If known, the length of each axis

chunks: int

Size of chunks within a file along biggest dimension - must exactly divide each file, or None for one partition per file.

storage_options: dict

Passed to file-system backend.
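
A sketch of reading a set of .npy files as one array (paths and chunk size are hypothetical; chunks must exactly divide each file along its largest dimension):

>>> from intake.source.npy import NPySource
>>> source = NPySource("data/part-*.npy", chunks=1000)
>>> arr = source.to_dask()   # dask array, chunked along the largest dimension
>>> arr = source.read()      # or read fully into a numpy array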

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

read()

Load entire dataset into a container and return it

read_partition(i)

Return a part of the data corresponding to i-th partition.

By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.

to_dask()

Return a dask container for this data source

class intake.catalog.local.YAMLFileCatalog(*args, **kwargs)

Catalog as described by a single YAML file

__init__(path=None, text=None, autoreload=True, **kwargs)
Parameters
path: str

Location of the file to parse (can be remote)

text: str (DEPRECATED)

YAML contents of catalog, takes precedence over path

autoreload: bool

Whether to watch the source file for changes; set to False if you want an editable Catalog
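
For instance (the catalog path is hypothetical):

>>> from intake.catalog.local import YAMLFileCatalog
>>> cat = YAMLFileCatalog("catalog.yml")
>>> list(cat)            # entry names defined in the file
>>> cat.walk(depth=2)    # also include entries of nested sub-catalogs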

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

reload()

Reload catalog if sufficient time has passed

walk(sofar=None, prefix=None, depth=2)

Get all entries in this catalog and sub-catalogs

Parameters
sofar: dict or None

Within recursion, use this dict for output

prefix: list of str or None

Names of levels already visited

depth: int

Number of levels to descend; needed to truncate circular references and for cleaner output

Returns
Dict where the keys are the entry names in dotted syntax, and the values are entry instances.

class intake.catalog.local.YAMLFilesCatalog(*args, **kwargs)

Catalog as described by multiple YAML files

__init__(path, flatten=True, **kwargs)
Parameters
path: str

Location of the files to parse (can be remote), including possible glob (*) character(s). Can also be list of paths, without glob characters.

flatten: bool (True)

Whether to list all entries in the cats at the top level (True) or create sub-cats from each file (False).
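
A sketch showing the effect of flatten (the paths are hypothetical):

>>> from intake.catalog.local import YAMLFilesCatalog
>>> flat = YAMLFilesCatalog("catalogs/*.yml", flatten=True)     # single namespace
>>> nested = YAMLFilesCatalog("catalogs/*.yml", flatten=False)  # one sub-catalog per file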

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

reload()

Reload catalog if sufficient time has passed

walk(sofar=None, prefix=None, depth=2)

Get all entries in this catalog and sub-catalogs

Parameters
sofar: dict or None

Within recursion, use this dict for output

prefix: list of str or None

Names of levels already visited

depth: int

Number of levels to descend; needed to truncate circular references and for cleaner output

Returns
Dict where the keys are the entry names in dotted syntax, and the values are entry instances.

class intake.catalog.zarr.ZarrGroupCatalog(*args, **kwargs)

A catalog of the members of a Zarr group.

__init__(urlpath, storage_options=None, component=None, metadata=None, consolidated=False, name=None)
Parameters
urlpath: str

Location of data file(s), possibly including protocol information

storage_options: dict, optional

Passed on to storage backend for remote files

component: str, optional

If None, build a catalog from the root group. If given, build the catalog from the group at this location in the hierarchy.

metadata: dict, optional

Catalog metadata. If not provided, will be populated from Zarr group attributes.

consolidated: bool, optional

If True, assume Zarr metadata has been consolidated.
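
A minimal sketch (the store URL and member name are hypothetical):

>>> from intake.catalog.zarr import ZarrGroupCatalog
>>> cat = ZarrGroupCatalog("s3://mybucket/climate.zarr", consolidated=True)
>>> list(cat)                           # one entry per array or sub-group
>>> temps = cat["temperature"].read()   # read one hypothetical member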

export(path, **kwargs)

Save this data for sharing with other people

Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).

Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

persist(ttl=None, **kwargs)

Save data from this source to local persistent storage

Parameters
ttl: numeric, optional

Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.

kwargs: passed to the _persist method on the base container.

reload()

Reload catalog if sufficient time has passed

walk(sofar=None, prefix=None, depth=2)

Get all entries in this catalog and sub-catalogs

Parameters
sofar: dict or None

Within recursion, use this dict for output

prefix: list of str or None

Names of levels already visited

depth: int

Number of levels to descend; needed to truncate circular references and for cleaner output

Returns
Dict where the keys are the entry names in dotted syntax, and the values are entry instances.