End User
These are reference class and function definitions likely to be useful to everyone.
- intake.open_catalog: Create a Catalog object
- intake.registry: Dict of driver: DataSource class
- intake.register_driver: Add runtime driver definition to the list of registered drivers
- intake.unregister_driver: Remove runtime registered driver
- intake.upload: Given a concrete data object, store it at a given location and return a Source
- intake.source.csv.CSVSource: Read CSV files into dataframes
- intake.source.textfiles.TextFilesSource: Read text files as a sequence of lines
- intake.source.jsonfiles.JSONFileSource: Read JSON files as a single dictionary or list
- intake.source.jsonfiles.JSONLinesFileSource: Read a JSONL (https://jsonlines.org/) file and return a list of objects, each being a valid JSON object (e.g. a dictionary or list)
- intake.source.npy.NPySource: Read numpy binary files into an array
- intake.source.zarr.ZarrArraySource: Read Zarr format files into an array
- intake.catalog.local.YAMLFileCatalog: Catalog as described by a single YAML file
- intake.catalog.local.YAMLFilesCatalog: Catalog as described by multiple YAML files
- intake.catalog.zarr.ZarrGroupCatalog: A catalog of the members of a Zarr group
- intake.interface.gui.GUI: Top level GUI panel that contains controls and all visible sub-panels
- intake.open_catalog(uri=None, **kwargs)
Create a Catalog object
Can load YAML catalog files, connect to an intake server, or create any arbitrary Catalog subclass instance. In the general case, the user should supply driver= with a value from the plugins registry which has a container type of catalog. File locations can generally be remote, if specifying a URL protocol. The default behaviour if not specifying the driver is as follows:
- if uri is a single string ending in “yml” or “yaml”, open it as a catalog file
- if uri is a list of strings, a string containing a glob character (“*”) or a string not ending in “y(a)ml”, open as a set of catalog files. In the latter case, assume it is a directory.
- if uri begins with protocol "intake:", connect to a remote Intake server
- if uri is None or missing, create a base Catalog object without entries.
- Parameters
- uri: str or pathlib.Path
Designator for the location of the catalog.
- kwargs:
passed to subclass instance, see documentation of the individual catalog classes. For example, yaml_files_cat (when specifying multiple URIs or a glob string) takes the additional parameter flatten=True|False, specifying whether all data sources are merged in a single namespace, or each file becomes a sub-catalog.
See also
intake.open_yaml_files_cat, intake.open_yaml_file_cat, intake.open_intake_remote
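For example, a minimal sketch of the default behaviours described above (the file names and glob are hypothetical):

```python
import intake

# A single "yml"/"yaml" path opens one catalog file
cat = intake.open_catalog("catalog.yml")

# A glob opens a set of catalog files; flatten=False keeps one
# sub-catalog per file instead of merging entries into one namespace
multi = intake.open_catalog("cats/*.yml", flatten=False)

print(list(cat))  # entry names defined in the catalog
```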
- intake.registry
Mapping from plugin names to the DataSource classes that implement them. These are the names that should appear in the driver: key of each source definition in a catalog. See the Plugin Directory for more details.
- intake.open_*
Set of functions, one for each plugin, for direct opening of a data source. The names are derived from the names of the plugins in the registry at import time.
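A quick sketch of both of these together; the glob path is hypothetical:

```python
import intake

# Driver names registered at import time, each mapping to a DataSource class
print(sorted(intake.registry))

# Each registered driver gets a matching intake.open_<driver> function,
# e.g. the built-in csv driver:
source = intake.open_csv("data/*.csv")
```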
- intake.upload(data, path, **kwargs)
Given a concrete data object, store it at the given location and return a Source
Use this function to publicly share data which you have created in your python session. Intake will try each of the container types, to see if one of them can handle the input data, and write the data to the path given, in the format most appropriate for the data type, e.g., parquet for pandas or dask data-frames.
With the DataSource instance you get back, you can add this to a catalog, or just get the YAML representation for editing (.yaml()) and sharing.
- Parameters
- data: instance
The object to upload and store. In many cases, the dask or in-memory variants are handled equivalently.
- path: str
Location of the output files; can be, for instance, a network drive for sharing over a VPC, or a bucket on a cloud storage service
- kwargs: passed to the writer for fine control
- Returns
- DataSource instance
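A minimal sketch of uploading an in-memory dataframe; the output location is hypothetical:

```python
import intake
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# Intake matches the data to the dataframe container and writes it
# in a suitable format (e.g., parquet) at the given path
source = intake.upload(df, "./shared/mydata")

print(source.yaml())  # YAML representation, ready to paste into a catalog
```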
- class intake.interface.gui.GUI(cats=None)
Top level GUI panel that contains controls and all visible sub-panels
This class is responsible for coordinating the inputs and outputs of various sub-panels and their effects on each other.
- Parameters
- cats: list of catalogs
catalogs used to initialize the cat panel
- Attributes
- children: list of panel objects
children that will be used to populate the panel when visible
- panel: panel layout object
instance of a panel layout (row or column) that contains children when visible
- watchers: list of param watchers
watchers that are set on children - cleaned up when visible is set to false.
- add(*args, **kwargs)
Add to list of cats
- property cats
Cats that have been selected from the cat sub-panel
- classmethod from_state(state)
Create a new object from a serialized existing object.
- property item
Item that is selected
- property source_instance
DataSource instance for the current selection and any parameters
- property sources
Sources that have been selected from the source sub-panel
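A minimal sketch of building the GUI around an existing catalog; this assumes the optional panel dependency is installed, and the catalog path is hypothetical:

```python
import intake
from intake.interface.gui import GUI

cat = intake.open_catalog("catalog.yml")

# Pre-populate the cat sub-panel with one catalog
gui = GUI(cats=[cat])

# The .panel attribute is a panel layout; display it in a notebook,
# or mark it servable for `panel serve`
gui.panel.servable()
```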
Source classes
- class intake.source.csv.CSVSource(*args, **kwargs)
Read CSV files into dataframes
Prototype of sources reading dataframe data
- __init__(urlpath, csv_kwargs=None, metadata=None, storage_options=None, path_as_pattern=True)
- Parameters
- urlpath: str or iterable, location of data
May be a local path, or remote path if including a protocol specifier such as 's3://'. May include glob wildcards or format pattern strings. Some examples:
{{ CATALOG_DIR }}data/precipitation.csv
s3://data/*.csv
s3://data/precipitation_{state}_{zip}.csv
s3://data/{year}/{month}/{day}/precipitation.csv
{{ CATALOG_DIR }}data/precipitation_{date:%Y-%m-%d}.csv
- csv_kwargs: dict
Any further arguments to pass to Dask’s read_csv (such as block size) or to the CSV parser in pandas (such as which columns to use, encoding, data-types)
- storage_options: dict
Any parameters that need to be passed to the remote data backend, such as credentials.
- path_as_pattern: bool or str, optional
Whether to treat the path as a pattern (i.e., data_{field}.csv) and create new columns in the output corresponding to pattern fields. If str, is treated as pattern to match on. Default is True.
- discover()
Open resource and populate the source attributes.
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- read()
Load entire dataset into a container and return it
- read_partition(i)
Return a part of the data corresponding to the i-th partition.
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
- to_dask()
Return a dask container for this data source
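A minimal sketch of using CSVSource directly; the S3 pattern and column names are hypothetical:

```python
from intake.source.csv import CSVSource

# The {state} field in the pattern becomes an extra column in the
# output because path_as_pattern defaults to True
source = CSVSource(
    "s3://data/precipitation_{state}.csv",
    csv_kwargs={"parse_dates": ["date"]},
    storage_options={"anon": True},
)

print(source.discover())  # schema: dtype, shape, npartitions, ...
df = source.read()        # pandas DataFrame from all partitions
ddf = source.to_dask()    # or a lazy dask DataFrame
```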
- class intake.source.zarr.ZarrArraySource(*args, **kwargs)
Read Zarr format files into an array
Zarr is a numerical array storage format which works particularly well with remote and parallel access. For specifics of the format, see https://zarr.readthedocs.io/en/stable/
- __init__(urlpath, storage_options=None, component=None, metadata=None, **kwargs)
The parameters dtype and shape will be determined from the first file, if not given.
- Parameters
- urlpath: str
Location of data file(s), possibly including protocol information
- storage_options: dict
Passed on to storage backend for remote files
- component: str or None
If None, assume the URL points to an array. If given, assume the URL points to a group, and descend the group to find the array at this location in the hierarchy; components are separated by the “/” character.
- kwargs: passed on to dask.array.from_zarr
- discover()
Open resource and populate the source attributes.
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- read()
Load entire dataset into a container and return it
- read_partition(i)
Return a part of the data corresponding to the i-th partition.
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
- to_dask()
Return a dask container for this data source
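A minimal sketch of reading an array nested inside a Zarr group; the bucket and component names are hypothetical:

```python
from intake.source.zarr import ZarrArraySource

source = ZarrArraySource(
    "s3://bucket/data.zarr",
    storage_options={"anon": True},
    component="temperature",  # "/"-separated path within the group
)

darr = source.to_dask()  # lazy dask array
arr = source.read()      # fully-loaded numpy array
```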
- class intake.source.textfiles.TextFilesSource(*args, **kwargs)
Read textfiles as sequence of lines
Prototype of sources reading sequential data.
Takes a set of files, and returns an iterator over the text in each of them. The files can be local or remote. Extra parameters for encoding, etc., go into storage_options.
- __init__(urlpath, text_mode=True, text_encoding='utf8', compression=None, decoder=None, read=True, metadata=None, storage_options=None)
- Parameters
- urlpath: str or list(str)
Target files. Can be a glob-path (with “*”) and include protocol specified (e.g., “s3://”). Can also be a list of absolute paths.
- text_mode: bool
Whether to open the file in text mode, recoding binary characters on the fly
- text_encoding: str
If text_mode is True, apply this encoding. UTF* is by far the most common
- compression: str or None
If given, decompress the file with the given codec on load. Can be something like “gzip”, “bz2”, or to try to guess from the filename, ‘infer’
- decoder: function, str or None
Use this to decode the contents of files. If None, you will get a list of lines of text/bytes. If a function, it must operate on an open file-like object or a bytes/str instance, and return a list
- read: bool
If decoder is not None, this flag controls whether bytes/str get passed to the function indicated (True) or the open file-like object (False)
- storage_options: dict
Options to pass to the file reader backend, including text-specific encoding arguments, and parameters specific to the remote file-system driver, if using.
- discover()
Open resource and populate the source attributes.
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- read()
Load entire dataset into a container and return it
- read_partition(i)
Return a part of the data corresponding to the i-th partition.
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
- to_dask()
Return a dask container for this data source
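A minimal sketch of reading compressed text files as lines; the glob path is hypothetical:

```python
from intake.source.textfiles import TextFilesSource

# Each file is decompressed on load and split into lines of text
source = TextFilesSource("logs/*.txt.gz", compression="gzip")

lines = source.read()   # list of text lines from all files
bag = source.to_dask()  # dask bag for parallel processing
```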
- class intake.source.jsonfiles.JSONFileSource(*args, **kwargs)
Read JSON files as a single dictionary or list
The files can be local or remote. Extra parameters for encoding, etc., go into storage_options.
- __init__(urlpath: str, text_mode: bool = True, text_encoding: str = 'utf8', compression: Optional[str] = None, read: bool = True, metadata: Optional[dict] = None, storage_options: Optional[dict] = None)
- Parameters
- urlpath: str
Target file. Can include protocol specified (e.g., “s3://”).
- text_mode: bool
Whether to open the file in text mode, recoding binary characters on the fly
- text_encoding: str
If text_mode is True, apply this encoding. UTF* is by far the most common
- compression: str or None
If given, decompress the file with the given codec on load. Can be something like “zip”, “gzip”, “bz2”, or to try to guess from the filename, ‘infer’
- storage_options: dict
Options to pass to the file reader backend, including text-specific encoding arguments, and parameters specific to the remote file-system driver, if using.
- discover()
Open resource and populate the source attributes.
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- read()
Load entire dataset into a container and return it
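A minimal sketch; the file path is hypothetical:

```python
from intake.source.jsonfiles import JSONFileSource

# The whole file is parsed as a single JSON document
source = JSONFileSource("config.json")
obj = source.read()  # the resulting dict or list
```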
- class intake.source.jsonfiles.JSONLinesFileSource(*args, **kwargs)
Read a JSONL (https://jsonlines.org/) file and return a list of objects, each being a valid JSON object (e.g. a dictionary or list)
- __init__(urlpath: str, text_mode: bool = True, text_encoding: str = 'utf8', compression: Optional[str] = None, read: bool = True, metadata: Optional[dict] = None, storage_options: Optional[dict] = None)
- Parameters
- urlpath: str
Target file. Can include protocol specified (e.g., “s3://”).
- text_mode: bool
Whether to open the file in text mode, recoding binary characters on the fly
- text_encoding: str
If text_mode is True, apply this encoding. UTF* is by far the most common
- compression: str or None
If given, decompress the file with the given codec on load. Can be something like “zip”, “gzip”, “bz2”, or to try to guess from the filename, ‘infer’.
- storage_options: dict
Options to pass to the file reader backend, including text-specific encoding arguments, and parameters specific to the remote file-system driver, if using.
- discover()
Open resource and populate the source attributes.
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- read()
Load entire dataset into a container and return it
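A minimal sketch; the file path is hypothetical:

```python
from intake.source.jsonfiles import JSONLinesFileSource

# Each line of the file parses to one JSON object
source = JSONLinesFileSource("events.jsonl")
records = source.read()  # list of parsed objects, one per line
```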
- class intake.source.npy.NPySource(*args, **kwargs)
Read numpy binary files into an array
Prototype source showing example of working with arrays
Each file becomes one or more partitions, but partitioning within a file is only along the largest dimension, to ensure contiguous data.
- __init__(path, dtype=None, shape=None, chunks=None, storage_options=None, metadata=None)
The parameters dtype and shape will be determined from the first file, if not given.
- Parameters
- path: str or list of str
Location of data file(s), possibly including glob and protocol information
- dtype: str dtype spec
If known, the dtype (e.g., “int64” or “f4”).
- shape: tuple of int
If known, the length of each axis
- chunks: int
Size of chunks within a file along biggest dimension - must exactly divide each file, or None for one partition per file.
- storage_options: dict
Passed to file-system backend.
- discover()
Open resource and populate the source attributes.
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- read()
Load entire dataset into a container and return it
- read_partition(i)
Return a part of the data corresponding to the i-th partition.
By default, assumes i should be an integer between zero and npartitions; override for more complex indexing schemes.
- to_dask()
Return a dask container for this data source
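A minimal sketch of round-tripping a small array; the file path is hypothetical:

```python
import numpy as np
from intake.source.npy import NPySource

np.save("data.npy", np.arange(12).reshape(4, 3))

# chunks must exactly divide the largest dimension: 2 rows per partition
source = NPySource("data.npy", chunks=2)

arr = source.read()      # numpy array (dtype/shape inferred from the file)
darr = source.to_dask()  # chunked dask array
```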
- class intake.catalog.local.YAMLFileCatalog(*args, **kwargs)
Catalog as described by a single YAML file
- __init__(path=None, text=None, autoreload=True, **kwargs)
- Parameters
- path: str
Location of the file to parse (can be remote)
- text: str (DEPRECATED)
YAML contents of catalog, takes precedence over path
- autoreload: bool
Whether to watch the source file for changes; set to False if you want an editable Catalog
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- reload()
Reload catalog if sufficient time has passed
- walk(sofar=None, prefix=None, depth=2)
Get all entries in this catalog and sub-catalogs
- Parameters
- sofar: dict or None
Within recursion, use this dict for output
- prefix: list of str or None
Names of levels already visited
- depth: int
Number of levels to descend; needed to truncate circular references and for cleaner output
- Returns
- Dict where the keys are the entry names in dotted syntax, and the values are entry instances.
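A minimal sketch of loading and walking a single catalog file; the path is hypothetical:

```python
from intake.catalog.local import YAMLFileCatalog

cat = YAMLFileCatalog("catalog.yml")

# walk() flattens nested catalogs into dotted entry names
print(list(cat.walk(depth=2)))
```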
- class intake.catalog.local.YAMLFilesCatalog(*args, **kwargs)
Catalog as described by multiple YAML files
- __init__(path, flatten=True, **kwargs)
- Parameters
- path: str
Location of the files to parse (can be remote), including possible glob (*) character(s). Can also be list of paths, without glob characters.
- flatten: bool (True)
Whether to list all entries in the cats at the top level (True) or create sub-cats from each file (False).
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- reload()
Reload catalog if sufficient time has passed
- walk(sofar=None, prefix=None, depth=2)
Get all entries in this catalog and sub-catalogs
- Parameters
- sofar: dict or None
Within recursion, use this dict for output
- prefix: list of str or None
Names of levels already visited
- depth: int
Number of levels to descend; needed to truncate circular references and for cleaner output
- Returns
- Dict where the keys are the entry names in dotted syntax, and the values are entry instances.
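A minimal sketch; the glob path is hypothetical:

```python
from intake.catalog.local import YAMLFilesCatalog

# flatten=True merges entries from all files into one namespace;
# flatten=False would make each file a sub-catalog instead
cat = YAMLFilesCatalog("cats/*.yml", flatten=True)
print(list(cat))
```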
- class intake.catalog.zarr.ZarrGroupCatalog(*args, **kwargs)
A catalog of the members of a Zarr group.
- __init__(urlpath, storage_options=None, component=None, metadata=None, consolidated=False, name=None)
- Parameters
- urlpath: str
Location of data file(s), possibly including protocol information
- storage_options: dict, optional
Passed on to storage backend for remote files
- component: str, optional
If None, build a catalog from the root group. If given, build the catalog from the group at this location in the hierarchy.
- metadata: dict, optional
Catalog metadata. If not provided, will be populated from Zarr group attributes.
- consolidated: bool, optional
If True, assume Zarr metadata has been consolidated.
- export(path, **kwargs)
Save this data for sharing with other people
Creates a copy of the data in a format appropriate for its container, in the location specified (which can be remote, e.g., s3).
Returns the resultant source object, so that you can, for instance, add it to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).
- persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
- Parameters
- ttl: numeric, optional
Time to live in seconds. If provided, the original source will be accessed and a new persisted version written transparently when more than ttl seconds have passed since the old persisted version was written.
- kwargs: passed to the _persist method on the base container.
- reload()
Reload catalog if sufficient time has passed
- walk(sofar=None, prefix=None, depth=2)
Get all entries in this catalog and sub-catalogs
- Parameters
- sofar: dict or None
Within recursion, use this dict for output
- prefix: list of str or None
Names of levels already visited
- depth: int
Number of levels to descend; needed to truncate circular references and for cleaner output
- Returns
- Dict where the keys are the entry names in dotted syntax, and the values are entry instances.
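A minimal sketch of cataloging the members of a Zarr group; the bucket location is hypothetical:

```python
from intake.catalog.zarr import ZarrGroupCatalog

# Each array in the group becomes a catalog entry
cat = ZarrGroupCatalog(
    "s3://bucket/group.zarr",
    storage_options={"anon": True},
    consolidated=True,  # use consolidated metadata if available
)
print(list(cat))  # member names of the group
```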