API Reference
User Functions
|
Intake's dict-like config system |
|
Show which Intake data types can apply to the given details |
|
Create pipeline from given URL to desired output type |
|
Find possible conversion paths from start to end types |
|
A collection of data and reader descriptions. |
|
Defines some data: class and arguments. |
|
A serialisable description of a reader or pipeline |
|
Show which readers claim to support the given data instance or a superclass |
|
Attempt to construct a reader instance by finding one that matches the function call |
|
Inspect a dataset at url and return a summary dictionary. |
- class intake.config.Config(filename=None, **kwargs)
Intake’s dict-like config system
Instance intake.conf is globally used throughout the package
- Attributes:
- environment_conf_parse: str
“ignore” (default), “warn” or raise an “error” when parsing local environment variables as strings.
- get(key, default=None)
Return the value for key if key is in the dictionary, else default.
- load(fn=None)
Update global config from YAML file
If fn is None, looks in global config directory, which is either defined by the INTAKE_CONF_DIR env-var or is ~/.intake/ .
- load_env()
Analyse environment variables and update conf accordingly
- reset()
Set conf values back to defaults
- save(fn=None)
Save current configuration to file as YAML
Uses self.filename for target location
- set(update_dict=None, **kw)
Change config values within a context or for the session
- values: dict
This can be deeply nested to set only leaf values
See also: intake.readers.utils.nested_keys_to_dict
Examples
Value resets after context ends
>>> with intake.conf.set(myval=5):
...     ...
Set for whole session
>>> intake.conf.set(myval=5)
Set only a single leaf value within a nested dict
>>> intake.conf.set(intake.readers.utils.nested_keys_to_dict({"deep.2.key": True}))
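The context-scoped and leaf-only behaviour described above can be illustrated with a minimal, standard-library sketch. It does not use Intake: `dotted_to_nested`, `merge_leaves` and `temporary_set` are hypothetical helpers, and keeping every path component as a string is an assumption (the real `nested_keys_to_dict` may treat numeric parts differently).

```python
import copy
from contextlib import contextmanager


def dotted_to_nested(flat):
    """Expand {"a.b.c": v} into {"a": {"b": {"c": v}}} (hypothetical helper)."""
    out = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = out
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return out


def merge_leaves(target, update):
    """Recursively update target so that only the given leaves change."""
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            merge_leaves(target[key], value)
        else:
            target[key] = value


@contextmanager
def temporary_set(conf, **kw):
    """Apply values for the duration of a with-block, then restore."""
    saved = copy.deepcopy(conf)
    merge_leaves(conf, kw)
    try:
        yield conf
    finally:
        conf.clear()
        conf.update(saved)


conf = {"deep": {"2": {"key": False, "other": 1}}, "myval": 0}
with temporary_set(conf, **dotted_to_nested({"deep.2.key": True})):
    inside = copy.deepcopy(conf)
```

Inside the block only the `key` leaf changes while its sibling `other` is untouched; after the block the original values are back.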
- intake.readers.datatypes.recommend(url: str | None = None, mime: str | None = None, head: bool = True, contents: bool = False, storage_options=None, ignore: set[str] | None = None) set[BaseData]
Show which Intake data types can apply to the given details
- Parameters:
- url: str
Location of data
- mime: str
MIME type, usually “x/y” form
- head: bytes | bool | None
A small number of bytes from the file head, for seeking magic bytes. If True, fetch these bytes from the given URL/storage_options and use them. If None, only fetch bytes if there is no match by mime type or path. If False, don’t fetch at all.
- contents: bool | None
Attempt to delve into URL to analyse constituent files. This can significantly slow your recommendation.
- storage_options: dict | None
If passing a URL which might be a remote file, storage_options can be used by fsspec.
- ignore: set | None
Don’t include these in the output
- Returns:
- set of matching datatype classes.
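As a rough illustration of how the three kinds of evidence (MIME type, path, magic bytes) can each contribute matches, here is a standard-library sketch. The `KNOWN_TYPES` registry and `recommend_sketch` are hypothetical and return type names rather than classes; this is not Intake's implementation, and `head` is assumed to be bytes already in hand.

```python
# Hypothetical registry: name -> (MIME type, file extension, magic bytes).
KNOWN_TYPES = {
    "PNG": ("image/png", ".png", b"\x89PNG"),
    "Parquet": ("application/vnd.apache.parquet", ".parquet", b"PAR1"),
    "CSV": ("text/csv", ".csv", None),
}


def recommend_sketch(url=None, mime=None, head=None, ignore=None):
    """Return the names of every registered type matched by any clue."""
    ignore = ignore or set()
    found = set()
    for name, (known_mime, ext, magic) in KNOWN_TYPES.items():
        if name in ignore:
            continue
        if mime is not None and mime == known_mime:
            found.add(name)
        if url is not None and url.lower().endswith(ext):
            found.add(name)
        if head and magic is not None and head.startswith(magic):
            found.add(name)
    return found
```

Because the result is a set, conflicting clues (for example a `.csv` path served with an image MIME type) produce several candidates rather than an error.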
- intake.readers.convert.auto_pipeline(url: str | BaseData, outtype: str | tuple[str] = '', storage_options: dict | None = None, avoid: list[str] | None = None, prefer: list[str] | None = None, exclude: list[str] | None = None) Pipeline
Create pipeline from given URL to desired output type
Will search for the shortest conversion path from the inferred data-type to the output.
- Parameters:
- url: input data, usually a location/URL, but maybe a data instance
- outtype: pattern to match to possible output types (instance or last converter)
- storage_options: if url is a remote str, these are kwargs that fsspec may need to access it
- avoid: don’t consider readers whose names match any of these strings
- prefer:
List of substring patterns (case-insensitive) matched against reader class names. Matching readers are tried before non-matching ones when multiple candidates satisfy the path. Example: prefer=["Polars", "Duck"].
- exclude:
List of substring patterns (case-insensitive) matched against reader class names. Any reader whose class name matches is removed from consideration. Example: exclude=["Spark", "Ray"].
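The "shortest conversion path" idea can be sketched as a breadth-first search over a graph of types. Everything below is an assumption made for illustration: the graph, the step names and `shortest_conversion` are hypothetical, and Intake's actual search may differ.

```python
from collections import deque


def shortest_conversion(graph, start, outtype, avoid=()):
    """Breadth-first search for the fewest steps from start to a matching type.

    graph maps a type name to a list of (step_name, produced_type) pairs.
    """
    def avoided(name):
        return any(pattern.lower() in name.lower() for pattern in avoid)

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, steps = queue.popleft()
        if outtype.lower() in node.lower():
            return steps
        for step_name, produced in graph.get(node, []):
            if produced in seen or avoided(step_name):
                continue
            seen.add(produced)
            queue.append((produced, steps + [step_name]))
    return None  # no path found


# Hypothetical conversion graph, for illustration only.
GRAPH = {
    "CSV": [("PandasCSV", "pandas:DataFrame"), ("DaskCSV", "dask:DataFrame")],
    "pandas:DataFrame": [("PandasToXarray", "xarray:Dataset")],
    "dask:DataFrame": [("DaskToPandas", "pandas:DataFrame")],
}
```

Avoiding a step name forces a longer route when one exists: with `avoid=["PandasCSV"]` the search has to go through the Dask reader first.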
- class intake.readers.entry.Catalog(entries: Iterable[ReaderDescription] | Mapping | None = None, aliases: dict[str, int] | None = None, data: Iterable[DataDescription] | Mapping = None, user_parameters: dict[str, BaseUserParameter] | None = None, parameter_overrides: dict[str, Any] | None = None, metadata: dict | None = None)
A collection of data and reader descriptions.
- add_entry(entry, name: str | None = None, clobber: bool = True, simplify: bool = False)
Add entry/reader (and its requirements) in-place, with optional alias
- Parameters:
- entry: instance of BaseData, BaseReader or their descriptions
- name: set the key value the item will be known as
- clobber: if False, will not overwrite an entry
- simplify: if True, checks if an equivalent entity already exists, and returns its token if found. Such comparisons are relatively slow when you have many more than 100 entries.
- delete(name, recursive=False)
Remove named entity (data/entry) from catalog
We do not check whether any other entity in the catalog refers to what is being deleted, so you can break other entries this way.
- Parameters:
- recursive: bool
Also remove data/entries referenced by the given one, and those they refer to in turn.
- extract_parameter(item: str, name: str, path: str | None = None, value: ~typing.Any = None, cls=<class 'intake.readers.user_parameters.SimpleUserParameter'>, store_to: str | None = None, **kw)
Descend into data & reader descriptions to create a user_parameter
There are two ways to find and replace values by a template:
- if path is given, the kwargs will be walked to this location, e.g., “field.0.special_value” -> kwargs[“field”][0][“special_value”]
- if value is given, all kwargs will be recursively walked, looking for values that equal that given.
Matched values will be replaced by a template string like "{name}", and a user_parameter of class cls will be placed in the location given by store_to (could be “data”, “catalog”).
- classmethod from_dict(data)
Assemble catalog from dict representation
- static from_yaml_file(path: str, **kwargs)
Load YAML representation into a new Catalog instance
- storage_options:
kwargs to pass to fsspec for opening the file to read; can pass as storage_options= or will pick up any unused kwargs for simplicity
- get_entity(item: str)
Get the objects by reference
Use this method if you want to change the catalog in-place
item can be an entry in .aliases, in which case the original will be returned, or a key in .entries, .user_parameters or .data. The entity in question is returned without processing.
- give_name(tok: str, name: str, clobber=True)
Give an alias to a dataset
- tok:
a key in the .entries dict
- move_parameter(from_entity: str, to_entity: str, parameter_name: str) Catalog
Move user-parameter between entry/data
entity is an alias name or entry/data token
- promote_parameter_name(parameter_name: str, level: str = 'cat') Catalog
Find and promote given named parameter, assuming they are all identical
- parameter_name:
the key string referring to the parameter
- level: cat | data
If the parameter is found in a reader, it can be promoted to the data it depends on. Parameters in a data description can only be promoted to a catalog global.
- search(expr) Catalog
Make new catalog with a subset of this catalog
The new catalog will have those entries which pass the filter expr, which is an instance of intake.readers.search.SearchBase (i.e., has a method like filter(entry) -> bool).
In the special case that expr is just a string, the Text search expression will be used.
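A minimal sketch of the filtering contract, using plain dicts in place of catalog entries. `TextTerm` and `search_sketch` are hypothetical; in particular, which fields the real Text search inspects is not stated here, so the substring match over `str(entry)` is purely an assumption.

```python
class TextTerm:
    """Hypothetical search term: case-insensitive substring match."""

    def __init__(self, text):
        self.text = text.lower()

    def filter(self, entry):
        return self.text in str(entry).lower()


def search_sketch(entries, expr):
    """Keep the entries for which expr.filter(entry) is true."""
    if isinstance(expr, str):
        # A bare string is wrapped in the text term, mirroring the special case.
        expr = TextTerm(expr)
    return {key: entry for key, entry in entries.items() if expr.filter(entry)}


entries = {
    "a": {"reader": "PandasCSV", "metadata": {"description": "Daily rainfall"}},
    "b": {"reader": "XArrayReader", "metadata": {"description": "Temperature grid"}},
}
```

Any object with a compatible `filter` method works, so custom terms can be passed in the same way as a string.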
- class intake.readers.entry.DataDescription(datatype: str, kwargs: dict = None, metadata: dict = None, user_parameters: dict = None)
Defines some data: class and arguments. This may be loaded in a number of ways.
A DataDescription normally resides in a Catalog, and can contain templated arguments. When there are user_parameters, these will also be applied to any reader that depends on this data.
- get_kwargs(user_parameters: dict[str | BaseUserParameter] | None = None, **kwargs) dict[str, Any]
Get set of kwargs for given reader, based on prescription, new args and user parameters
Here, user_parameters is intended to come from the containing catalog. To provide values for a user parameter, include it by name in kwargs
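The interplay of stored arguments, overrides and user parameters might look roughly like the sketch below. It assumes that templates are plain `str.format` placeholders and that user parameters are a simple name-to-default mapping; `get_kwargs_sketch` is hypothetical and is not the library's code.

```python
def get_kwargs_sketch(stored, user_parameters, **kwargs):
    """Merge stored kwargs with overrides, then fill "{name}" templates."""
    values = dict(user_parameters)  # name -> default value
    for name in list(kwargs):
        if name in values:
            # A kwarg named after a user parameter supplies its value.
            values[name] = kwargs.pop(name)
    merged = {**stored, **kwargs}  # remaining kwargs override stored ones
    return {
        key: value.format(**values) if isinstance(value, str) else value
        for key, value in merged.items()
    }


# Placeholder path, for illustration only.
stored = {"url": "data/{year}.csv", "sep": ","}
```

Passing `year=2021` fills the template, while `sep=";"` is treated as an ordinary argument override because no user parameter has that name.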
- class intake.readers.entry.ReaderDescription(reader: str, kwargs: dict[str, Any] | None = None, user_parameters: dict[str | BaseUserParameter] | None = None, metadata: dict | None = None, output_instance: str | None = None)
A serialisable description of a reader or pipeline
This class is typically stored inside Catalogs, and can contain templated arguments which get evaluated at the time that it is accessed from a Catalog.
- check_imports()
Are the packages listed in the “imports” key of the metadata available?
- extract_parameter(name: str, path=None, value=None, cls=<class 'intake.readers.user_parameters.SimpleUserParameter'>, **kw)
Creates new version of the description
Creates new instance, since the token will in general change
- classmethod from_dict(data)
Recreate instance from the results of to_dict()
- get_kwargs(user_parameters=None, **kwargs) dict[str, Any]
Get set of kwargs for given reader, based on prescription, new args and user parameters
Here, user_parameters is intended to come from the containing catalog. To provide values for a user parameter, include it by name in kwargs
- to_cat(name=None)
Create a Catalog containing only this entry
- intake.readers.readers.recommend(data)
Show which readers claim to support the given data instance or a superclass
The ordering is more specific readers first
- intake.readers.readers.reader_from_call(func: str, *args, join_lines=False, **kwargs) BaseReader
Attempt to construct a reader instance by finding one that matches the function call
Fails for readers that don’t define a func, probably because it depends on the file type or needs a dynamic instance to be a method of.
- Parameters:
- func: callable | str
If a callable, pass args and kwargs as you would have done to execute the function. If a string, it should look like "func(arg1, args2, kwarg1, **kw)", i.e., a normal python call but as a string. In the latter case, args and kwargs are ignored
- intake.readers.inspect.inspect_dataset(url: str, storage_options: dict | None = None, max_bytes: int = 50000000, timeout: float | None = 30.0, metadata: dict | None = None, prefer: list[str] | None = None, exclude: list[str] | None = None, retry: bool = True) dict
Inspect a dataset at url and return a summary dictionary.
- Parameters:
- url:
Location of the data. Any fsspec-compatible URL is accepted (s3://, gs://, https://, local path, …).
- storage_options:
Keyword arguments forwarded to fsspec (credentials, etc.).
- max_bytes:
Maximum file size (bytes) for which a Tier-3 (full-read) reader will be attempted. Set to None to disable the guard entirely.
- timeout:
Wall-clock seconds to allow for each discover() call. None disables the timeout. Note: the background thread may continue after a timeout is triggered.
- metadata:
Extra metadata dict merged into the BaseData instance.
- prefer:
List of substring patterns (case-insensitive) matched against reader class names. Matching readers are moved to the front of the candidate list (while still sorted by tier within the preferred group). Example: prefer=["Polars", "Duck"].
- exclude:
List of substring patterns (case-insensitive). Any reader whose class name contains one of these patterns is removed from the candidate list entirely before any attempt is made. Example: exclude=["Spark", "Ray"].
- retry:
If True (default), when the chosen reader’s discover() raises or times out, the next candidate in the ordered list is tried automatically, continuing until one succeeds or the list is exhausted. If False, the first failure is recorded and the function returns immediately without trying further readers.
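The way prefer, exclude and retry combine can be sketched without Intake. The reader names, tiers and both helper functions below are hypothetical, and `discover` stands in for whatever callable performs the read.

```python
def order_candidates(readers, prefer=(), exclude=()):
    """readers is a list of (class_name, tier); returns the attempt order."""
    def matches(name, patterns):
        return any(pattern.lower() in name.lower() for pattern in patterns)

    kept = [r for r in readers if not matches(r[0], exclude)]
    # Preferred readers first; within each group, lower tier first.
    return sorted(kept, key=lambda r: (not matches(r[0], prefer), r[1]))


def try_readers(candidates, discover, retry=True):
    """Attempt discover(name) in order, recording attempts and errors."""
    attempted, errors = [], []
    for name, tier in candidates:
        attempted.append(name)
        try:
            discover(name)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            if not retry:
                break
            continue
        return {"reader_used": name, "reader_tier": tier,
                "readers_attempted": attempted, "errors": errors}
    return {"reader_used": None, "reader_tier": None,
            "readers_attempted": attempted, "errors": errors}


def flaky_discover(name):
    # Stand-in for a real discover(): the Polars reader fails here.
    if "Polars" in name:
        raise RuntimeError("polars not installed")


READERS = [("PandasCSV", 3), ("DaskCSV", 2), ("PolarsCSV", 2), ("SparkCSV", 1)]
ordered = order_candidates(READERS, prefer=["Polars"], exclude=["Spark"])
```

With `retry=True` the failed preferred reader is recorded and the next candidate is used; with `retry=False` the loop stops at the first failure and no reader is reported as used.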
- Returns:
- dict with keys:
- url: The input URL.
- detected_type: Class name of the first matching BaseData subclass, or None.
- detected_type_qname: Fully-qualified name ("module:Class"), or None.
- structure: Set of structural tags from the datatype (e.g. {"table"}).
- reader_used: Class name of the reader that ultimately succeeded, or None.
- reader_tier: Integer 1/2/3 for the reader that succeeded, or None.
- readers_attempted: Ordered list of reader class names that were tried (including failures).
- description: Value of metadata["description"] from the data instance, if any.
- datashape: Dict of schema information (columns + dtypes, or xarray dims, etc.). Does not include shape, which lives exclusively at the top-level shape key.
- shape: List of integer dimensions (e.g. [1000, 4]), or None when the shape cannot be determined without a full scan (lazy DataFrames, partial reads, etc.).
- npartitions: Number of partitions as reported by the discovered object (Dask, Ray, etc.). For file-based data with no in-memory partition count this falls back to n_files.
- n_files: Number of individual files that make up the dataset (after glob expansion), or None if the URL is not file-based / unknowable.
- file_size_bytes: Total size in bytes across all files, or None if any file’s size could not be determined or the URL is not file-based.
- repr: Plain-text repr() of the discovered object (capped at 1000 chars).
- html_repr: HTML string from _repr_html_() / _repr_svg_(), or None.
- thumbnail: data:image/png;base64,… URI, or None.
- metadata: The metadata dict attached to the BaseData / BaseReader.
- readers: Dict mapping every candidate reader class name to a sub-dict with keys "importable" (bool) and "tier" (int 1/2/3). Whether a reader is importable reflects the current environment only; another machine may have different packages installed.
- errors: List of error strings for non-fatal problems encountered.
Base Classes
These may be subclassed by developers
|
Prototype dataset definition |
|
Converts from one object type to another |
|
A set of functions as an accessor on a Reader, producing a Pipeline |
|
Prototype for a single term in a search expression |
|
The base class allows for any default without checking/coercing |
- class intake.readers.datatypes.BaseData(metadata: dict[str, Any] | None = None)
Prototype dataset definition
- auto_pipeline(outtype: str | tuple[str], avoid: list[str] | None = None, prefer: list[str] | None = None, exclude: list[str] | None = None)
Find a pipeline to transform from this to the given output type
- Parameters:
- outtype:
Pattern matched against possible output types / converter names.
- avoid:
Reader/converter names (substring patterns) to exclude from the graph search entirely.
- prefer:
Substring patterns (case-insensitive) matched against reader class names. Matching readers are tried before others when multiple candidates exist.
- exclude:
Substring patterns (case-insensitive) matched against reader class names. Matching readers are removed from consideration.
- magic: set[bytes | tuple] = {}
binary patterns, usually at the file head; each item identifies this data type
- property possible_outputs
Map of importable readers to the expected output class of each
- property possible_readers
List of reader classes for this type, grouped by importability
- to_entry()
Create DataDescription version of this, for placing in a Catalog
- to_reader(type_or_reader=None, outtype: str | None = None, reader: str | None = None, prefer: list[str] | None = None, exclude: list[str] | None = None, **kw)
Find an appropriate reader for this data
If all Nones are passed, the first importable reader will be picked. If there is any selection, you will get ValueError on failure.
See also .possible_outputs
- Parameters:
- type_or_reader: matches either on type or reader name, whichever is found first
- outtype: string to match against the output classes of potential readers
- reader: string to match against the class names of the readers
- prefer:
List of substring patterns (case-insensitive). Matching readers are tried before non-matching ones when multiple candidates satisfy the selection criteria. Example: prefer=["Polars", "Duck"].
- exclude:
List of substring patterns (case-insensitive). Any reader whose class name matches is removed from consideration entirely. Example: exclude=["Spark", "Ray"].
- to_reader_cls(type_or_reader=None, outtype: tuple[str] | str | None = None, reader: tuple[str] | str | type | None = None, prefer: list[str] | None = None, exclude: list[str] | None = None)
Return the reader class best suited for this data instance.
- Parameters:
- type_or_reader:
Convenience argument: tried first as outtype, then as reader.
- outtype:
Substring pattern(s) matched (case-insensitively) against each candidate reader’s output_instance string.
- reader:
Either a fully-qualified import string ("pandas:read_csv"), a reader class directly, or a substring pattern matched case-insensitively against each candidate reader’s qualified name.
- prefer:
List of substring patterns (case-insensitive). Matching readers are tried before non-matching ones when multiple candidates satisfy outtype or reader. Has no effect when a bare reader class or exact import string is given.
- exclude:
List of substring patterns (case-insensitive). Any reader whose class name matches is removed from consideration entirely.
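The selection rules for type_or_reader, outtype and reader can be sketched over a list of (qualified name, output_instance) pairs. The candidate list and `pick_reader` are hypothetical; the sketch covers only the substring-pattern cases, not exact import strings or reader classes.

```python
def pick_reader(candidates, type_or_reader=None, outtype=None, reader=None):
    """candidates is a list of (qualified_name, output_instance) pairs."""
    def by_outtype(pattern):
        return [c for c in candidates if pattern.lower() in c[1].lower()]

    def by_reader(pattern):
        return [c for c in candidates if pattern.lower() in c[0].lower()]

    if type_or_reader is not None:
        # Tried first as an output type, then as a reader name.
        found = by_outtype(type_or_reader) or by_reader(type_or_reader)
    elif outtype is not None:
        found = by_outtype(outtype)
    elif reader is not None:
        found = by_reader(reader)
    else:
        found = list(candidates)  # no selection: first candidate wins
    if not found:
        raise ValueError("no reader matches the given selection")
    return found[0]


# Hypothetical candidates, for illustration only.
CANDIDATES = [
    ("PandasCSV", "pandas:DataFrame"),
    ("PolarsCSV", "polars:DataFrame"),
    ("DaskCSV", "dask.dataframe:DataFrame"),
]
```

A selection that matches nothing raises `ValueError`, whereas passing no selection at all simply returns the first candidate.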
- class intake.readers.readers.BaseReader(*args, metadata: dict | None = None, output_instance: str | None = None, **kwargs)
- property data
The BaseData this reader depends on, if it has one
- discover(**kwargs)
Part of the data
The intent is to return a minimal dataset, but for some readers and conditions this may be up to the whole of the data. Output type is the same as for read().
- classmethod doc()
Doc associated with loading function
- classmethod is_ok(data) bool
Determine whether this reader is suitable for the given data instance.
This is called after the type-based implements check and allows a reader to inspect the properties of a concrete data instance (e.g. the shape of its URL, whether it is a remote resource, etc.) to decide whether it should be recommended.
Override this in subclasses to add instance-level constraints on top of the class-level implements declaration.
- Parameters:
- data:
The BaseData instance being evaluated.
- Returns:
- bool
True if this reader can handle the data instance, False to exclude it from the recommend() results.
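The two-stage check (class-level implements, then instance-level is_ok) can be sketched with stand-in classes. None of these classes come from Intake; the "local files only" rule is an invented constraint used purely to show an override.

```python
class CSV:
    """Stand-in data type holding only a URL."""

    def __init__(self, url):
        self.url = url


class Image:
    def __init__(self, url):
        self.url = url


class AnyCSVReader:
    implements = {CSV}

    @classmethod
    def is_ok(cls, data):
        return True  # default: no instance-level constraint


class LocalCSVReader(AnyCSVReader):
    @classmethod
    def is_ok(cls, data):
        # Hypothetical constraint: reject anything with a protocol prefix.
        return "://" not in data.url


def recommend_sketch(data, readers):
    """Type check first, then let each reader inspect the instance."""
    return [
        reader for reader in readers
        if isinstance(data, tuple(reader.implements)) and reader.is_ok(data)
    ]


READERS = [AnyCSVReader, LocalCSVReader]
```

A remote URL passes the type check for both readers but only the unconstrained one survives the instance-level check.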
- prefer_for_inspect: bool = True
Whether this reader should be preferred by inspect_dataset().
Set to False on readers whose output is designed purely for interactive display (e.g. panel.pane.Image) and carries no queryable schema. Such readers are tried last by inspect_dataset, only after every reader with prefer_for_inspect = True has been exhausted.
- read(*args, **kwargs)
Produce data artefact
Any of the arguments encoded in the data instance can be overridden.
Output type is given by the .output_instance attribute
- to_cat(name=None)
Create a Catalog containing only this reader
- to_entry()
Create an entry version of this, ready to be inserted into a Catalog
- class intake.readers.convert.BaseConverter(*args, metadata: dict | None = None, output_instance: str | None = None, **kwargs)
Converts from one object type to another
Most often, subclasses call a single function on the data, but arbitrarily complex transforms are possible. This is designed to be one step in a Pipeline.
Subclasses should set:
- instances: A {input_type_qname: output_type_qname} mapping. Keys and values are "module:Class" strings matching output_instance on readers.
- func: The primary callable as a "module:name" string. Used by doc() and _func (inherited from BaseReader). Subclasses that perform more than one function call should still set func to the main entry-point for documentation purposes, and override run().
- is_ok: Override to reject in-memory objects that this converter cannot handle even when the type name matches (e.g. wrong ndim or dtype).
- classmethod doc()
Documentation for this conversion step.
Mirrors BaseReader.doc so that converters participate in the same help/introspection conventions as readers.
- classmethod is_ok(x) bool
Return True if this converter can handle the concrete object x.
This is the converter analogue of BaseReader.is_ok(). The default implementation always returns True; subclasses override it to enforce constraints on the in-memory object (e.g. ndim, dtype, shape) that go beyond the type-name match in instances.
- Parameters:
- x:
The concrete in-memory object that would be passed to run().
- Returns:
- bool
False to exclude this converter from convert_classes() results for this particular object.
- run(x, *args, **kwargs)
Execute a conversion stage on the output object from another stage
Subclasses may override this
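The three pieces a converter declares (instances, func, is_ok) plus run() can be modelled with a stand-alone class. `ListToTuple`, `qname` and `usable_converters` are hypothetical and do not subclass the real BaseConverter; the non-empty rule in `is_ok` is an invented constraint.

```python
class ListToTuple:
    """Stand-in converter: turns a list into a tuple."""

    instances = {"builtins:list": "builtins:tuple"}
    func = "builtins:tuple"

    @classmethod
    def is_ok(cls, x):
        # Hypothetical instance-level constraint: refuse empty lists.
        return len(x) > 0

    def run(self, x, *args, **kwargs):
        return tuple(x)


def qname(obj):
    """Build the "module:Class" string for an object's type."""
    t = type(obj)
    return f"{t.__module__}:{t.__qualname__}"


def usable_converters(x, converters):
    """Type-name match first, then the instance-level is_ok check."""
    return [c for c in converters if qname(x) in c.instances and c.is_ok(x)]
```

An empty list matches the type name but is rejected by `is_ok`, while a string is rejected earlier because its type name is not a key in `instances`.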
- class intake.readers.namespaces.Namespace(reader)
A set of functions as an accessor on a Reader, producing a Pipeline
- class intake.readers.search.SearchBase
Prototype for a single term in a search expression
The method filter() is meant to be overridden in subclasses.
- filter(entry: ReaderDescription) bool
Does the given ReaderDescription entry match the query?
- class intake.readers.user_parameters.BaseUserParameter(default, description='')
The base class allows for any default without checking/coercing
- coerce(value)
Change given type to one that matches this parameter’s intent
- default
the value to use without user input
- description
what is the function of this parameter
- set_default(value)
Change the default, if it validates
- to_dict()
Dictionary representation of the instance's contents
- with_default(value)
A new instance with different default, if it validates
(original object is left unchanged)
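The difference between set_default (in place) and with_default (new instance) can be shown with a small stand-in class. `IntParameterSketch` is hypothetical; treating a failed `int()` conversion as "does not validate" is an assumption made for the example.

```python
import copy


class IntParameterSketch:
    """Stand-in user parameter that coerces its default to int."""

    def __init__(self, default, description=""):
        self.description = description
        self.default = self.coerce(default)

    def coerce(self, value):
        return int(value)  # raises ValueError if the value is unusable

    def set_default(self, value):
        # Coerce first, so a bad value leaves the old default in place.
        self.default = self.coerce(value)

    def with_default(self, value):
        new = copy.copy(self)
        new.set_default(value)
        return new

    def to_dict(self):
        return {"default": self.default, "description": self.description}


param = IntParameterSketch("3", "number of rows")
other = param.with_default(10)
```

After `with_default` the original still holds 3, and a value that cannot be coerced raises without altering the existing default.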
Data Classes
|
Advanced Scientific Data Format |
|
Structured record passing file format |
|
Human-readable tabular format, Comma Separated Values |
|
Datatypes that are groupings of other data |
|
An API endpoint capable of describing Intake catalogs |
|
Intake catalog expressed as YAML |
|
Imaging data usually from medical scans |
|
Indexed set of parquet files with versioning and diffs |
|
The well-known spreadsheet app's file format |
|
Tabular or array data in text/binary format common in astronomy |
|
Deprecated tabular format from the Arrow project (Feather v1) |
|
Datatypes loaded from files, local or remote |
|
One of the filetypes at https://gdal.org/drivers/raster/index.html |
|
One of the filetypes at https://gdal.org/drivers/vector/index.html |
|
"Gridded" file format commonly used in meteo forecasting |
|
Geo data (position and geometries) within JSON |
|
Geo data (position and geometries) in a SQLite DB file |
|
Hierarchical tree of ND-arrays, widely used scientific file format |
|
An identifier registered on handle registry |
|
Indexed set of parquet files with versioning and diffs |
|
Image format with good compression for the internet |
|
Nested record format as readable text, very common over HTTP |
|
Keras model parameter set |
|
A value that can be embedded directly to YAML (text, dict, list) |
|
A single array in a .mat file |
|
Text format for sparse array |
|
Collection of ND-arrays with coordinates, scientific file format |
|
Medical imaging or volume data file |
|
Simple array format |
|
Columnar-optimized tabular binary file format |
|
Earth-science oriented searchable HTTP API |
|
Portable Network Graphics, common image format |
|
Column-optimized binary format |
|
Python pickle, arbitrary serialized object |
|
Monitoring metric query service |
|
Source code file |
|
A C or FORTRAN N-dimensional array buffer without metadata |
|
Trained model made by sklearn and saved as pickle |
|
Query on a database-like service |
|
Database data stored in files |
|
Data assets related to geo data, either as static JSON or a searchable API |
|
Datatypes loaded from some service |
|
Geo data (position and geometries) in a set of related binary files |
|
Tensorflow record file, ready for machine learning |
|
Datasets on a THREDDS server |
|
Image format commonly used for large data |
|
Service exposing versioned, chunked and potentially sparse arrays |
|
Data access service for data-aware portals and data science tools |
|
Waveform/sound file |
|
Extensible Markup Language file |
|
Human-readable JSON/object-like format |
|
Cloud optimised, chunked N-dimensional file format |
Reader Classes
Includes readers, transformers, converters and output classes.
|
Finds the earthdata datasets that contain some data in the given query bounds |
|
Read particular earthdata dataset by ID and parameter bounds |
|
Datasets from HuggingfaceHub |
|
Example datasets from sklearn.datasets |
|
Uses SQLAlchemy to get the list of tables at some SQL URL |
|
Searches stacindex.org for known public STAC data sources |
|
Create a Catalog from a STAC endpoint or file |
|
Get stac objects matching a search spec from a STAC endpoint |
|
Reimplementation of "StackBandsSource" from intake-stac |
|
Read from THREDDS endpoint |
|
Datasets from the TensorFlow public registry |
|
Creates a catalog of Tiled datasets from a root URL |
|
Standard example PyTorch datasets |
|
Converts from one object type to another |
|
Call given arbitrary function |
|
Holds a list of transforms/conversions to be enacted in sequence |
|
Implemented only if an attribute was not already chosen. |
|
creates one of several output file types |
|
Take a matplotlib figure and save to PNG file |
|
Save a single array into a single binary file |
|
good for including "peek" at data in entries' metadata |
|
Requires a directory with .npy files and an "info" pickle file |
|
Convenience superclass for readers of files |
|
Dereference handle (hdl:) identifiers |
|
Retry (part of) a pipeline until it returns without exception |
|
Equivalent of x[item] |
|
Call named method on object |