Catalog User
So someone has sent you an Intake URL or some other way to load a catalog. What happens next? Let’s do the simplest thing and load a public catalog:
cat = intake.from_yaml_file("s3://mymdtemp/intake_1.yaml", anon=True)
(This is the same catalog as the one made in the example tutorial notebook.)
Displaying the catalog shows that it has four named datasets and some automatically populated “user parameters”.
Interesting attributes of the catalog are:
cat.data
: full description of the original datasets
{'7e0b327a50eef58d': DataDescription type intake.readers.datatypes:CSV
kwargs {'metadata': {}, 'storage_options': {'anon': True}, 'url': '{CATALOG_DIR}/intake_1.csv'}}
cat.entries
: readers, ways to load the data
{'capitals': Entry for reader: intake.readers.convert:Pipeline
kwargs: {'out_instances': ['pandas:DataFrame', 'pandas:DataFrame', 'pandas:DataFrame', 'pandas:DataFrame'],
'steps': [
['{data(tute)}', [], {}],
['{func(intake.readers.transform:Method)}', [], {'method_name': 'a'}],
['{func(intake.readers.transform:Method)}', [], {'method_name': 'str'}],
['{func(intake.readers.transform:Method)}', [], {'method_name': 'capitalize'}]]}
producing: pandas:DataFrame,
'inverted': Entry for reader: intake.readers.convert:Pipeline
kwargs: {'out_instances': ['pandas:DataFrame', 'pandas:DataFrame'],
'steps': [
['{data(tute)}', [], {}],
['{func(intake.readers.transform:Method)}', [], {'args': ['b'], 'ascending': False, 'method_name': 'sort_values'}]]}
producing: pandas:DataFrame,
'multi': Entry for reader: intake.readers.convert:Pipeline
kwargs: {'out_instances': ['pandas:DataFrame', 'pandas:DataFrame'],
'steps': [
['{data(tute)}', [], {}],
['{func(intake.readers.transform:Method)}', [], {'c': '{data(capitals)}', 'method_name': 'assign'}]]}
producing: pandas:DataFrame,
'tute': Entry for reader: intake.readers.readers:PandasCSV
kwargs: {'data': '{data(7e0b327a50eef58d)}'}
producing: pandas:DataFrame}
cat.aliases
: names to associate with readers or data (these are the ones used with tab-completion)
{'capitals': 'capitals',
'inverted': 'inverted',
'multi': 'multi',
'tute': 'tute'}
cat.user_parameters
: values that can be used in templated values (see below)
{'CATALOG_PATH': 's3://mymdtemp/intake_1.yaml',
'CATALOG_DIR': 's3://mymdtemp',
'STORAGE_OPTIONS': {'anon': True}}
You can even get an overall view of everything in the catalog using cat.to_dict(), which gives you back essentially the same information as was contained in the YAML file we read the catalog from.
Also notice that there is metadata associated with the whole catalog and with each of the data and reader descriptions. All the readers depend on the one dataset (“multi” depends on it twice) and all are Pipelines (with “steps”) except “tute”.
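The claim that “multi” depends on the one dataset twice can be checked by following the {data(...)} references printed above. The following is a minimal sketch that works on a hand-copied, abbreviated version of the entries (nothing is fetched from S3 and Intake is not imported). The regular expression and the helper names find_refs and count_paths are assumptions made for illustration, based only on how the references appear in the printed output; they are not Intake's own parsing logic.

```python
import re

# Hand-copied and abbreviated from the cat.entries output above.
entries = {
    "capitals": {"steps": [
        ["{data(tute)}", [], {}],
        ["{func(intake.readers.transform:Method)}", [], {"method_name": "a"}],
    ]},
    "inverted": {"steps": [["{data(tute)}", [], {}]]},
    "multi": {"steps": [
        ["{data(tute)}", [], {}],
        ["{func(intake.readers.transform:Method)}", [],
         {"c": "{data(capitals)}", "method_name": "assign"}],
    ]},
    "tute": {"data": "{data(7e0b327a50eef58d)}"},
}

# Assumed reference syntax, inferred from the printed representation.
DATA_REF = re.compile(r"\{data\(([^)]+)\)\}")


def find_refs(obj):
    """Recursively collect the names referenced as {data(name)}."""
    if isinstance(obj, str):
        return DATA_REF.findall(obj)
    if isinstance(obj, dict):
        return [ref for value in obj.values() for ref in find_refs(value)]
    if isinstance(obj, (list, tuple)):
        return [ref for item in obj for ref in find_refs(item)]
    return []


# Direct references made by each entry.
direct = {name: find_refs(kwargs) for name, kwargs in entries.items()}


def count_paths(name, target):
    """Count the reference paths leading from entry `name` to `target`."""
    return sum(
        1 if ref == target else count_paths(ref, target)
        for ref in direct.get(name, [])
    )


for name in entries:
    print(name, direct[name], count_paths(name, "7e0b327a50eef58d"))
```

Here “multi” refers to both “tute” and “capitals”, and “capitals” itself refers to “tute”, so there are two paths from “multi” back to the single CSV dataset.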
Key Concepts
A few definitions that will help you:
data: a set of numbers of various forms, which can be used to infer information about some domain
dataset: a specific delimited amount of data, often a single file, a directory of files or a single request or query to some service. The output of any Intake reader is also a “dataset”, as represented in a live python session
data description: the basic information needed to uniquely identify a dataset, such as data type, URL/paths, server location, query. Intake supports many data types (Data Classes). A description ought to also contain descriptive information in its metadata.
reader: how a given dataset should be handled/loaded. This is more specific than the data itself, since there may be many different ways to read the data. For instance, CSVs are a very common and simple data format, and virtually all (table-oriented) data packages can read them.
pipeline: a sequence of operations on a dataset. In Intake, this is just a type of reader, although it is possible to refer to the output of any particular stage.
catalog: a collection of datasets and their reader descriptions. Each dataset may be referred to by multiple readers, and a reader may refer to multiple datasets, although the latter is less common (think of JOIN operations).
templates, user-parameters: in the catalog definition of the one dataset, you will notice special syntax for part of the URL value to be filled in. See below for how to work with this.
Reader API
Before accessing any of the entries in a catalog, you should introspect them to see whether they are what you are after. There should be descriptive text, other metadata and of course the contents of the data/readers, as shown above. It is important to note that extracting readers (the next step) already comes with security implications, such as evaluating environment variables and making imports. The “allow_*” keys in the Intake configuration, intake.conf, define what is generally allowed.
As an end-user, you will generally interact with readers. Get them from the catalog by attribute access or item access; the latter is required where the name is not a valid Python identifier or conflicts with a method. The following two lines are exactly equivalent:
reader = cat.tute
reader = cat["tute"]
reader.pprint()
{'kwargs': {'data': {'url': 's3://mymdtemp/intake_1.csv',
'storage_options': {'anon': True},
'metadata': {}}},
'metadata': {},
'output_instance': 'pandas:DataFrame'}
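To see why item access is needed for some names, here is a minimal sketch of a catalog-like object that supports both access styles. MiniCatalog and its contents are hypothetical and are not how Intake implements its catalogs; the sketch only shows the general Python mechanism, in which attribute access cannot express names such as "my-data" and loses out to a real method of the same name.

```python
class MiniCatalog:
    """Hypothetical stand-in for a catalog; not Intake's implementation."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def __getitem__(self, name):
        # Item access works for any string key.
        return self._entries[name]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real methods
        # such as keys() below take priority over entries of the same name.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._entries[name]
        except KeyError:
            raise AttributeError(name) from None

    def keys(self):
        return list(self._entries)


cat = MiniCatalog({
    "tute": "reader for tute",
    "my-data": "reader A",   # not a valid Python identifier
    "keys": "reader B",      # clashes with the keys() method
})

print(cat.tute == cat["tute"])  # both styles reach the same entry
print(cat["my-data"])           # cat.my-data would be parsed as a subtraction
print(cat["keys"])              # cat.keys is the method, not the entry
```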
Note
We will work on the best way to represent the various instances, especially in the notebook. For the time being, you can always use the .pprint() method, or introspect the instance’s attributes.
We notice that this is NOT exactly the same as the entry in the original catalog with name “tute”. In particular, the reader is a concrete instance of a subclass of intake.readers.readers.BaseReader, it contains the data definition it referenced, and the URL has been expanded to the full “s3://..”.
The most obvious thing to do with a reader is read: this is, after all, what readers are for. We already know to expect output of type pandas DataFrame. reader.doc() provides the docstring of the target function, in this case read_csv(). You can pass extra or overriding arguments, with exactly the same names and meanings as in the original docstring (some readers might provide extra functionality or possibilities).
reader.read()
Unnamed: 0 a b
0 0 ho 4
1 1 hi 5
reader(index_col=[0]).read()
a b
0 ho 4
1 hi 5
For large datasets, you may try .discover() instead, which generally returns a small subset of the data, depending on the format and library. For small datasets like this one, you get exactly the same output.
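The two patterns used above, calling a reader to get a copy with overridden arguments and asking for a cheap preview rather than a full read, can be sketched with the standard library alone. MiniCSVReader, its nrows argument and the inline CSV text are all hypothetical stand-ins chosen for this illustration; what a real Intake reader returns from .discover() depends on the format and library, as noted above.

```python
import csv
import io
from itertools import islice

# Inline copy of the small table shown above (first column has no name).
CSV_TEXT = ",a,b\n0,ho,4\n1,hi,5\n"


class MiniCSVReader:
    """Hypothetical stand-in for a reader; not part of Intake."""

    def __init__(self, text, **kwargs):
        self.text = text
        self.kwargs = kwargs

    def __call__(self, **overrides):
        # Return a new reader with merged arguments; the original is untouched.
        return MiniCSVReader(self.text, **{**self.kwargs, **overrides})

    def _rows(self):
        rows = csv.reader(io.StringIO(self.text))
        header = next(rows)
        for row in rows:
            yield dict(zip(header, row))

    def read(self):
        # nrows=None (the default) means "read everything".
        return list(islice(self._rows(), self.kwargs.get("nrows")))

    def discover(self, nrows=1):
        # A preview is just a read with a small row limit.
        return self(nrows=nrows).read()


reader = MiniCSVReader(CSV_TEXT)
print(len(reader.read()))      # all rows
print(len(reader.discover()))  # only the first row
print(reader.kwargs)           # the original reader keeps its own arguments
```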
Templates
Let us return to the mysterious “s3://” URL in the reader instance above. This was created from the URL “{CATALOG_DIR}/intake_1.csv” using templating. You may recall that the catalog had a user_parameter of this name, whose value was auto-populated from the URL we used to read the catalog file. This means that the data file and the catalog describing it could be moved together to a new location without having to edit the catalog. On the other hand, if the URL were not templated, a moved or copied catalog would still refer to the original data location (sometimes this is what you want).
This particular user_parameter is global to the catalog; to assign a new value before templating, you would do
cat2 = cat(CATALOG_DIR="new_value")
(so cat2.tute would now have a different data URL and no longer load!). It is also possible to have parameters associated with the data description and/or specific readers, and for any parameter to be used in multiple places. They can also have specific types, defaults and constraints/choices. If a template refers to a parameter that is missing or has no value set, it will be left unchanged, and the data in question will probably not load.
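The behaviour described in this section can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Intake's templating code: the helper names catalog_dir and fill_template and the {NAME} regular expression are assumptions, and real user parameters also carry types, defaults and constraints, which are ignored here.

```python
import re


def catalog_dir(catalog_path):
    """Directory part of a catalog URL: everything before the last '/'."""
    return catalog_path.rsplit("/", 1)[0]


# Assumed placeholder syntax: a bare name in curly braces.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template, parameters):
    """Replace {NAME} with its value; leave missing or unset names unchanged."""
    def substitute(match):
        value = parameters.get(match.group(1))
        return match.group(0) if value is None else str(value)
    return PLACEHOLDER.sub(substitute, template)


params = {"CATALOG_PATH": "s3://mymdtemp/intake_1.yaml"}
params["CATALOG_DIR"] = catalog_dir(params["CATALOG_PATH"])

url = "{CATALOG_DIR}/intake_1.csv"
print(fill_template(url, params))
# s3://mymdtemp/intake_1.csv
print(fill_template(url, {**params, "CATALOG_DIR": "new_value"}))
# new_value/intake_1.csv
print(fill_template("{MISSING}/intake_1.csv", params))
# {MISSING}/intake_1.csv
```

The second call mirrors cat(CATALOG_DIR="new_value"): the template is the same, but the resulting URL points somewhere the data does not exist. The third shows a template left unchanged because its parameter has no value.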