Quickstart
This guide will show you how to get started using Intake to read data, and give you a flavour of how Intake feels to the Data User. It assumes you are working in either a conda or a virtualenv/pip environment. For notebooks with executable code, see the Examples. This walk-through can be run from a notebook or interactive python session.
Installation
If you are using Anaconda or Miniconda, install Intake with the following commands:
conda install -c conda-forge intake
If you are using virtualenv/pip, run the following command:
pip install intake
Note that this installs with only the minimum of optional requirements. If you want a more complete install, use pip install intake[complete] instead.
Creating Sample Data
Let’s begin by creating a sample data set and catalog. At the command line, run:
intake example
This will create an example data Catalog and two CSV data files. These files contain some basic facts about the 50 US states, and the catalog includes a specification of how to load them.
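If you list the working directory after running the command, you should see the catalog file alongside the CSV files. The exact CSV file names below are illustrative, but they match the states_*.csv pattern used throughout this guide:
$ ls
states_1.csv  states_2.csv  us_states.yml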
Loading a Data Source
Data sources can be created directly with the open_*() functions in the intake module. To read our example data:
>>> import intake
>>> ds = intake.open_csv('states_*.csv')
>>> print(ds)
<intake.source.csv.CSVSource object at 0x1163882e8>
Each open function has different arguments, specific to the data format or service being used.
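Since the available open_*() functions depend on which driver plugins are installed, one quick way to see what your installation provides is to inspect the intake namespace; the exact list you get will vary with your environment:
>>> [name for name in dir(intake) if name.startswith('open_')]
['open_catalog', 'open_csv', ...]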
Reading Data
Intake reads data into memory using containers you are already familiar with:
Tables: Pandas DataFrames
Multidimensional arrays: NumPy arrays
Semistructured data: Python lists of objects (usually dictionaries)
To find out what kind of container a data source will produce, inspect the container attribute:
>>> ds.container
'dataframe'
The result will be dataframe, ndarray, or python. (New container types will be added in the future.)
For data that fits in memory, you can ask Intake to load it directly:
>>> df = ds.read()
>>> df.head()
state slug code nickname ...
0 Alabama alabama AL Yellowhammer State
1 Alaska alaska AK The Last Frontier
2 Arizona arizona AZ The Grand Canyon State
3 Arkansas arkansas AR The Natural State
4 California california CA Golden State
Many data sources will also have quick-look plotting available. The attribute .plot will list a number of built-in plotting methods, such as .scatter(); see Plotting.
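For instance, assuming the optional hvplot/holoviews plotting dependencies are installed, a quick scatter of two numeric columns from this data set might look like the following sketch (the column names are taken from the dataframe structure shown later in this guide):
>>> ds.plot.scatter(x='population_rank', y='population')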
Intake data sources can have partitions. A partition refers to a contiguous chunk of data that can be loaded independently of any other partition. The partitioning scheme is entirely up to the plugin author. In the case of the CSV plugin, each .csv file is a partition.
To read data from a data source one chunk at a time, the read_chunked() method returns an iterator:
>>> for chunk in ds.read_chunked(): print('Chunk: %d' % len(chunk))
...
Chunk: 24
Chunk: 26
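A single partition can also be loaded by index with the read_partition() method. Assuming partitions come back in the same order as the chunks above, the first partition here corresponds to the first CSV file:
>>> part = ds.read_partition(0)
>>> len(part)
24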
Working with Dask
Working with large datasets is much easier with a parallel, out-of-core computing library like Dask. Intake can create Dask containers (like dask.dataframe) from data sources that will load their data only when required:
>>> ddf = ds.to_dask()
>>> ddf
Dask DataFrame Structure:
admission_date admission_number capital_city capital_url code constitution_url facebook_url landscape_background_url map_image_url nickname population population_rank skyline_background_url slug state state_flag_url state_seal_url twitter_url website
npartitions=2
object int64 object object object object object object object object int64 int64 object object object object object object object
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: from-delayed, 4 tasks
The Dask containers will be partitioned in the same way as the Intake data source, allowing different chunks to be processed in parallel. Please read the Dask documentation to understand the differences when working with Dask collections (Bag, Array or DataFrame).
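As a brief sketch of what lazy evaluation looks like here, nothing is actually read until .compute() is called; the column names below come from the structure shown above, and the resulting rows are not reproduced here:
>>> ddf[['state', 'population']].nlargest(5, 'population').compute()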
Opening a Catalog
A Catalog is an inventory of data sources, with the type and arguments prescribed for each, and arbitrary metadata about each source. In the simplest case, a catalog can be described by a file in YAML format, a “Catalog file”. In real usage, catalogs can be defined in a number of ways: as remote files, by connecting to a third-party data service (e.g., a SQL server), or through the Intake Server protocol, which can implement any number of ways to search and deliver data sources.
The intake example command, above, created a catalog file with the following YAML-syntax content:
sources:
  states:
    description: US state information from [CivilServices](https://civil.services/)
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/states_*.csv'
    metadata:
      origin_url: 'https://github.com/CivilServiceUSA/us-states/blob/v1.0.0/data/states.csv'
To load a Catalog from a Catalog file:
>>> cat = intake.open_catalog('us_states.yml')
>>> list(cat)
['states']
This catalog contains one data source, called states. It can be accessed by attribute:
>>> cat.states.to_dask()[['state','slug']].head()
state slug
0 Alabama alabama
1 Alaska alaska
2 Arizona arizona
3 Arkansas arkansas
4 California california
Placing data source specifications into a catalog like this enables declaring data sets in a single canonical place, and not having to use boilerplate code in each notebook/script that makes use of the data. The catalogs can also reference one another, be stored remotely, and include extra metadata such as a set of named quick-look plots that are appropriate for the particular data source. Note that catalogs are not restricted to being stored in YAML files; that just happens to be the simplest way to display them.
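As a hedged illustration of catalogs referencing one another, a catalog entry can itself point at another catalog file. The yaml_file_cat driver is the one Intake registers for YAML catalogs; the nested file name here is a placeholder:
sources:
  other_data:
    description: a nested catalog, loaded on demand
    driver: yaml_file_cat
    args:
      path: '{{ CATALOG_DIR }}/other_catalog.yml'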
Many catalog entries will also contain “user_parameter” blocks, which are indications of options explicitly allowed by the catalog author, or for validation of the values passed. The user can customise how a data source is accessed by providing values for the user_parameters, overriding the arguments specified in the entry, or passing extra keyword arguments to be passed to the driver. The keywords that should be passed are limited to the user_parameters defined and the inputs expected by the specific driver; such usage is expected only from those already familiar with the specifics of the given format. In the following example, the user overrides the “csv_kwargs” keyword, which is described in the documentation for CSVSource and gets passed down to the CSV reader:
# pass extra kwargs understood by the csv driver
>>> intake.cat.states(csv_kwargs={'header': None, 'skiprows': 1}).read().head()
0 1 ... 17
0 Alabama alabama ... https://twitter.com/alabamagov
1 Alaska alaska ... https://twitter.com/alaska
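For reference, a user_parameters block in a catalog entry follows the pattern sketched below. The region parameter and its use in the urlpath are purely illustrative, not part of the example catalog above; each parameter declares a description, a type and a default, and may be overridden by the user when the source is opened:
sources:
  states:
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/states_{{ region }}.csv'
    parameters:  # the "user_parameters" block
      region:
        description: which states file to load
        type: str
        default: '*'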
Note that, if you are creating such catalogs, you may well start by trying the open_csv command, above, and then use print(ds.yaml()). If you do this now, you will see that the output is very similar to the catalog file we have provided.
Installing Data Source Packages
Intake makes it possible to create data packages (pip or conda) that install data sources into a global catalog. For example, we can install a data package containing the same data we have been working with:
conda install -c intake data-us-states
Conda installs the catalog file in this package to $CONDA_PREFIX/share/intake/us_states.yml.
Now, when we import intake, we will see the data from this package appear as part of a global catalog called intake.cat. In this particular case we use Dask to do the reading (which can handle larger-than-memory data and parallel processing), but read() would also work:
>>> import intake
>>> intake.cat.states.to_dask()[['state','slug']].head()
state slug
0 Alabama alabama
1 Alaska alaska
2 Arizona arizona
3 Arkansas arkansas
4 California california
The global catalog is a union of all catalogs installed in the conda/virtualenv environment and also any catalogs installed in user-specific locations.
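A quick way to see what that union currently contains is simply to list the global catalog; the entries you see will depend on which data packages and user catalogs are installed in your environment:
>>> import intake
>>> list(intake.cat)
['states']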
Adding Data Source Packages using the Intake path
Intake checks the Intake config file for catalog_path, or the environment variable "INTAKE_PATH", for a colon-separated list of paths (semicolon on Windows) to search for catalog files.
When you import intake, all entries from all of the catalogs referenced there will appear as part of the global catalog intake.cat.
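For example, on a Unix-like system you might set the environment variable before starting Python; the directories here are placeholders:
export INTAKE_PATH=/home/user/my_catalogs:/shared/catalogs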