The hvPlot website provides comprehensive documentation on using the plotting API to quickly visualize and explore small and large datasets. The main features offered by the plotting API include:
- Support for tabular data stored in pandas and dask dataframes
- Support for gridded data stored in xarray-backed nD-arrays
- Support for plotting large datasets with datashader
Using Intake alongside hvPlot makes it possible to persist plot declarations and default options declaratively in regular catalog.yaml files.
For detailed installation instructions, see the getting started section in the hvPlot documentation. To start, install hvplot using conda:
conda install -c pyviz hvplot
or using pip:
pip install hvplot
The plotting API is designed to work well both inside and outside the Jupyter notebook. When using it in JupyterLab, however, the PyViz lab extension must be installed first:
jupyter labextension install @pyviz/jupyterlab_pyviz
For detailed instructions on displaying plots in the notebook and from the Python command prompt, see the hvPlot user guide.
Python Command Prompt & Scripts
Assuming the US Crime dataset has been installed (from this repo’s examples/ directory, or from conda with conda install -c intake us_crime), the plot API can be used by calling the .plot method on an Intake data source:
import intake
import hvplot as hp

crime = intake.cat.us_crime
columns = ['Burglary rate', 'Larceny-theft rate', 'Robbery rate', 'Violent Crime rate']

violin = crime.plot.violin(y=columns, group_label='Type of crime',
                           value_label='Rate per 100k', invert=True)
hp.show(violin)
Inside the notebook, plots display themselves, but the notebook extension must be loaded first. The extension may be loaded by importing the hvplot.intake module, by explicitly loading the holoviews extension, or by calling intake.output_notebook():
# To load the extension run this import
import hvplot.intake

# Or load the holoviews extension directly
import holoviews as hv
hv.extension('bokeh')

# convenience function
import intake
intake.output_notebook()

crime = intake.cat.us_crime
columns = ['Violent Crime rate', 'Robbery rate', 'Burglary rate']
crime.plot(x='Year', y=columns, value_label='Rate (per 100k people)')
Some catalogs define plots appropriate to a specific data source. These are specified so that the user gets the right view with the right columns and labels without having to investigate the data in detail. This is ideal for quick-look plotting when browsing sources.
import intake
intake.cat.us_crime.plots
This returns ['example']. It works whether accessing the entry object or the source instance. To visualise a predefined plot, call it by name on the plot API, e.g. intake.cat.us_crime.plot.example().
Intake allows catalog YAML files to declare metadata fields for each data source, which are made available alongside the actual dataset. The plotting API reserves certain fields to define default plot options, to label and annotate the data fields in a dataset, and to declare predefined plots.
The first set of metadata used by the plotting API is the plot field in the metadata section. Any options found in the plot field apply to all plots generated from that data source, which allows plotting defaults to be defined. For example, when plotting a fairly large dataset such as the NYC Taxi data, it might be desirable to enable datashader by default, ensuring that any plot that supports it is datashaded. The syntax to declare default plot options is as follows:
sources:
  nyc_taxi:
    description: NYC Taxi dataset
    driver: parquet
    args:
      urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
    metadata:
      plot:
        datashade: true
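The effect of source-level defaults can be pictured as a dictionary merge. The following is a minimal sketch, not the plotting API's actual implementation: the helper name resolve_options is hypothetical, and it assumes that options passed at call time take precedence over the defaults declared under plot.

```python
# Hypothetical sketch of how defaults under metadata['plot'] might combine
# with options given at call time. This is not the real hvPlot/Intake code.

def resolve_options(metadata, **call_options):
    """Merge source-level plot defaults with call-time options.

    Assumption: call-time options override the declared defaults.
    """
    defaults = metadata.get("plot", {})
    return {**defaults, **call_options}


# Mirrors the metadata section of the nyc_taxi entry.
nyc_taxi_metadata = {"plot": {"datashade": True}}

# The default applies when the caller says nothing about datashade.
scatter_options = resolve_options(
    nyc_taxi_metadata, kind="scatter", x="dropoff_x", y="dropoff_y"
)

# Under the stated assumption, a caller can still switch the default off.
plain_options = resolve_options(nyc_taxi_metadata, kind="scatter", datashade=False)

print(scatter_options)
print(plain_options)
```

Keeping the defaults in the catalog rather than in each call means every user of the source gets the same behaviour without repeating the option.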
Declaring data fields
The columns of a CSV or parquet file, or the coordinates and data variables in a NetCDF file, often have shortened or cryptic names with underscores. They also carry no information about the units of the data or the range of values. The catalog YAML specification therefore provides a way to define additional information about the fields in a dataset.
Valid attributes that may be defined for the data fields include:
- label: A readable label for the field which will be used to label axes and widgets
- unit: A unit associated with the values inside a data field
- range: A range associated with a field declaring limits which will override those computed from the data
Just like the default plot options, the fields may be declared under the metadata section of a data source:
sources:
  nyc_taxi:
    description: NYC Taxi dataset
    driver: parquet
    args:
      urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
    metadata:
      fields:
        dropoff_x:
          label: Longitude
        dropoff_y:
          label: Latitude
        total_fare:
          label: Fare
          unit: $
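As a rough illustration of how such annotations could be consumed, the sketch below builds an axis label and axis limits from the fields metadata. The function names axis_label and axis_limits, the "label (unit)" format and the tip_amount field are assumptions made for this example; the plotting API may format labels differently.

```python
def axis_label(column, fields):
    """Return a readable label for a column, falling back to its raw name."""
    info = fields.get(column, {})
    label = info.get("label", column)
    unit = info.get("unit")
    # Assumed format: append the unit in parentheses when one is declared.
    return f"{label} ({unit})" if unit else label


def axis_limits(column, fields, values):
    """Return the declared range if present, else the range of the data."""
    declared = fields.get(column, {}).get("range")
    if declared is not None:
        return tuple(declared)
    return (min(values), max(values))


fields = {
    "dropoff_x": {"label": "Longitude"},
    "dropoff_y": {"label": "Latitude"},
    "total_fare": {"label": "Fare", "unit": "$"},
    # Hypothetical field, added only to show a declared 'range'.
    "tip_amount": {"label": "Tip", "unit": "$", "range": [0, 50]},
}

print(axis_label("total_fare", fields))
print(axis_limits("tip_amount", fields, [1.0, 120.0]))
```

A declared range wins over the limits computed from the data, which matches the description of the range attribute; columns without any metadata keep their raw names.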
Declaring custom plots
As shown in the hvPlot user guide, the plotting API provides a variety of plot types, which can be declared using the kind argument or via convenience methods on the plotting API, e.g. cat.source.plot.scatter(). In addition to declaring default plot options and field metadata, data sources may also declare custom plots, which are made available as methods on the plotting API. In this way a catalog may declare any number of custom plots alongside a data source.
To make this more concrete, consider the following custom plot declaration on the plots field in the metadata section:
sources:
  nyc_taxi:
    description: NYC Taxi dataset
    driver: parquet
    args:
      urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
    metadata:
      plots:
        dropoff_scatter:
          kind: scatter
          x: dropoff_x
          y: dropoff_y
          datashade: True
          width: 800
          height: 600
This declarative specification creates a new custom plot called dropoff_scatter, which will be available on the catalog under cat.nyc_taxi.plot.dropoff_scatter(). Calling this method on the plot API will automatically generate a datashaded scatter plot of the dropoff locations in the NYC taxi dataset.
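One way to picture how a declared plot becomes a method is attribute lookup on the plot accessor. The class below is a simplified, hypothetical stand-in (PlotAccessorSketch is not part of Intake or hvPlot): instead of rendering anything, each generated method returns the options that would be handed to the plotting backend. It also assumes that keyword arguments given at call time override the declared ones.

```python
class PlotAccessorSketch:
    """Hypothetical stand-in for the .plot accessor of a data source."""

    def __init__(self, metadata):
        self._plots = metadata.get("plots", {})

    @property
    def names(self):
        # Names of the custom plots declared in the catalog.
        return list(self._plots)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        plots = self.__dict__.get("_plots", {})
        if name not in plots:
            raise AttributeError(f"no custom plot named {name!r}")
        declared = plots[name]

        def custom_plot(**overrides):
            # Assumption: call-time keywords override declared options.
            return {**declared, **overrides}

        return custom_plot


metadata = {
    "plots": {
        "dropoff_scatter": {
            "kind": "scatter",
            "x": "dropoff_x",
            "y": "dropoff_y",
            "datashade": True,
            "width": 800,
            "height": 600,
        }
    }
}

plot = PlotAccessorSketch(metadata)
options = plot.dropoff_scatter(width=400)
print(plot.names)
print(options)
```

Resolving names at lookup time means a catalog can add or rename plots without any change to the Python side.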
Of course the three metadata fields may also be used together, declaring global defaults under the plot field, annotations for the data fields under the fields key and custom plots via the plots field.
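Because the three keys live side by side in the same metadata mapping, a catalog author may want a quick sanity check before publishing. The following sketch is hypothetical (check_metadata is not an Intake function) and treats label, unit and range as the only field attributes, which is stricter than the text above requires. It also assumes that a range is a pair of lower and upper limits.

```python
FIELD_ATTRIBUTES = {"label", "unit", "range"}


def check_metadata(metadata):
    """Return a list of human-readable problems found in plotting metadata."""
    problems = []
    for column, attrs in metadata.get("fields", {}).items():
        unknown = sorted(set(attrs) - FIELD_ATTRIBUTES)
        if unknown:
            problems.append(f"field {column!r}: unknown attributes {unknown}")
        limits = attrs.get("range")
        # Assumption: a range is a (lower, upper) pair.
        if limits is not None and len(limits) != 2:
            problems.append(f"field {column!r}: range needs a lower and an upper limit")
    return problems


# All three metadata keys used together, written as the equivalent Python dict.
combined = {
    "plot": {"datashade": True},
    "fields": {
        "dropoff_x": {"label": "Longitude"},
        "dropoff_y": {"label": "Latitude"},
        "total_fare": {"label": "Fare", "unit": "$"},
    },
    "plots": {
        "dropoff_scatter": {"kind": "scatter", "x": "dropoff_x", "y": "dropoff_y"},
    },
}

# A deliberately broken example: misspelled attribute and a one-element range.
broken = {"fields": {"total_fare": {"lable": "Fare", "range": [0]}}}

print(check_metadata(combined))
print(check_metadata(broken))
```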