:orphan:

EDA, Feature Engineering, and Modeling With Papermill
=====================================================

Exploratory Data Analysis (EDA) is the process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

EDA cannot be implemented in Flyte alone because it requires visual analysis of the data.
In such scenarios, a Jupyter notebook is the natural choice, as it helps visualize and feature engineer the data.

**Now the question is, how do we leverage the power of Jupyter Notebook within Flyte to perform EDA on the data?**

Papermill
---------

`Papermill `__ is a tool for parameterizing and executing Jupyter Notebooks. Papermill lets you:

- parameterize notebooks
- execute notebooks

Flyte ships a pre-packaged version of Papermill as a plugin, which lets you run Jupyter notebooks within Flyte pipelines.

To install the plugin, run the following command:

.. prompt:: bash $

   pip install flytekitplugins-papermill

Examples
--------

There are three code examples that you can refer to in this tutorial:

- Run the whole pipeline (EDA + Feature Engineering + Modeling) in one notebook
- Run EDA and feature engineering in one notebook, fetch the result (the EDA'ed and feature-engineered dataset), and model the data in a Flyte task by sending the dataset as an argument
- Run EDA and feature engineering in one notebook, fetch the result (the EDA'ed and feature-engineered dataset), and model the data in another notebook by sending the dataset as an argument

Notebook Etiquette
^^^^^^^^^^^^^^^^^^

- If you want to send inputs and receive outputs, your Jupyter notebook has to have ``parameters`` and ``outputs`` tags, respectively. To set up tags in a notebook, follow this `guide `__.
- The ``parameters`` cell must contain only the input variables.
- The ``outputs`` cell looks like the following:

  .. code-block:: python

     from flytekitplugins.papermill import record_outputs

     record_outputs(variable_name=variable_name)

  Of course, you can have any number of variables!
- The ``inputs`` and ``outputs`` variable names in the ``NotebookTask`` must match the variable names in the notebook.

.. note::

   Running the Python code files produces three outputs, although the notebook returns a single output. The two additional outputs are the executed notebook and the rendered HTML of the notebook.
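The tag requirements above can be checked before a notebook is wired into a ``NotebookTask``. The following is a minimal sketch, using only the Python standard library, that looks for the ``parameters`` and ``outputs`` tags in a notebook's JSON. It assumes the nbformat v4 layout, where tags are stored as a list under each cell's ``metadata``. The helper names (``load_notebook``, ``find_tagged_cells``, ``missing_tags``), the sample notebook, and the variable names ``n_estimators`` and ``mae_score`` are hypothetical and are not part of Flyte or Papermill.

.. code-block:: python

   import json

   REQUIRED_TAGS = ("parameters", "outputs")


   def load_notebook(path: str) -> dict:
       """Read an .ipynb file, which is plain JSON, into a dictionary."""
       with open(path, encoding="utf-8") as handle:
           return json.load(handle)


   def find_tagged_cells(notebook: dict) -> dict:
       """Map each required tag to the indices of the cells that carry it."""
       found = {tag: [] for tag in REQUIRED_TAGS}
       for index, cell in enumerate(notebook.get("cells", [])):
           # Assumption: nbformat v4 keeps tags in cell["metadata"]["tags"].
           for tag in cell.get("metadata", {}).get("tags", []):
               if tag in found:
                   found[tag].append(index)
       return found


   def missing_tags(notebook: dict) -> list:
       """Return the required tags that no cell in the notebook carries."""
       tagged = find_tagged_cells(notebook)
       return [tag for tag in REQUIRED_TAGS if not tagged[tag]]


   # Hypothetical in-memory notebook, shaped like the JSON inside an .ipynb file.
   example_notebook = {
       "nbformat": 4,
       "nbformat_minor": 5,
       "metadata": {},
       "cells": [
           {
               "cell_type": "code",
               "metadata": {"tags": ["parameters"]},
               "source": "n_estimators = 150",
           },
           {
               "cell_type": "code",
               "metadata": {},
               "source": "mae_score = 0.0  # stand-in for the modeling code",
           },
           {
               "cell_type": "code",
               "metadata": {"tags": ["outputs"]},
               "source": (
                   "from flytekitplugins.papermill import record_outputs\n"
                   "record_outputs(mae_score=mae_score)"
               ),
           },
       ],
   }

   print(find_tagged_cells(example_notebook))
   print(missing_tags(example_notebook))

To check a notebook on disk, pass the result of ``load_notebook("path/to/notebook.ipynb")`` to ``missing_tags``; an empty list means both tags are present. The check only confirms that the tags exist, not that the variable names match the ``NotebookTask`` definition.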
.. only:: html

   .. image:: /auto/case_studies/feature_engineering/eda/images/thumb/sphx_glr_notebook_thumb.png
      :alt: Flyte Pipeline in One Jupyter Notebook

   :ref:`sphx_glr_auto_case_studies_feature_engineering_eda_notebook.py`

.. only:: html

   .. image:: /auto/case_studies/feature_engineering/eda/images/thumb/sphx_glr_notebook_and_task_thumb.png
      :alt: EDA and Feature Engineering in Jupyter Notebook and Modeling in a Flyte Task

   :ref:`sphx_glr_auto_case_studies_feature_engineering_eda_notebook_and_task.py`

.. only:: html

   .. image:: /auto/case_studies/feature_engineering/eda/images/thumb/sphx_glr_supermarket_regression_2_thumb.png
      :alt: Supermarket Regression 2 Notebook

   :ref:`sphx_glr_auto_case_studies_feature_engineering_eda_supermarket_regression_2.py`

.. only:: html

   .. image:: /auto/case_studies/feature_engineering/eda/images/thumb/sphx_glr_supermarket_regression_thumb.png
      :alt: Supermarket Regression Notebook

   :ref:`sphx_glr_auto_case_studies_feature_engineering_eda_supermarket_regression.py`

.. only:: html

   .. image:: /auto/case_studies/feature_engineering/eda/images/thumb/sphx_glr_notebooks_as_tasks_thumb.png
      :alt: EDA and Feature Engineering in One Jupyter Notebook and Modeling in the Other

   :ref:`sphx_glr_auto_case_studies_feature_engineering_eda_notebooks_as_tasks.py`

.. only:: html

   .. image:: /auto/case_studies/feature_engineering/eda/images/thumb/sphx_glr_supermarket_regression_1_thumb.png
      :alt: Supermarket Regression 1 Notebook

   :ref:`sphx_glr_auto_case_studies_feature_engineering_eda_supermarket_regression_1.py`
.. toctree::
   :hidden:

   /auto/case_studies/feature_engineering/eda/notebook
   /auto/case_studies/feature_engineering/eda/notebook_and_task
   /auto/case_studies/feature_engineering/eda/supermarket_regression_2
   /auto/case_studies/feature_engineering/eda/supermarket_regression
   /auto/case_studies/feature_engineering/eda/notebooks_as_tasks
   /auto/case_studies/feature_engineering/eda/supermarket_regression_1

.. only:: html

   .. container:: sphx-glr-footer sphx-glr-footer-gallery

      .. container:: sphx-glr-download sphx-glr-download-python

         :download:`Download all examples in Python source code: eda_python.zip `

      .. container:: sphx-glr-download sphx-glr-download-jupyter

         :download:`Download all examples in Jupyter notebooks: eda_jupyter.zip `

.. only:: html

   .. rst-class:: sphx-glr-signature

      `Gallery generated by Sphinx-Gallery `_