Python In Excel. What impact will it have? Will it be beneficial for Machine learning Engineers and Data Scientists?

What has been announced?

On the 22nd August, Microsoft made a major announcement. They would be implementing Python in Excel.

At the time of writing, this has only been released in a Beta test environment and is only available to those on the Microsoft 365 Insiders program.

The announcements do however give us a tantalising glimpse of the functionality and allow us to build a first impression of what this capability may offer users.

The Implementation

The first thing to note, is that the excel Python deployment has been built on the Anaconda instance of Python and will run in the Microsoft cloud. This means that users will not need to have Python installed locally and the vast range of data specific modules in Anaconda (Matplotlib, sci-kit learn, re, Pandas, Numpy…) should be available for use. Users will be able to implement multiple lines of Python code called in Excel via the new PY function. This coding environment lets users reference ranges in the data set via the XL function providing what we can only assume will be array like objects for use in code. The PY function allows the return to either be a value (i.e. a function output) or a python object (i.e. a Matplot lib plot)

What do we think this will enable?

Advanced Visualizations

Excel has always had very good graphing capability, however the graphs have been limited to the inbuilt selection available. While the range of inbuilt graphs has evolved over time, it is not uncommon to find situations where you want a different plot.

The inclusion of the comprehensive Matplotlib and Seaborne libraries takes massive strides towards closing this gap. It will provide significantly more base plots to work with; allow users to produce highly customised visualisations that can be updated from data in the workbook; and easily share them in an Excel workbook with other users in their organisation.

Machine Learning, Predictive Analytics, and Forecasting

The inclusion of the Scikit learn library will allow users to create and deploy machine learning models within an Excel workbook. Such functionality will dramatically enhance the range of tools available to an analyst and improve the tools available to a business.

Unsupervised machine learning algorithms (such as k means clustering) will allow analysists to look at data in new ways and find new insights. Supervised machine learning will also help with building predictive and forecasting models and tools in Excel. Both offer significant enhancements to the tools available to Excel analysts today.

Other Significant New Functionality

In theory, the ability to access the Anaconda build of Python from within Excel, opens a vast toolset of data science and analysis features. Microsoft have only alluded to a few in the release, but the potential tool list is now extensive. There are far too many to note in this short article (a full list can be found here), however there are a few of particular interest.

The inclusion of the RE package will allow regular expressions to be implemented within Excel enabling free text to be processed with ease. NLTK (Natural language tool kit) will bring advanced sentiment analysis tools and the data science work horses of Pandas and Numpy will potentially offer the possibility of speeding up the processing of large data sets.

We will have to wait to get our hands on the new PY function before we understand how many of these are available in Excel and exactly how they are implemented, but the potential is exciting.

Questions still to be answered?

From what we can see at this stage there are still several outstanding questions or potential issues. While some of these may get resolved and cleared up in Beta and early deployment, it is likely that some of these will pervade.

Loading External Inputs

We do not know at this stage if it will be possible to load external libraries, data and inputs into the Microsoft cloud environment that Python will be running from. If this is not possible, then there are some possible impacts.

Any packages not in the Anaconda environment will not be accessible. While the available sources are extensive, it is not uncommon to find a useful tool in Python that is not in the Anaconda set up.

We do not know if it will be possible to load pre trained models into an Excel file or if an algorithm will need to train itself each time the workbook is opened. If we are in the latter situation, then we will need to take care when randomly seeding some models to ensure some models do not output differing results at differing times from the same data. We may also hit issues with having to circulate models will large training data sets hidden in the background.

We also do not know if we will be able to load external data into the file. If this is possible then it may alleviate some of the potential issues with training data file size. The same questions are also posed when we consider user written modules. Many use their own modules to carry our regularly performed tasks or to house their own bespoke models. If this feature is not accessible, then for some, the deployment may have its limits.

Transferability of code

This is fundamentally a deployment of Python, however there will be a few Excel specific nuances contained within the written code. Specifically, the references to data are made via the XL function. This function will have no specific meaning outside of XL and will not assign data to a variable in an IDE (such as VS code) or in a Jupyter notebook.

This will create a small nuance where code may not copy directly from excel to other deployments of python without adaption. Hopefully the differences are limited to assigning data, however it is plausible that code that works in Excel will not work elsewhere.

How effective will this be for repeatable data cleaning

Microsoft are still pushing Power Query as the tool for extracting, transforming, and loading data.

In our earlier article, Supercharge Your Excel Data Cleaning With Python, we assessed the pros and cons of using Power Query to clean data in Excel. We argued that the use of Python to clean data was preferable, and future proofed an analyst’s career.

We do not see any evidence at this stage to change this recommendation and still firmly believe that advanced data cleaning is still best carried out in Python. We don’t know at this stage if the new PY function will enable us to clean data within Excel or if we will still need to extract and push back to Excel. It is a somewhat moot discussion as the code implemented to clean the data will be practically the same in both cases.

Python knowledge

Analysts will need Python knowledge to use this deployment and maximise its benefits for stakeholders. This is not a bad thing and any analyst who aspires to work in data science or machine learning roles will need to demonstrate knowledge of programming. This potentially provides a new way to do that.

There will no doubt be an increase in copy and paste cookbooks offering complex code that drops straight into Excel, however you will still need to understand what is being deployed to ensure it is implemented correctly.

Python is an easy language to learn, and an adequate knowledge should not be above any experienced analysist. There is however a learning curve and all such task take time.

Mathematical Knowledge

This deployment makes it easier for analysts to deploy and share advanced analysis and tools. Such tools are often used for decision making or improving understanding. Selecting the right tool for the job at hand is something the analysist must consider during development. This ensures that the tool developed provides reliable analysis to make inferences from.

There are now a vast range of machine learning and statistical analysis tools at the fingertips of analysists. The simplicity of the Python API to implement these means that they can be implemented in a few lines of code and the complex mathematics and underlying process hidden from all. This may result in tools being deployed that analysists do not fully understand; do not generalise well and provide output that fits an expected answer. This is a potentially dangerous situation for all.

Final Thoughts

This is an exciting enhancement to Excel and will dramatically increase the functionality. By aligning with Anaconda, Microsoft will ensure they stay up to date with the latest tools and techniques. However, we must remember that Excel is an analysis tool designed for data analysists and high end data science and machine learning solutions will most likely be impractical in Excel.

The ability to develop self-contained tools and models which use Python code will provide new and insightful analysis. It also offers the ability to rapidly spin up tools and work with users to ensure they are fit for purpose, which will fit well in Agile developments. Such tools can also provide a bridge between sticking plaster tactical solutions and costly strategic solutions.

Analysts will need to familiarise themselves with Python as well as gain an adequate mathematical knowledge to implement tools correctly. Shortcutting this process may result in incorrect tools being deployed or biased decisions being made.

Machine learning tasks which require vast data sets and significant processing power will most probably be impractical in this deployment. I strongly suspect that any solution where complex modelling or deep learning is being carried out will be best implemented in a dedicated Python Environment. I would also suspect that this will be the case for any solution where more than a few lines of code need to be implemented.

While this deployment will help data analysists, I strongly suspect that dedicated data scientists and machine learning engineers will still prefer to work in dedicated environments.

We will know more once the deployment goes fully live…to be continued!

Python In Excel, What Impact Will It Have?