Beyond Spreadsheets

It’s common in tech spheres to separate coders from business people. Coders are the software developers who create solutions using computer programming code. Business people then use these solutions to create and operate business processes.

But this distinction is blurrier than you might think—and only growing less clear as more jobs require data-driven decision-making.

Millions of business people, especially knowledge workers, write code every day—they just use spreadsheets as their primary integrated development environment (IDE). Cell formulas are their programming language of choice. If you’ve mastered spreadsheet functions like VLOOKUP, or SUMIF, you’re on your way to learning fancy functional programming—perhaps without realizing it. And even if you haven’t, rest assured that someone in your company is probably doing these very things right now to accomplish something important.

Spreadsheet-based business processes may be the liberator of business users (“we don’t need to wait for tech!”), but they can also be the bane of IT (“How do we govern this?”). Unofficial data flows often involve some kind of hacked-together process involving spreadsheets and a variety of software-as-a-service business tools. Even sophisticated data visualization tools such as Looker must include a feature for downloading data, or else spreadsheet coders would be stuck.

This situation is often an uncomfortable one for C-suite executives. In 2017 The Wall Street Journal ran a story about CFOs attempting to remove spreadsheets in favor of integrated business systems. There was so much reader feedback that a week later the paper ran a follow-up to summarize the pushback. The summary quoted a CPA who had tweeted “You can have my Excel, after you rip it from my cold, dead hands.” Clearly, people love their spreadsheets.

This is understandable: Spreadsheets are super powerful! Visualizing rows and columns of information in a grid is extremely logical, and the instant feedback from relatively complex chained formulas is powerful for most data analysis.

But spreadsheets have limits—not only from a business process and governance point of view, but also from a literal data analysis standpoint. Four challenges are worth noting in particular:

Data size
Linearity
Quality
Presentation and shareability

1. Data size

Spreadsheets are inherently limited in data size and are unable to process large amounts of data (though users do try). A single Excel sheet is limited to around one million rows. This sounds like a lot, but there are many real-world situations where the real data is larger than this. Even a midsize business may have more than a million transactions in their books, and larger ones may have well over a million customers, especially over time and across product lines. You’d need multiple linked worksheets to analyze all of them granularly without summarizing first.

Performance degrades rapidly with these large spreadsheets. Some users adopt complex workarounds, such as turning off automatic calculations, so that they can update data cells without the software struggling to keep up. When they’re done, they manually click the “Calculate Now” button—and go to lunch (sometimes a very long one) while the numbers crunch.

One way to mitigate data size limitations is to connect spreadsheets to enterprise data warehouses—databases designed to handle large amounts of data and make it available for analysis. For example, Google Sheets users can connect Sheets to BigQuery, and Microsoft Power Query for Excel can connect to various external data sources. This is often a good starting point, but once source data is brought into the spreadsheet, the same data size limitations often apply.

2. Linearity

Spreadsheets often contain nonlinear, nested analysis that’s difficult to follow or understand. It’s not unusual to find a formula like this:

=VLOOKUP(D4,IF(C4=”East”,Table1,Table2,2,TRUE)

Unwinding code like this can take a lot of reverse engineering, clicking into each cell formula to trace through the calculations. When these formulas are located on multiple worksheets that tie together, the reviewer needs to create an intricate 3D mental map to fully understand the analysis.

3. Quality

Unfortunately, spreadsheet calculation errors are all too common. One analysis of multiple studies estimated that 88% of spreadsheets contain “significant” errors. Cells may be inadvertently hidden instead of deleted, data may be copied and pasted incorrectly, and formulas may be used inappropriately. In some cases, enormous financial losses have been attributed to spreadsheet errors.

4. Presentation and shareability

It can be difficult to use just a spreadsheet to create an analysis that’s ready to present and share. Cleanly incorporating descriptive headers, detailed commentary, explanatory graphics, and data visualizations into a spreadsheet is tough. This becomes especially difficult when there are multiple sections of analysis.

Of course, users can import graphics into spreadsheets, render charts, and format cells as headers, footnotes, and other explanatory text. But this approach is, at best, unwieldy. Objects such as images and charts either float in the middle of a data sheet or live on their own as a separate sheet within the workbook. Both of these approaches are often quite disconnected visually from the appropriate section of analysis. Moreover, text cells are usually quite limited in formatting options, often requiring nearly endless tweaking to “get the formatting just right” for presentation.

It’s no wonder most users wind up exporting spreadsheet elements into separate presentation software such as Google Slides or Microsoft PowerPoint. But even this workaround isn’t a perfect solution, as it can result in “versionitis”—a malady in which the spreadsheet contains different versions of analysis components than the presentation. Object linking can help here, but efficiency often suffers.

Notebooks

Some organizations have turned to other tools, notably Jupyter Notebooks, to extend their data analysis capabilities. Notebooks are web documents containing text, graphics, and code cells that can be saved as files. When loaded in a browser, they can connect to a “kernel”—a chunk of computer processing power that provides CPU, RAM, networking, and so on. This kernel can physically live on the same computer as a web document (local hosting) or it can be in the cloud.

Notebooks embrace the interactive, visual experience of spreadsheets, but leverage more robust, extensible, open-source tools to provide additional functionality that enhances quality of analysis and improves collaboration. Early adopters include academia, tech, and finance.

For example, at Google, thousands of employees use Colaboratory (Colab), a notebook tool based on Jupyter, to analyze all kinds of data. This tool set became so popular that Google decided to offer Colab publicly. Demonstrating the growing appeal of notebooks, Colab recently introduced a pro version that offers more power and serves as a step between the public version and AI Platform Notebooks, a managed, enterprise-grade notebook solution. Notebook solutions such as these are helping enterprises quickly bypass the limitations of spreadsheets, letting them focus on analyzing data rather than struggling with partial solutions.

Notebooks cleverly address the four main spreadsheet limitations noted above:

Data size. Because notebooks can utilize cloud computing resources, in-memory data analysis with larger datasets becomes quite possible. For example, with pandas, a popular open-source data analysis library for Python, you can easily work with tabular data containing many millions of rows. And just like with connected spreadsheets, querying data warehouses is straightforward from within notebooks.
Linearity. Notebooks are generally executed from top to bottom. There is a linear path from the first cell at the top of the notebook to the last cell at the bottom. This may seem limiting compared to the two-dimensional nature of spreadsheet grids and three-dimensional aspect of multisheet workbooks. However, adopting this construct greatly simplifies the experience and allows for straightforward data analysis “storytelling.” Notebooks have a clear beginning, middle, and end.
Quality. Notebooks can incorporate unit testing, which is a practice of writing code that validates other code. Python includes robust unit testing tools that can be used, for example, to verify that a calculation produces the expected output on a sample of data. The chances of errors are greatly reduced when that same calculation is run on the corresponding full dataset.
Presentation and shareability. Each cell of a notebook can contain formatted text using Markdown, a popular and simple formatting framework similar to HTML, or code in a number of languages, including Python, R, and more. When you “run” a code cell that produces graphics, such as a table or a data visualization plot, the results are rendered in-line. Presenting analysis from notebooks is quite possible—in fact, entire books have been written in notebooks. To share the analysis with your coworkers, you can save the notebook file to a storage system such as Google Drive and share the link.

Ready to move beyond spreadsheets? The Welcome to Colaboratory notebook is a great starting point, as is the AI Platform Notebooks documentation.

This post first appeared in Forbes.
Featured image by Pexels from Pixabay.

Beyond Spreadsheets

1. Data size

2. Linearity

3. Quality

4. Presentation and shareability

Notebooks

Published by jeffsternberg

Leave a comment Cancel reply

1. Data size

2. Linearity

3. Quality

4. Presentation and shareability

Notebooks

Share this:

Related

Published by jeffsternberg

Leave a comment Cancel reply