Spreadsheet Understanding

The Spreadsheet Understanding is a subproject of the Research.PURe project that aims to apply program understanding and reverse engineering techniques to spreadsheets.


Members


Motivation

Spreadsheet tools can be viewed as programming environments for non-professional programmers. These so-called ''end-user'' programmers include teachers, secretaries, physicists, biologists, accountants, managers, consultants. For all these people, who are not employed or trained as programmers, spreadsheets are more often than not the programming environment of choice.

In fact, end-user programmers vastly outnumber professional programmers. By some estimates, the number of end-user programmers in the U.S. alone is expected to reach 55 million by 2005, as compared to only 2.75 million professional programmers. Through the efforts of these people, huge numbers of large and often complex spreadsheets are created and maintained every day in all companies and public institutes around the world. For these organizations, spreadsheets represent business-critical assets with significant economic value.

Spreadsheets, like any other program, can unfortunately easily turn from asset into liability. Development, maintenance, and extension of spreadsheets is notoriously error-prone. It is difficult to grasp the logic encoded in a spreadsheet, especially when it is large and developed by others. It is difficult to test spreadsheets, to assess their quality, and to validate their correctness. In this sense, a legacy problem exists with respect to spreadsheets as much, or perhaps even more, as with respect to conventional programs.

This observation leads to the suggestion that techniques for program understanding and reverse engineering, could be likewise applied, in adapted form, to spreadsheets. Just as they are useful for dealing with legacy software in the conventional sense, so they could be useful for dealing with spreadsheet legacy.

In fact, spreadsheets, when viewed as programming language, can be characterized as a particularly low-level one. Their language of cell references and formulas offers only low-level expressiveness compared to regular programming languages. They offer no support for abstraction, encapsulation, or structured programming. For this reason, it is fair to state that spreadsheets could benefit even more from program understanding and reverse engineering techniques than conventional programs.


State of the Art

Martin Erwig et al.

The UCheck project

In this project, Martin Erwig, Robin Abraham and Margaret Burnett defined a unit system for spreadsheets that allows to reason about the correctness of formulas in concrete terms. The fulcral point of this project is the high flexibility of the unit system developed, both in terms of error reporting and reasoning rules, that increases the possibility of a high acceptance among end users.

Selected Publications

  • How to Communicate Unit Error Messages in Spreadsheets, Robin Abraham and Martin Erwig, 1st Workshop on End-User Software Engineering, 52-56, 2005 (pdf)
  • Header and Unit Inference for Spreadsheets Through Spatial Analyses, Robin Abraham and Martin Erwig, IEEE Int. Symp. on Visual Languages and Human-Centric Computing, 165-172, 2004 (pdf), Best Paper Award
  • Visually Customizing Inference Rules About Apples and Oranges, Margaret M. Burnett and Martin Erwig, 2nd IEEE Int. Symp. on Human Centric Computing Languages and Environments, 140-148, 2002 (pdf)
  • Adding Apples and Oranges, Martin Erwig and Margaret M. Burnett, 4th Int. Symp. on Practical Aspects of Declarative Languages, LNCS 2257, 173-191, 2002 (pdf)

The Gencel project

Spreadsheets are likely to be full of errors and this can cause organizations to lose millions of dollars. However, finding and correcting spreadsheet errors is extremly hard and consumes a lot of time.

Probably inspired in database systems (personal guess) and how database management systems (DBMS) mantain the database consistent after every update and transaction, Martin Erwig, Robin Abraham, Irene Cooperstein and Steve Kollmansberger designed and implemented Gencel. Gencel is an extension to Excel, based on the concept of a spreadsheet template, which captures the essential structure of a spreadsheet and all of its future evolutions. Such a template ensures that every update is safe, avoiding reference, range and type errors.

Templates can be created with the editor ViTSL, similar to Excel (with additional features to design templates).

Selected Publications

  • Gencel: A Program Generator for Correct Spreadsheets, Martin Erwig and Robin Abraham and Irene Cooperstein and Steve Kollmansberger, Journal of Functional Programming, 2005, to appear (pdf)

  • Visual Specifications of Correct Spreadsheets, Robin Abraham, Martin Erwig, Steve Kollmansberger, and Ethan Seifert, IEEE Int. Symp. on Visual Languages and Human-Centric Computing, 2005, to appear (pdf)

  • Automatic Generation and Maintenance of Correct Spreadsheets, Martin Erwig, Robin Abraham, Irene Cooperstein, and Steve Kollmansberger, 27th IEEE Int. Conf. on Software Engineering, 136-145, 2005 (pdf)

  • Goal-Directed Debugging of Spreadsheets, Robin Abraham and Martin Erwig, VLHCC '05: Proceedings of the 2005 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC'05), 2005 (pdf)

Simon Peyton Jones et al.

Simon Peyton Jones, together with Margaret Burnett and Alan Blackwell, describe extensions to Excel that allow to integrate in the spreadsheet grid user-defined functions. That take us from a end-user programming paradigm to an extended system, which provides some of the capabilities of a general programming language.

Selected Publications

  • A user-centred approach to functions in Excel. Simon Peyton Jones, Margaret Burnett, Alan Blackwell. Proc International Conference on Functional Programming, Uppsala, Sept 2003 (ICFP'03), pp 165-176, 12 pages. (pdf)
  • Champagne Prototyping: A Research Technique for Early Evaluation of Complex End-User Programming Systems. Alan Blackwell, Margaret Burnett, Simon Peyton Jones. 12 pages, March 2004. (pdf)

Josť Nuno Oliveira

Josť Nuno Oliveira describes a pointfree calculus and refinement laws for spreadsheet normalization. He shows how a binary representation of an n-ary relation makes sense in database theory context, and how lengthy formulae can become very simple and elegant. As an example, it follows the definition of functional dependency satisfiability in the two calculus:

  • Pointwise calculus (classical)

form11.png

  • Pointfree calculus

form22.png

Selected Publications

  • Functional dependency theory made 'simpler', J.N. Oliveira, Technical report, DI-Research.PURe-05.01.01, January 2005. (pdf)


Our project

Research plan

The project proposal for which a BIC grant has been awarded is available:

  • BIC research proposal: pdf

In this proposal the following deliverables are expected:

  1. A Spreadsheet Transformation and Analysis API -- Develop a Haskell library that supports the processing of spreadsheets. The functionality offered by this library should include: (1) conversion of spreadsheets to and from XML, including structured representation of formulas. (2) marshaling between the XML repre- sentation of spreadsheets and the strongly-typed Haskell representation. (3) functions for querying and transforming the various elements of the Haskell representation. The Spreadsheet API should become part of the UMinho Haskell Library initiative.
  2. A Spreadsheet Type Reconstruction Component -- Based on the above API, implement an algorithm for reconstruction of types in a spreadsheet. The algorithm analysis the use of cells in formulas, and computes type and sub-type relations between cells. This component will be useful for testing purposes and error detection in spreadsheets.
  3. A Relational Data Model Extraction Component -- Based on the Spreadsheet API, implement an algorithm for extraction of the underlying data model of a spreadsheet. In particular, the algorithm detects which cells are used as identifying keys. Based on this, a relational data model will be inferred, including primary keys, foreign key relationships and integrity constrains. This component will be useful to support migration from spreadsheets solutions to databases solutions in industrial contexts.
  4. Spreadsheet Quality Assessment Case Study -- The developed components for type reconstruction and data model extraction will be applied to a set of commercially used spreadsheets with the ob jective of quality assessment. In particular, they will be used to detect errors, and uncover the underlying data model of the involved spreadsheets. The results and the methodology of the quality assessment will be described in a spreadsheet quality assessment report.

Possible integration with other projects

To promote the adoption of the Gencel system, tools that extract specifications from arbitrary spreadsheets are needed. Deliverable 3, based on Josť Nuno Oliveira's research, could be a nice tool to integrate with Gencel.


Documents

  • Presentation at Research.PURe WorkShop: pdf.


Links