Seven Ways That Data Science Projects Fail

Bob Glushko

The pragmatic value of data science for solving business problems has made it a rival or replacement for information science from an industry perspective. I reviewed numerous data science projects and interviewed many data science experts to understand the factors that make projects successful. This work also revealed, through counterfactual reasoning and some confessions from the experts, why some data science projects fail. I identified seven causes of failure, and I explain here how “information science thinking” can prevent or lessen these problems in data science projects:

  • unrealistic expectations
  • no clear goals or problem statement
  • missing skills
  • problems with data
  • over-reliance on technology
  • poor deployment planning
  • poor maintenance planning

Unrealistic Expectations

It is impossible to avoid the hype about data science. Some of this excitement is justified. DeepMind, Google, Amazon, Microsoft, Facebook, OpenAI, and other firms that are very deep in technical talent and resources have made significant conceptual and technological breakthroughs in computer science, machine learning, and artificial intelligence.

But while this work gives some support to the hype and memes of data science, it creates unrealistic expectations for ordinary businesses, which can never afford similar capabilities, and which have specific problems to solve. As a result, firms are seduced into trying data science and machine learning with overly ambitious and vague goals like “fully exploit data to maximize customer value” when simpler tools might have been sufficient, and incremental goals might have been more achievable.

No Clear Goals or Problem Statement

Clear goals for data science projects only emerge when domain experts in business units identify specific problems related to prediction or classification that they have been unable to solve with their current methods and technology. This work raises the question “what data might help us solve these problems?” and this starting point highlights information science concerns about the sources, semantics, and value of business data. A business-driven effort is likely to use less advanced technology and have lower expectations than a technology-driven one. It will achieve results faster using smaller, more task-specific models that will fit more easily into the operating procedures of the business units.

Missing Skills

Companies launching data science efforts often do so with outside consultants or technologists borrowed from other parts of the company. But this staffing approach also signals that the most essential skills are technology ones, implicitly de-emphasizing the business and people skills that would be useful in scoping problems and designing the deployment process for using the models to make the business more successful.

The best solution is to find or hire people whose skill sets combine depth and breadth in multiple disciplines to enable effective communication and cooperation across them. Such people can play important roles throughout the entire lifecycle of a data science project.   

Problems with Data

The “big data” meme in data science emphasizes the quantity of data used in a project and asks “what data do we have” rather than “how can we best refine and organize our information so that we can use it to solve our business problems.” Companies launching data science projects often collect data from every business unit into a “data lake,” combining structured and unstructured data with different quality, semantics, and relevance. “Data lake” might conjure in your mind a pristine Lake Tahoe, and technology does exist to “clean” data to varying degrees. But the “garbage in, garbage out” adage still applies.

Data lakes are useful tools during exploratory stages to test problem hypotheses and modeling approaches, but for deployed models it is invariably better to resolve data incompatibility and quality issues by working with the business units to fix those problems. 
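To make the point about incompatible data concrete, here is a minimal sketch in Python of the kind of per-unit profile that could be brought to a conversation with the business units. Everything in it is hypothetical: the records, the field names, and the `profile_field` helper are invented for illustration and are not drawn from any project discussed in this article.

```python
from collections import Counter

# Hypothetical records pooled from two business units; the field names
# and values are invented for illustration only.
records = [
    {"unit": "sales", "customer_id": "C001", "region": "West"},
    {"unit": "sales", "customer_id": "C002", "region": "west"},
    {"unit": "support", "customer_id": None, "region": "W"},
    {"unit": "support", "customer_id": "C001", "region": "West"},
]


def profile_field(rows, field):
    """Count missing values and distinct spellings of one field, per unit."""
    report = {}
    for row in rows:
        entry = report.setdefault(
            row["unit"], {"missing": 0, "values": Counter()}
        )
        value = row.get(field)
        if value is None or value == "":
            entry["missing"] += 1
        else:
            entry["values"][value] += 1
    return report


region_report = profile_field(records, "region")
id_report = profile_field(records, "customer_id")

for unit, entry in region_report.items():
    print(unit, "region spellings:", sorted(entry["values"]))
for unit, entry in id_report.items():
    print(unit, "missing customer_id:", entry["missing"])
```

A profile like this only surfaces the disagreements (here, three spellings of one region and a missing identifier). Deciding which spelling is correct, and fixing it at the source, is still a question for the business units that own the data.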

Over-reliance on Technology

Business analysts have used spreadsheets and relational databases for decades, and these tools remain useful even though the innovative technologies of data science are more powerful.  It is true that TensorFlow, PyTorch, and other frameworks can automate many repetitive tasks of data engineering and model building, but much of this work is done in “black boxes” that do exotic statistical sampling or exhaustive search through parameter spaces that make the models impossible to interpret.  

Poor Deployment Planning

Companies often proudly announce they will start doing data science. But they should be saying that they “plan to use data science to build software that can be deployed by the business units to solve problems.”  Models handed over to business units with no guidance about how to integrate and manage them along with the software the business unit was already using are unlikely to be successful.

Poor Maintenance Planning

Similarly, companies often prioritize “moving fast” and, as a result, either fail to acknowledge the tradeoff in maintainability that this decision imposes or sometimes even embrace it. The technologists driving the effort don’t document many critical decisions about data selection, model training, and model optimization. When things inevitably go wrong, problems are hard to find and fix.
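Documenting those decisions need not be heavyweight. The following is a minimal sketch, in Python, of a decision log that could be saved alongside a model. The `ModelDecision` and `DecisionLog` classes, the model name, and the example entries are all hypothetical and are meant only to show how little structure is needed.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelDecision:
    """One recorded decision; all fields are free text."""
    stage: str       # e.g. "data selection", "model training"
    decision: str    # what was decided
    rationale: str   # why it was decided
    owner: str       # who can answer questions about it


@dataclass
class DecisionLog:
    """A running list of decisions for one model."""
    model_name: str
    entries: list = field(default_factory=list)

    def record(self, stage, decision, rationale, owner):
        self.entries.append(ModelDecision(stage, decision, rationale, owner))

    def to_json(self):
        # asdict converts the nested ModelDecision entries to plain dicts.
        return json.dumps(asdict(self), indent=2)


# Hypothetical usage; the model name and entries are invented.
log = DecisionLog("churn-model")
log.record(
    stage="data selection",
    decision="Excluded accounts opened in the last 30 days",
    rationale="Too little history to label reliably",
    owner="analytics team",
)
log.record(
    stage="model optimization",
    decision="Stopped tuning after validation score plateaued",
    rationale="Further search cost more than the expected gain",
    owner="analytics team",
)
print(log.to_json())
```

The format matters less than the habit: a plain JSON file like this gives whoever inherits the model a place to start when something goes wrong.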

A Way Forward

All seven of these ways to fail are different manifestations of the same cause: too narrow a disciplinary focus. Unsuccessful data science projects rely too heavily on computer scientists and statisticians and do not involve people with expertise in business, user research, information architecture, linguistics, and cognitive science, all of which come together in information science.

The way forward is clear. A successful data science project requires a multidisciplinary approach to ensure that all the relevant issues are considered, and that effort is directed to the most important ones. One way to ensure this is to build a team from experts in the needed disciplines, but a better way is to develop people who themselves have multidisciplinary skills. These “T-shaped” or “pi-shaped” people won’t be developed by requiring data science majors to take a few electives in ISchools or other departments, nor will they be developed by providing narrowly trained information science majors with some additional training in data science. We don’t need a truce between the two fields that carves up the world into separate domains they each control. Data science and information science need to recognize that together they can accomplish more than they can accomplish separately.

Cite this article in APA as: Glushko, B. (2023, June 7). Seven ways that data science projects fail. Information Matters, Vol. 3, Issue 6. https://informationmatters.org/2023/06/seven-ways-that-data-science-projects-fail/

Bob Glushko

Bob Glushko is an Adjunct Full Professor at the University of California at Berkeley in the Cognitive Science Program, which he joined in 2017 after fifteen years at the School of Information. Before joining the Berkeley faculty in 2002, he had more than twenty years of R&D, consulting, and entrepreneurial experience in information systems and service design, content management, electronic publishing, Internet commerce, and human factors in computing systems. He founded or co-founded four companies, including Veo Systems in 1997, which pioneered the use of XML for electronic business before its 1999 acquisition by Commerce One. Veo's innovations included the Common Business Library (CBL), the first native XML vocabulary for business-to-business transactions, and the Schema for Object-Oriented XML (SOX), the first object-oriented XML schema language. From 1999 to 2002 he headed Commerce One's XML architecture and technical standards activities and was named an "Engineering Fellow" in 2000. In 2014 Glushko's book, The Discipline of Organizing, was named an Information Science Book of the Year by the Association for Information Science and Technology. It has been adopted by nearly 80 schools and is now in its 4th edition, freely downloadable from ISchools.org or from Berkeley.pressbooks.pub.