I recently came across several articles about failing data science projects (according to Gartner, 85% of big data projects are never fully productionised). The articles blame misaligned objectives, management resistance, unrealistic expectations, poor communication with stakeholders and poor data infrastructure. I think this is basically correct but too diplomatic. Here’s what I think:
The typical data science project doesn’t make any sense whatsoever and should never have been attempted.
Data science has a huge solution-looking-for-a-problem situation going on. Enterprise managers trying to appear data-driven, startup founders wanting to impress investors with cool buzzwords and proprietary IP, young data scientists themselves itching to try the newest technique from a paper - there are a lot of people looking for an excuse to do ML/AI/DL. When they finally find it, they (or rather, we) don’t try too hard to see if it makes business sense. As a result, the majority of data science projects never move beyond the stage of slides and Jupyter notebooks.
Here is my subjective, non-exhaustive list of types of nonsense data science:
1. Vanity data science
By far the most common failure mode for a data science project is to never be productionised because of lack of infrastructure or lack of interest on the business side. These projects were only attempted because someone thought they sounded cool, in the complete absence of a realistic business case. This could have been avoided by asking a simple question before starting the project:
‘And then what?’
So you apply your DBSCAN on top of your vectors from Word2Vec to assign your customers to clusters - and then what?
Or you run sentiment analysis on all the comments on your website - and then what?
Or you train a GAN on all the images in your database - and then what?
‘How do we productionise the result? Do we have the infrastructure for it? What will the benefit be if we manage to do it?’
If the only answer is ‘and then we prepare slides to show to stakeholders’ - I suggest that we skip the ‘train the neural network’ bit and prepare the slides already. In the unlikely event that the stakeholders have a real use case for the classifier, we can start working on the use case immediately. Otherwise we move on to the next task having saved ourselves weeks, maybe months of unnecessary work.
2. Busywork
Another, less blatant way for a data science project to not make sense is for it to be sort of useful but completely not worth the effort. Like training a bespoke deep learning model to analyse 20 pages of text. Or an image quality assessment tool that saves a real estate agent 5 seconds per 1h house visit.
The question I ask stakeholders (sometimes that means asking myself) to address this problem is:
‘How much is the solution to this problem worth to you? If it’s so valuable, why haven’t you paid people to do it manually before?’
The set of good answers to this question includes:
we have been doing it manually, automating it would save us £X/year
and
we could do it manually but being able to do it in real time would be a game-changer, worth £X.
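To put rough numbers on that question, here is a minimal back-of-the-envelope sketch. The function names and every figure in it (visits per year, hourly rate, build and running costs) are hypothetical, chosen only to illustrate the arithmetic; plug in your own.

```python
def annual_saving(seconds_saved_per_task, tasks_per_year, hourly_rate):
    """Value of the time saved per year, in the currency of hourly_rate."""
    hours_saved = seconds_saved_per_task * tasks_per_year / 3600
    return hours_saved * hourly_rate


def payback_years(build_cost, annual_running_cost, saving_per_year):
    """Years until the project pays for itself, or None if it never does."""
    net_saving = saving_per_year - annual_running_cost
    if net_saving <= 0:
        return None
    return build_cost / net_saving


# Hypothetical figures: 5 seconds saved per house visit,
# 1,000 visits a year, agent time valued at 30 per hour.
saving = annual_saving(seconds_saved_per_task=5, tasks_per_year=1000, hourly_rate=30.0)

# Hypothetical costs: 20,000 to build, 500 a year to keep running.
payback = payback_years(build_cost=20_000, annual_running_cost=500, saving_per_year=saving)

print(f"Saving per year: {saving:.2f}")
print("Never pays back" if payback is None else f"Pays back in {payback:.1f} years")
```

If the payback comes out as 'never' or as several decades, that is a strong hint the project belongs in the Busywork category.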
3. Reinventing the wheel
A special subcategory of ‘obviously not worth it’ projects contains ones where a solution already exists in a commoditised form on AWS, GCP, Azure etc. Examples include OCR, speech to text, generic text and image classification, object detection, named entity recognition and more.
Trying to build (for instance) a better or cheaper OCR than the one Google is selling is first of all hopeless but more importantly a distraction from your actual business (unless your business is selling OCR, in which case good luck!).
I sometimes hear data scientists complaining that it’s no fun calling APIs for everything and they would rather build ML models themselves. I disagree. For one, I find solving an already solved problem depressing. Secondly, outsourcing the most generic ML tasks frees up your time to do higher-level tasks and tasks specific to your business. If you really have nothing to do in your company except for reinventing the wheel then you’re in the wrong company.
4. Wishful thinking
The flipside of Busywork Data Science is Wishful Thinking Data Science: attacking problems that it would be fantastic to have solved but which are obviously not solvable with the given data.
I most often see this kind of thing with predicting the future (which is the hardest period to predict).
Wouldn’t it be great to know the house price index/traffic on the website/demand for a product a year in advance? Can you fit your neural network/hidden Markov model to the chart with historical data to make a forecast?
I can fit anything to anything but that won’t tell you much that a hand-drawn trend line wouldn’t reveal. Next year’s house prices depend on a million different external political, economic and demographic factors that are either unpredictable or not predictable from price data alone. How the Prime Minister is going to handle Brexit is simply not something that can be divined from a squiggly line of past house prices.
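One cheap sanity check before believing any such forecast is to hold out the most recent data and see whether the model beats a plain straight-line trend. The sketch below is only an illustration of that check, not a forecasting method: the helper names are made up, the trend line is an ordinary least-squares fit standing in for the hand-drawn one, and it needs at least two historical points.

```python
def fit_trend(ys):
    """Least-squares straight line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Requires at least two points.
    """
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def trend_forecast(history, horizon):
    """Extend the fitted straight line `horizon` steps past the history."""
    slope, intercept = fit_trend(history)
    n = len(history)
    return [intercept + slope * (n + h) for h in range(horizon)]


def mean_abs_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def beats_trend_line(history, actual_future, model_forecast):
    """True only if the model's error on held-out data is lower than the
    error of the straight-line baseline fitted to the same history."""
    baseline = trend_forecast(history, len(actual_future))
    return (mean_abs_error(actual_future, model_forecast)
            < mean_abs_error(actual_future, baseline))
```

Passing this check does not make the forecast trustworthy, since the held-out period may simply have been uneventful, but a fancy model that fails it has not earned its complexity.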
Sometimes projects like these are pitched by naive managers and CEOs who think AI is a magic dust you can sprinkle over a problem and make the impossible possible. More often it involves people who either know the prediction won’t work or don’t care enough to find out, their only concern being whether the technology will impress the customer.
5. If you don’t know where you’re going, any road will take you there
This is when the client has a vaguely data-sciencey task but adamantly refuses to specify the objective or acceptance criteria.
- We need you to calculate a score for every company.
- Ok. What do you want this score to measure or predict?
- Dunno. Like, how good they are?
- Good in what way? Good to work at? Good to invest in? A credit rating maybe?
- No, nothing mundane like that.
- Then what?
- You’re the data scientist, we were hoping you would tell us.
- …
- Be sure to include Twitter data!
It’s a normal part of a data scientist’s job to act as a psychoanalyst helping the client discover and articulate what they actually want. But sometimes there is just nothing there to discover because the whole project is just an empty marketing gimmick or an exercise in bureaucratic box-checking.
Conclusion
In the 1985 sci-fi comedy movie Weird Science, a pair of teenagers make a simulation of a perfect woman on their home computer. After they hook the computer to a plastic doll and hack into a government system, a power surge causes the magical dream woman to come to life.
Today even small children and the elderly are familiar enough with computers to know they don’t work like that. But replace the government system with the cloud, throw in some deep learning references and you’ve got yourself a plausible 2019 movie premise.
Bullshit data science happens because decision makers have the same level of understanding of, and attitude towards, data science that 1980s audiences had towards computers. They have unrealistic expectations, are easily bamboozled by it, don’t know how to use it and don’t trust it enough to use it where it would make a real difference.
This will eventually change the same way it did with computers in general. The current generation of data scientists will start graduating into management roles, founding their own startups, eventually retiring - same as happened with the programmers from the 1980s.
Until then, we are going to have to fight the bullshit however we can. For data scientists themselves that entails paying more attention to the ‘why’ of what they’re doing, not just the ‘how’. And for the clients the first step would be to involve an experienced and business savvy data scientist from the get go, to help shape what needs to be done instead of just carrying out (potentially nonsensical) orders.