Real Data Scientists Ship

Oct 16, 2018


The field of data science seems to be going through an identity crisis. The past 10 years or so have created something of a flash point for the field: while the foundations of the most popular open-source platforms had existed for decades, the proliferation of cheap, portable computation and easy-to-use wrappers for the base software created a critical mass of adoption and poof! The analytics Big Bang.

However, many companies that rode the hype train to the tune of millions of dollars’ worth of investment are struggling to see the payoff. McKinsey recently estimated that only 8% of data science endeavors demonstrate ROI.[1] 8%! I don’t want to get too technical about it, but as a data scientist, I would classify that proportion as “not good.” Unsurprisingly, many companies are reacting by discontinuing their internal analytics practices. By many indications, we are approaching the other side of the wave.

Much of the disillusionment around data science springs from the lack of specificity of the term itself. While the recent explosion in interest created a bumper crop of “data science” tags on LinkedIn, there is no clear definition of what a “data scientist” actually is. There is no board to grant certification, no standard exams to pass. The stark fact is that there are fewer hoops to jump through to put “data scientist” on your business card than there are to put “plumber.”

As the marketplace has shifted, we in the data science community have struggled to justify our existence. However, we seem to be trying to do so by focusing on the technical aspects of our role rather than its practical applications. This is apparent when one looks at the proportion of online data science publications and conference workshops focused on unveiling a shiny new algorithm or walking through a sexy application of neural nets on some web-scraped data.

On one hand, I understand the impulse to go that route. The quickest path to differentiating ourselves from lowly “analysts” or simple practitioners of “business intelligence” is to beef up on the science-y part. But on the other, by reducing the definition of “data science” to its technicalities, we may end up burning down our house to keep ourselves warm. We might be able to spew out an impressive enough cloud of jargon to make people smile and nod in a status report, but we won’t be able to demonstrate enough value to keep the lights on long-term.

The uncomfortable truth is that predictive modeling and machine learning are, far and away, the least important aspects of data science. In fact, I would argue that the vast majority of high-return data science initiatives you could draft up today could be solved with nothing more complex than a logistic regression. However, if you only listened to data scientists talking to each other, you’d think that recurrent neural networks/convolutional neural networks/gradient boosted machines/[insert other nerdy-sounding algorithm here] were the key to success. That isn’t to say that R&D isn’t important. Data science would never have reached the prominence it currently has if the brave pioneers of yesteryear hadn’t kept pushing the envelope. However, if the primary social currency of data scientists becomes myopically focused on technical sophistication, our contributions to the business will only become harder to illustrate.

Steve Jobs once famously said, “real artists ship,” meaning that the mark of a true artist is to put their art in other people’s hands. In the same vein, allow me to propose a rough algorithm to benchmark data science effectiveness, which I’m calling the Data Science Impact Factor. It goes something like this:

A: predictive lift. For every 10% increase in predictive accuracy beyond a simple average or mode, +1
B: business buy-in for your project. For every additional level in the org structure that is actively invested in the outcomes of your project, +1
Your boss: B=1
Boss’ boss: B=2
Boss’ boss’ boss: B=3
Y: estimated number of average daily working hours that would be directly affected by the outcomes of your work, summed over all individuals that would use it
D: number of days you estimate it would take to deploy your work in a production environment. If you don’t know, D = 1042.
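For the code-minded, here is a minimal sketch of how those four terms might be scored. The text above defines A, B, Y, and D but does not spell out how they combine, so the formula used here, (A + B + Y) / D, is purely an assumption made for illustration, as are the function names. It also assumes that "10% increase" means 10 percentage points of accuracy over the baseline, with one point of A awarded per full step.

```python
from typing import Optional

# Penalty value used when the deployment estimate is unknown.
UNKNOWN_DEPLOY_DAYS = 1042


def accuracy_points(model_accuracy_pct: float, baseline_accuracy_pct: float) -> int:
    """A: one point per full 10 percentage points above the baseline.

    Assumption: "10% increase" is read as absolute percentage points.
    """
    gain = round(model_accuracy_pct - baseline_accuracy_pct, 6)
    return max(0, int(gain // 10))


def dsif(a: int, b: int, y: float, d: Optional[float] = None) -> float:
    """Hypothetical Data Science Impact Factor.

    Assumption: the terms combine as (A + B + Y) / D. This combination is
    an illustrative guess, not a formula given in the text.
    """
    days = UNKNOWN_DEPLOY_DAYS if d is None else d
    if days <= 0:
        raise ValueError("D must be a positive number of days")
    return (a + b + y) / days


# Example: a model 20 points better than the baseline, backed by your
# boss's boss (B=2), touching 40 working hours a day, deployable in 20 days.
a = accuracy_points(85.0, 65.0)
score_known = dsif(a, b=2, y=40.0, d=20.0)
score_unknown = dsif(a, b=2, y=40.0)  # D defaults to 1042
print(score_known, score_unknown)
```

Under this assumed formula, the same project scores far lower once the deployment estimate is unknown, because the 1042-day default swamps everything in the numerator.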

Now, I’ll admit I’m being somewhat facetious here: the weights are arbitrary and the metrics are a bit haphazard. But the point of the DSIF is to illustrate that data science is a broad landscape, of which predictive models are only a tiny part, while organizational buy-in, deployment considerations, and impact on actual people all play tremendously important roles. If we continue to ignore the other elements of that equation, we will continue to make ourselves irrelevant: a shrinking pocket of inscrutable eggheads, hiding in the dusty corners of the business, making our nets ever more neural. But hey, at least we won’t be “analysts!”
