As more companies recognize the need for a data science platform, more vendors are claiming they have one. Increasingly, we see vendors describing their product as a “data science platform” without describing the features that make platforms so valuable.
With a good data science platform, data scientists should be able to:
- Find and understand past work, so they do not need to start from scratch when asking new questions.
- Explore data on large machines, without dealing with DevOps or infrastructure setup.
- Use new packages and tools safely, i.e., without breaking past work or disrupting environments for other users.
- Scale out compute resources to run many computationally intensive and complex experiments at once.
- Track their work (i.e., their experiments) so it is reproducible.
- Share work with peers and with non-technical colleagues in other areas of expertise, to get feedback on evolving research and results.
Data science work is only valuable insofar as it creates some impact on business outcomes. That means the work must be operationalized, i.e., integrated into business or decision-making processes. This can take the form of a predictive model exposed as an API, a web application for people to interact with, or a daily report that shows up in people’s inboxes.
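To make the first of those forms concrete, here is a minimal sketch of a model exposed as an HTTP API, using only Python's standard library. The linear "model" and its weights are hypothetical stand-ins; a real service would load a trained artifact (and add validation, logging, and authentication) instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical coefficients standing in for a trained model.
WEIGHTS = {"tenure": 0.3, "usage": 0.5}

def predict(features: dict) -> float:
    """Score a single record with the stand-in linear model."""
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in features.items())

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON body of features, e.g. {"tenure": 2, "usage": 1},
        # and respond with the model's score.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve predictions on port 8000:
# HTTPServer(("", 8000), PredictHandler).serve_forever()
```

Even a toy endpoint like this illustrates the point: once scoring sits behind an API, any business process that can make an HTTP request can consume the model's output.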
In addition to helping researchers develop better models faster, platforms also bring a critical capability to teams and to managers. As companies invest more in quantitative research, they should build institutional knowledge and best practices to make the team even more effective over time.
A core value of a platform is its ability to centralize knowledge and research assets. That gives managers transparency into how people are working; it reduces key-man risk; it makes it easier to onboard people; it improves shared context, thus increasing creativity; and it accelerates the pace of research by making it possible to build upon past work rather than starting from scratch.