Feed aggregator

Better health insights by unlocking data

O'Reilly Radar - Thu, 2018/07/19 - 13:00

Mahdi Yusuf discusses new ways to unlock potential from the data you generate with smart health devices.

Continue reading Better health insights by unlocking data.

Categories: Technology

How to use goals to improve a product’s UX

O'Reilly Radar - Thu, 2018/07/19 - 07:15

Tracking each stage of the user journey allows you to measure conversions and improve the user experience.

Continue reading How to use goals to improve a product’s UX.

Categories: Technology

Specialized hardware for deep learning will unleash innovation

O'Reilly Radar - Thu, 2018/07/19 - 04:05

The O’Reilly Data Show Podcast: Andrew Feldman on why deep learning is ushering in a golden age for compute architecture.

In this episode of the Data Show, I spoke with Andrew Feldman, founder and CEO of Cerebras Systems, a startup in the blossoming area of specialized hardware for machine learning. Since the release of AlexNet in 2012, we have seen an explosion in activity in machine learning, particularly in deep learning. A lot of the work to date happened primarily on general purpose hardware (CPU, GPU). But now that we’re six years into the resurgence in interest in machine learning and AI, these new workloads have attracted technologists and entrepreneurs who are building specialized hardware for both model training and inference, in the data center or on edge devices.

Continue reading Specialized hardware for deep learning will unleash innovation.

Categories: Technology

Four Short Links: 19 July 2018

O'Reilly Radar - Thu, 2018/07/19 - 03:25

Microrobotics, Adaptive Chips, ACM Ethics, and Data Journalism

  1. DARPA's Insect-Scale Robot Olympics (IEEE) -- Yesterday, DARPA announced a new program called SHRIMP: SHort-Range Independent Microrobotic Platforms. The goal is “to develop and demonstrate multi-functional micro-to-milli robotic platforms for use in natural and critical disaster scenarios.”
  2. DARPA Changing How Electronics Are Made (IEEE) -- Step two, to be kicked off at the summit, is something we call “software-defined hardware.” That’s where the hardware is smart enough to reconfigure itself to be the type of hardware you want, based on an analysis of the data type that you’re working on. In that case, the very hard thing is to figure out how to do that data introspection, how to reconfigure the chip on a microsecond or millisecond timescale to be what you need it to be. And more importantly, it has to monitor whether you’re right or not, so that you can iterate and be constantly evolving toward the ideal solution.
  3. ACM Updates Ethics Code -- ACM revised their code of ethics to include references to emerging technology, discrimination, and data policy. They're also releasing case studies and an Ask An Ethicist advice column to help people understand how to apply the principles.
  4. Data Journalism Workshop Notes -- Harkanwal Singh gave a workshop on data journalism, which yielded these excellent notes via Liza Bolton.

Continue reading Four Short Links: 19 July 2018.

Categories: Technology

The whole is greater than the sum of its parts

O'Reilly Radar - Wed, 2018/07/18 - 13:00

Christopher Ferris says Hyperledger was formed to help deliver blockchain technology for the enterprise. Two and a half years later, that goal is being realized.

Continue reading The whole is greater than the sum of its parts.

Categories: Technology

Live coding: OSCON edition

O'Reilly Radar - Wed, 2018/07/18 - 13:00

Suz Hinton live codes an entertaining hardware solution in front of your eyes.

Continue reading Live coding: OSCON edition.

Categories: Technology

Open source and open standards in the age of cloud AI

O'Reilly Radar - Wed, 2018/07/18 - 13:00

Tim O'Reilly looks at how we can extend the values and practices of open source in the age of AI, big data, and cloud computing.

Continue reading Open source and open standards in the age of cloud AI.

Categories: Technology

Recognizing cultural bias in AI

O'Reilly Radar - Wed, 2018/07/18 - 13:00

Camille Eddy explains what we can do to create culturally sensitive computer intelligence and why that's important for the future of AI.

Continue reading Recognizing cultural bias in AI.

Categories: Technology

Highlights from the O'Reilly OSCON Conference in Portland 2018

O'Reilly Radar - Wed, 2018/07/18 - 13:00

Watch highlights covering open source, AI, cloud, and more. From the O'Reilly OSCON Conference in Portland 2018.

People from across the open source world came together in Portland, Oregon for the O'Reilly OSCON Conference. Below you'll find links to highlights from the event.

Open source and open standards in the age of cloud AI

Tim O'Reilly looks at how we can extend the values and practices of open source in the age of AI, big data, and cloud computing.

O’Reilly Radar: Open source tool trends—What our users tell us

Roger Magoulas shares insights about the open source tools ecosystem based on analysis of usage and search data from O'Reilly's learning platform.

Live coding: OSCON edition

Suz Hinton live codes an entertaining hardware solution in front of your eyes.

Drive innovation and collaboration through open source projects

Ying Xiong explains how Huawei collaborates with industry leaders and innovates through open source projects.

Recognizing cultural bias in AI

Camille Eddy explains what we can do to create culturally sensitive computer intelligence and why that's important for the future of AI.

The whole is greater than the sum of its parts

Christopher Ferris says Hyperledger was formed to help deliver blockchain technology for the enterprise. Two and a half years later, that goal is being realized.

The next big wave

Zaheda Bhorat explores the next wave of open source contributions.

Better health insights by unlocking data

Mahdi Yusuf discusses new ways to unlock potential from the data you generate with smart health devices.

Building with open source at the world’s largest home improvement retailer

Angie Brown explains how Home Depot uses open source in its stores, online search, order management, analytics, and more.

Open source opens doors for vets

Jerome Hardaway explains how Vets Who Code uses open source to create job opportunities for veterans.

20 years later, open source is as important as ever

Sarah Novotny outlines two reasons why open source continues to be important: choice and infrastructure.

Open sourcing quantum: Get ready to help build a new future

Jay Gambetta explores Qiskit, an open-source framework that aims to make quantum computing accessible for everyone.

The importance of community

Patricia Posey draws on her non-traditional journey into tech to illustrate how honest investments can build a sustainable community that is integral to the advancement of its members.

O'Reilly Open Source Awards 2018

The O’Reilly Open Source Awards recognize individual contributors who have demonstrated exceptional leadership, creativity, and collaboration in the development of open source software.

Continue reading Highlights from the O'Reilly OSCON Conference in Portland 2018.

Categories: Technology

Drive innovation and collaboration through open source projects

O'Reilly Radar - Wed, 2018/07/18 - 13:00

Ying Xiong explains how Huawei collaborates with industry leaders and innovates through open source projects.

Continue reading Drive innovation and collaboration through open source projects.

Categories: Technology

The next big wave

O'Reilly Radar - Wed, 2018/07/18 - 13:00

Zaheda Bhorat explores the next wave of open source contributions.

Continue reading The next big wave.

Categories: Technology

Data collection and data markets in the age of privacy and machine learning

O'Reilly Radar - Wed, 2018/07/18 - 04:00

While models and algorithms garner most of the media coverage, this is a great time to be thinking about building tools in data.

In this post I share slides and notes from a keynote I gave at the Strata Data Conference in London at the end of May. My goal was to remind the data community about the many interesting opportunities and challenges in data itself. Much of the focus of recent press coverage has been on algorithms and models, specifically the expanding utility of deep learning. Because large deep learning architectures are quite data hungry, the importance of data has grown even more. In this short talk, I describe some interesting trends in how data is valued, collected, and shared.

Economic value of data

It’s no secret that companies place a lot of value on data and the data pipelines that produce key features. In the early phases of adopting machine learning (ML), companies focus on making sure they have a sufficient amount of labeled (training) data for the applications they want to tackle. They then investigate additional data sources that they can use to augment their existing data. In fact, among many practitioners, data remains more valuable than models (many talk openly about what models they use, but are reticent to discuss the features they feed into those models).

But if data is precious, how do we go about estimating its value? Those of us who build machine learning models can estimate the value of data by examining the cost of acquiring training data:

  • Many of us at some point in our data science work have augmented our existing data sets with external data sources that we paid for. Data providers like Bloomberg, Nielsen, Dun & Bradstreet, and more recent entrants like Planet Labs offer subscription services for a variety of data sets.
  • We also know exactly how much it costs to build training data sets from scratch. With the rise of data-hungry methods like deep learning, there is growing demand for services like Figure Eight and Mighty AI that help companies label images, video, and other data sources.
  • For specific data types (like images), there are new companies like Neuromation, DataGen, and AI.Reverie that can help lower the cost of training data through tools for generating synthetic data.

Another way we can glean the value of data is to look at the valuation of startups that are known mainly for their data sets. I list a few examples from the media industry, but there are numerous new startups that collect aerial imagery, weather data, in-game sports data, and logistics data, among other things. If you are an aspiring entrepreneur, note that you can build interesting and highly valued companies by focusing on data.

The reason data scientists and data engineers want more data is so they can measure its impact on their data products. This leads to another way one can estimate the value of data: by observing the incremental impact of new data sources on existing data products.
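
As a minimal sketch of that kind of measurement (the file name, column names, and metric here are hypothetical), you can train the same model with and without the new source and treat the lift in a held-out metric as a rough proxy for the value of the added data:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("training_data.csv")                # hypothetical data set
    baseline_cols = ["feature_a", "feature_b"]            # features we already have
    augmented_cols = baseline_cols + ["external_score"]   # adds the new data source

    train, test, y_train, y_test = train_test_split(
        df, df["label"], test_size=0.3, random_state=0)

    def held_out_auc(columns):
        """Fit the same model on a given feature set and report held-out AUC."""
        model = LogisticRegression(solver="liblinear").fit(train[columns], y_train)
        return roc_auc_score(y_test, model.predict_proba(test[columns])[:, 1])

    lift = held_out_auc(augmented_cols) - held_out_auc(baseline_cols)
    print(f"Incremental AUC from the external data source: {lift:.3f}")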

Relying on external data is not without risks. A security breach or other technical problem might cut off your supply of data. More likely, as concerns about data collection and data privacy have grown stronger, sharing and selling data has come under more scrutiny. Loss of access can also indirectly demonstrate how valuable data is:

  • the impact of a data mishap on a company’s market cap is observable
  • the impact of “loss of access” to a data source on your ML models is something you can measure

The state of data privacy: Views of key stakeholders

In recent months, we have had a change in expectations and attitudes around data privacy and data collection. Let’s examine the current situation from the perspective of some key stakeholders: users, regulators, companies, and data professionals.

User expectations for what happens to the data they generate have changed. In light of recent headlines (Facebook and Cambridge Analytica), the general public is much more aware of data collection, storage, and sharing. Concerns about data privacy cut across countries, and contrary to popular perception, data privacy is a growing concern among Chinese users. The conversation has gone beyond data privacy; users are calling for:

  • better transparency--they want to know what data is being collected and with whom data is being shared
  • control over how their data is shared and used
  • limits on the nature and duration of data sharing

Regulators across many countries and localities are moving forward with landmark legislation: Europe (GDPR) and California (Consumer Privacy Act) have placed concepts like “user control” and “privacy-by-design” at the forefront for companies wanting to deploy analytic products. Australia recently added data breach notification to its existing data privacy rules.

A company’s stance on data privacy and data monetization is becoming a competitive differentiator for some small and large firms. Apple in particular is raising the bar on data privacy and collection, but other companies are following suit (companies in China are also beginning to send similar signals to users). Moving forward, companies are going to have to adjust their services--not only in light of regulations, but also in light of the changing expectations of users.

Data professionals are also taking data privacy very seriously. Many have implemented training programs within their companies, and some are already exploring new privacy-preserving tools and methods for building analytic products. Going beyond privacy, the next generation of data scientists and data engineers is undergoing training and engaging in discussions pertaining to ethics. Many universities are offering courses; some, like UC Berkeley, offer several.

Data liquidity in an age of privacy: New data exchanges

We are now operating during a period of heightened awareness about data privacy and user control. How do we continue to provide liquidity in an age when machine learning models require so much data?

Many organizations maintain data silos: separate systems that limit access between internal teams, and systems that store data they aren’t willing to share with external users. At the same time, companies usually have data they would be willing to share with others. The problem is that there aren’t convenient mechanisms or common formats for making sharing easy. Imagine if there were tools and mechanisms to make these data sets sharable (through a data commons). Individual data silos could then sit on a strong foundation of external data that all participating organizations can use to augment their ML models.

In a 2017 post, Roger Chen described the three major approaches companies have taken when building data exchanges (or “data networks”):

  • Open data: a lack of market incentives makes the open data model hard to scale, and open data sets can be particularly prone to challenges with data heterogeneity.
  • Data cooperative: requires the most trust and suffers from cold-start challenges. With that said, there has been great progress building tools for data commons in the biomedical arena, and lessons learned in that domain should translate to other fields.
  • Data brokerage: participants gain financial reward for aggregating and selling data, which potentially leads to more scale and liquidity.

One trend that I’ve come across recently is decentralized data networks. The idea is to build decentralized data exchanges using blockchains and distributed ledgers, with an incentive structure based on cryptocurrencies. Some startups in this space are specifically targeting machine learning--they aspire to enable data scientists to “train models without seeing the data.”

The most interesting initiative comes from a San Francisco startup called Computable Labs. They are building open source, decentralized infrastructure that will allow companies to securely share data and models. In the process, they want to “make blockchain networks compatible with machine learning computations.”

Closing thoughts

While models and algorithms garner most of the media coverage, this is a great time to be thinking about building tools in data. We are in an age when machine learning models demand huge amounts of data, and many companies have just begun deploying ML models. There are the core topics of security and privacy, but there are many other compelling and challenging problems and opportunities that touch on ethics, economic value, data liquidity, user control, and decentralization.

Related resources:

Continue reading Data collection and data markets in the age of privacy and machine learning.

Categories: Technology

Editing our world: CRISPR and the future of genomics

O'Reilly Radar - Wed, 2018/07/18 - 03:30

The basic technology behind gene editing and a conversation between Jennifer Doudna and Siddhartha Mukherjee.

In the University of California, Berkeley’s antique-inspired Hertz Music Hall, sit two of contemporary science’s most prominent and of-the-moment figures. Jennifer Doudna, the scientist credited with some of the greatest advancements in CRISPR technology, is a Berkeley native, at least by virtue of occupation. She is the founder of the well-known Doudna Lab as well as the director of the Innovative Genomics Institute, a joint venture between UC Berkeley and UC San Francisco. Siddhartha Mukherjee is a physician, researcher, author, and Assistant Professor of Medicine at Columbia University Medical Center. Best known for his books, The Emperor of All Maladies: A Biography of Cancer and The Gene: An Intimate History, Mukherjee is well-versed in the intricacies of medical biology as well as the impact that advancements in genomics can have on modern medicine. In a conversation between these two influential thinkers, the prospects for CRISPR are discussed in relation to scientific research, medical treatments, and society at large.

Continue reading Editing our world: CRISPR and the future of genomics.

Categories: Technology

Four short links: 18 July 2018

O'Reilly Radar - Wed, 2018/07/18 - 03:15

Program Synthesis, Climate Change, Remote Teams, and Go Memory Management

  1. Program Synthesis in 2018 -- this is a readable and deeply informative guide to the state of the art in program synthesis (generating programs from specifications). I'm highly interested in this field, as it's a possible future of programming, and when advances are made in useful areas, it will be highly disruptive.
  2. Lights Out: Climate Change Risk to Internet Infrastructure -- We align the data formats and assess risks in terms of the amount and type of infrastructure that will be under water in different time intervals over the next 100 years. We find that 4,067 miles of fiber conduit will be under water and 1,101 nodes (e.g., points of presence and colocation centers) will be surrounded by water in the next 15 years. We further quantify the risks of sea level rise by defining a metric that considers the combination of geographic scope and internet infrastructure density. We use this metric to examine different regions and find that the New York, Miami, and Seattle metropolitan areas are at highest risk.
  3. Managing Your Remote Developer Team if You're Non-Technical -- I feel like this also applies to technical remote managers, too.
  4. Getting to Go: Memory Management and Garbage Collection -- The Go language features, goals, and use cases have forced us to rethink the entire garbage collection stack and have led us to a surprising place. The journey has been exhilarating. This talk describes our journey. Detailed and for a technical audience.

Continue reading Four short links: 18 July 2018.

Categories: Technology

Of oaths and checklists

O'Reilly Radar - Tue, 2018/07/17 - 04:00

Oaths have their value, but checklists will help put principles into practice.

"Oaths? We don't need no stinkin' oaths" (with apologies to Humphrey Bogart in “Treasure of the Sierra Madre”).

Over the past year, there has been a great discussion of data ethics, motivated in part by discomfort over “fake news,” targeted advertising, algorithmic bias, and the effect that data products have on individuals and on society. Concern about data ethics is hardly new; the ACM, IEEE, and the American Statistical Association all have ethical codes that address data. But the intensity with which we’ve discussed ethics shows that something significant is happening: data science is coming of age and realizing its responsibilities. A better world won’t come about simply because we use data; data has its dark underside.

The recent discussion frequently veers into a discussion of data oaths, looking back to the ancient Hippocratic Oath for doctors. Much as we appreciate the work and the thought that goes into oaths, we are skeptical about their value. Oaths have several problems:

  • They're one-shots. You take the oath once (if at all), and that's it. There's no reason to keep it in the front of your consciousness. You don’t recite it each morning. Or evaluate regularly whether you’re living up to the ideals.
  • Oaths are a set of very general and broad principles. Discussions of the Hippocratic Oath begin with the phrase "First, do no harm," words that don’t actually appear in the oath. But what does “do no harm” mean? For centuries doctors did very little but harm (many people died because doctors didn’t believe they needed to wash their hands). The doctors just didn't know they were doing harm. Nice idea, but short on the execution. And data science (like medicine) is all about execution.
  • Oaths can actually give cover to people and organizations who are doing unethical work. It’s easy to think “we can’t be unethical, because we endorsed this oath.” It’s not enough to say “don’t be evil.” You have to not be evil.
  • Oaths do very little to connect theories and principles to practice. It is one thing to say "researchers must obtain informed consent"; it's an entirely different thing to get informed consent at internet scale. Or to teach users what "informed consent" means.

We are not suggesting that the principles embodied in oaths aren't important, just that they don't get us to the endpoint we want. They don't connect our ideas about what's good or just to the practices that create goodness and justice. We can talk a lot about the importance of being fair and unbiased without knowing about how to be fair and unbiased. At this point, the oath actually becomes dangerous: it becomes a tool to convince yourself that you're one of the good guys, that you're doing the right thing, when you really don't know.

Oaths are good at creating discussion—and, in the past year, they have created quite a lot of discussion. The discussion has been tremendously helpful in making people aware of issues like algorithmic fairness. The discussion has helped software developers and data scientists to understand that their work isn’t value-neutral, that their work has real impact, both good and bad, on real people. And there has been a vigorous debate about what self-government means for data scientists, and what guiding principles would last longer than a few years. But we need to take the next step, and connect these ideas to practice. How will we do that?

In 2009, Atul Gawande wrote The Checklist Manifesto, a short book on how not to make big mistakes. He writes a lot about his practice as a surgeon. In a hospital, everyone knows what to do. Everyone knows that you're supposed to scrub down before the surgery. Everyone knows that you're not supposed to amputate the wrong leg. Everyone knows that you're not supposed to leave sponges and other equipment in patients when you close the incision.

But mistakes are made, particularly when people are in stressful environments. The surgeon operates on the wrong leg; the sponge is left behind; and so on. Gawande found that, simply by creating checklists for basic things you shouldn't forget, these mistakes could be eliminated almost completely. Yes, there were some doctors who found the idea of checklists insultingly simple; they were the ones who continued making mistakes.

Unlike oaths, checklists connect principle to practice. Everyone knows to scrub down before the operation. That's the principle. But if you have to check a box on a form after you've done it, you're not likely to forget. That's the practice. And checklists aren't one-shots. A checklist isn’t something you read once at some initiation ceremony; a checklist is something you work through with every procedure.

What would a checklist for data science and machine learning look like? The UK Government’s Data Ethics Framework and Data Ethics Workbook are one approach. They isolate seven principles and link to detailed discussions of each. The workbook asks a number of open-ended questions to probe your compliance with these principles. Our criticism is that their process imposes a lot of overhead. While anyone going through their entire process will certainly have thought carefully about ethical issues, in practice, asking developers to fill out a workbook with substantive answers to 46 questions is an effective way to ensure that ethical thought doesn’t happen.

We believe that checklists are built around simple, “have we done this?” questions—and they are effective because they are simple. They don’t leave much room to wiggle. Either you’ve analyzed how a project can be abused, or you haven’t. You’ve built a mechanism for gathering consent, or you haven’t. Granted, it’s still possible to take shortcuts: your analysis might be inadequate and your consent mechanism might be flawed, but you’ve at least gone on record for saying that you’ve done it.

Here's a checklist for people who are working on data projects:

❏ Have we listed how this technology can be attacked or abused?

❏ Have we tested our training data to ensure it is fair and representative?

❏ Have we studied and understood possible sources of bias in our data?

❏ Does our team reflect diversity of opinions, backgrounds, and kinds of thought?

❏ What kind of user consent do we need to collect to use the data?

❏ Do we have a mechanism for gathering consent from users?

❏ Have we explained clearly what users are consenting to?

❏ Do we have a mechanism for redress if people are harmed by the results?

❏ Can we shut down this software in production if it is behaving badly?

❏ Have we tested for fairness with respect to different user groups?

❏ Have we tested for disparate error rates among different user groups?

❏ Do we test and monitor for model drift to ensure our software remains fair over time?

❏ Do we have a plan to protect and secure user data?

This checklist isn't without its problems, but it's a start; feel free to use and modify it in your projects. It covers most of the bases that we’ve seen discussed in various data oaths. Go over the checklist when starting a product so the developers know what’s needed and aren’t surprised by a new set of requirements at the last minute. Then work through it whenever you release software. Go through it, and actually check off all the boxes before your project hits the public.
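
To show how one of these items can be connected to practice, here is a minimal sketch of the disparate-error-rate check; the group labels, column names, and threshold are hypothetical:

    import pandas as pd

    def error_rates_by_group(results: pd.DataFrame) -> pd.Series:
        """Return the misclassification rate for each user group."""
        return (results["prediction"] != results["label"]).groupby(results["group"]).mean()

    # Toy evaluation results; in practice these come from your held-out test set
    results = pd.DataFrame({
        "group":      ["a", "a", "b", "b", "b"],
        "label":      [1, 0, 1, 1, 0],
        "prediction": [1, 0, 0, 1, 1],
    })

    rates = error_rates_by_group(results)
    print(rates)
    # Flag the checklist item if the gap between groups exceeds a chosen threshold
    print("Disparate error rates?", rates.max() - rates.min() > 0.1)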

Oaths and codes of conduct have their value. The value of an oath isn't the pledge itself, but the process you go through in developing the oath. People who work with data are now having discussions that would never have taken place a decade ago. But discussions don’t get the hard work done, and we need to get down to the hard work. We don't want to talk about how to use data ethically; we want to use data ethically. It's hypocritical to talk about ethics, but never do anything about it. We want to put our principles into practice. And that's what checklists will help us do.

Continue reading Of oaths and checklists.

Categories: Technology

MLflow: A platform for managing the machine learning lifecycle

O'Reilly Radar - Tue, 2018/07/17 - 04:00

An overview of the challenges MLflow tackles and a primer on how to get started.

Although machine learning (ML) can produce fantastic results, using it in practice is complex. Beyond the usual challenges in software development, machine learning developers face new challenges, including experiment management (tracking which parameters, code, and data went into a result); reproducibility (running the same code and environment later); model deployment into production; and governance (auditing models and data used throughout an organization). These workflow challenges around the ML lifecycle are often the top obstacle to using ML in production and scaling it up within an organization.

To address these challenges, many companies are starting to build internal ML platforms that can automate some of these steps. In a typical ML platform, a dedicated engineering team builds a suite of algorithms and management tools that data scientists can invoke. For example, Uber and Facebook have built Michelangelo and FBLearner Flow to manage data preparation, model training, and deployment. However, even these internal platforms are limited: typical ML platforms only support a small set of algorithms or libraries with limited customization (whatever the engineering team builds), and are tied to each company’s infrastructure.

At Spark+AI Summit 2018, my team at Databricks introduced MLflow, a new open source project to build an open ML platform. Beyond being open source, MLflow is also “open” in the sense that anyone in the organization—or in the open source community—can add new functionality, such as a training algorithm or a deployment tool, that automatically works with the rest of MLflow. MLflow offers a powerful way to simplify and scale up ML development throughout an organization by making it easy to track, reproduce, manage, and deploy models. In this post, I’ll give a short overview of the challenges MLflow tackles and a primer on how to get started.

Machine learning workflow challenges

At Databricks, we work with hundreds of companies using ML in production. Across these companies, we have repeatedly heard the same set of concerns around ML:

  • There are a myriad of disjointed tools. Hundreds of software tools cover each phase of the ML lifecycle, from data preparation to model training. Moreover, unlike traditional software development, where teams select one tool for each phase, in ML you usually want to try every available tool (e.g., algorithm) to see whether it improves results. ML developers thus need to use and productionize dozens of libraries.
  • It’s hard to track experiments. Machine learning algorithms have dozens of configurable parameters, and whether you work alone or on a team, it is difficult to track which parameters, code, and data went into each experiment to produce a model.
  • It’s hard to reproduce results. Without detailed tracking, teams often have trouble getting the same code to work again. Whether you are a data scientist passing your training code to an engineer for use in production, or you are going back to your past work to debug a problem, reproducing steps of the ML workflow is critical. We’ve heard multiple horror stories where the production version of a model did not behave like the training one, or where one team couldn’t reproduce another team’s result.
  • It’s hard to deploy ML. Moving a model to production can be challenging due to the plethora of deployment tools and environments it needs to run in (e.g., REST serving, batch inference, or mobile apps). There is no standard way to move models from any library to any of these tools, creating a new risk with each new deployment.

MLflow: An open machine learning platform

MLflow is designed to tackle these workflow challenges through a set of APIs and tools that you can use with any existing ML library and codebase. In the current alpha release, MLflow offers three main components:

  • MLflow Tracking: an API and UI for recording data about experiments, including parameters, code versions, evaluation metrics, and output files used.
  • MLflow Projects: a code packaging format for reproducible runs. By packaging your code in an MLflow Project, you can specify its dependencies and enable any other user to run it again later and reliably reproduce results.
  • MLflow Models: a simple model packaging format that lets you deploy models to many tools. For example, if you can wrap your model as a Python function, MLflow Models can deploy it to Docker or Azure ML for serving, Apache Spark for batch scoring, and more.

Figure 1. MLflow components. Image courtesy of Matei Zaharia.

MLflow is designed to be modular, so you can use each of these components on its own in your existing ML process or combine them. Let’s dive into each of these components in turn to see how to use them and how they simplify ML development.

Getting started with MLflow

MLflow is open source and easy to install using pip install mlflow. To get started with MLflow, follow the instructions in the MLflow documentation or view the code on GitHub.

MLflow Tracking

MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and output files when running your ML code to later visualize them. With a few simple lines of code, you can track parameters, metrics, and “artifacts” (arbitrary output files you want to store):

    import mlflow

    # Log parameters (key-value pairs)
    mlflow.log_param("num_dimensions", 8)
    mlflow.log_param("regularization", 0.1)

    # Log a metric; metrics can also be updated throughout the run
    mlflow.log_metric("accuracy", model.accuracy)

    # Log artifacts (output files)
    mlflow.log_artifact("roc.png")
    mlflow.log_artifact("model.pkl")
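
In practice, you typically group these calls inside a run so the parameters, metrics, and artifacts show up together in the UI. Here is a minimal sketch (the metric value is illustrative), assuming the standard mlflow.start_run() context manager:

    import mlflow

    # Group the logged values into a single tracked run
    with mlflow.start_run():
        mlflow.log_param("num_dimensions", 8)
        mlflow.log_param("regularization", 0.1)
        mlflow.log_metric("accuracy", 0.92)  # illustrative value

Running the bundled mlflow ui command from the same working directory then serves a local web UI for browsing the logged runs.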

You can use MLflow Tracking in any environment where you can run code (for example, a standalone script or a notebook) to log results to local files or to a tracking server. Using the web UI, you can then view and compare the output of multiple runs:

Figure 2. MLflow Tracking UI. Image courtesy of Matei Zaharia.

MLflow Projects

Tracking results is useful, but you often need to reproduce them as well. MLflow Projects provide a standard format for packaging reusable data science code. Each project is simply a directory of code or a Git repository, with a descriptor file, a simple YAML file called MLproject, that specifies its dependencies and how to run the code.

    name: My Project
    conda_env: conda.yaml
    entry_points:
      main:
        parameters:
          data_file: path
          regularization: {type: float, default: 0.1}
        command: "python train.py -r {regularization} {data_file}"
      validate:
        parameters:
          data_file: path
        command: "python validate.py {data_file}"

Projects can specify their dependencies through a Conda environment. A project may also have multiple entry points for invoking runs, with named parameters. You can run projects using the mlflow run command-line tool, either from local files or from a Git repository:

    mlflow run example/project -P alpha=0.5
    mlflow run git@github.com:databricks/mlflow-example.git -P alpha=0.5

MLflow will automatically set up the right environment for the project and run it. In addition, if you use the MLflow Tracking API in a Project, MLflow will remember the project version executed (that is, the Git commit) and any parameters. You can then easily rerun the exact same code. The project format thus makes it easy to share reproducible data science code, whether within your company or in the open source community.
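
If you prefer to launch runs from code rather than the shell, here is a minimal sketch; it assumes your MLflow release exposes the Python projects API (mlflow.projects.run), and the URI and parameter values simply mirror the CLI example above:

    import mlflow.projects

    # Launch the example project programmatically; MLflow sets up the
    # environment, runs the entry point, and records the Git commit it used.
    mlflow.projects.run(
        uri="git@github.com:databricks/mlflow-example.git",
        entry_point="main",
        parameters={"alpha": 0.5},
    )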

MLflow Models

MLflow’s third component is MLflow Models, a simple but powerful way to package models. While many model storage formats (such as ONNX and PMML) already exist, MLflow Models has a different goal: to represent how the model should be invoked, so that many different types of downstream deployment tools can use it. To do this, MLflow Models can store a model in multiple formats called “flavors.” These flavors can be library-specific (such as a TensorFlow graph) but can also be very generic flavors like “Python function,” which any deployment tool that understands Python can use.

Each MLflow Model is simply saved as a directory containing arbitrary files and an MLmodel YAML file that lists the flavors it can be used in. Here is a sample model exported from scikit-learn:

    time_created: 2018-02-21T13:21:34.12
    flavors:
      sklearn:
        sklearn_version: 0.19.1
        pickled_model: model.pkl
      python_function:
        loader_module: mlflow.sklearn
        pickled_model: model.pkl

MLflow provides tools to deploy many common model types to diverse platforms. For example, any model supporting the python_function flavor can be deployed to a Docker-based REST server, to cloud serving platforms such as Azure ML and AWS SageMaker, and as a user-defined function in Spark SQL for batch and streaming inference. If you output MLflow Models as artifacts using the MLflow Tracking API, MLflow will also automatically remember which Project and run they came from so you can reproduce them later.
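
As a sketch of how such a directory gets produced (the toy model and the artifact path "model" are illustrative), the library-specific helpers write both the pickled model and the MLmodel descriptor for you:

    import mlflow
    import mlflow.sklearn
    from sklearn.linear_model import LogisticRegression

    # Train a toy model, then log it as a run artifact; mlflow.sklearn writes
    # the pickled model plus an MLmodel file declaring the sklearn and
    # python_function flavors shown above.
    model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

    with mlflow.start_run():
        mlflow.sklearn.log_model(model, "model")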

Putting these tools together

While the individual components of MLflow are simple, you can combine them in powerful ways whether you work on ML alone or in a large team. For example, you can use MLflow to:

  • Record and visualize code, data, parameters, and metrics as you develop a model on your laptop.
  • Package code as MLflow Projects to run them at scale in a cloud environment for hyperparameter search.
  • Build a leaderboard to compare performance of different models for the same task inside your team.
  • Share algorithms, featurization steps, and models as MLflow Projects or Models that other users in the organization can combine into a workflow.
  • Deploy the same model to batch and real-time scoring without rewriting it for two tools.

What’s next?

We are just getting started with MLflow, so there is a lot more to come. Apart from updates to the project, we plan to introduce major new components (such as monitoring), library integrations, and language bindings. Just a few weeks ago, for example, we released MLflow 0.2 with built-in TensorFlow support and several other new features.

We’re excited to see what you can do with MLflow, and we would love to hear your feedback.

Related resources:

Continue reading MLflow: A platform for managing the machine learning lifecycle.

Categories: Technology

How to choose and deploy a service mesh

O'Reilly Radar - Tue, 2018/07/17 - 03:30

O’Reilly Media Podcast: George Miranda discusses the benefits and challenges of a service mesh, and the best ways to get started using one.

In a recent episode of the O’Reilly Media Podcast, we spoke with George Miranda about the importance of service mesh technology in creating reliable distributed systems. As discussed in the new report The Service Mesh: Resilient Service-to-Service Communication for Cloud Applications, service mesh technology has emerged as a popular tool for companies looking to build cloud-native applications that are reliable and secure.

During the podcast, we discussed the problems a service mesh infrastructure solves and the service mesh features you’ll find most valuable. We also talked about how to choose the right service mesh for your organization, the challenges involved in getting it deployed to production, and the best ways for getting started with a service mesh.

Here are some highlights from the conversation:

The rise of containers

The rise of containers has made it easy to adopt patterns that were once relegated only to the ultra-large web-scale giants of the world. There are all sorts of benefits to that modular cloud-native-driven approach. I'm not going to rehash those here, but what I see is that, oftentimes, organizations adopt these new patterns without a whole lot of forethought to how it's going to impact their applications and production. We see the benefits, but we don't always realize some of the rub that's behind them.

The fallacies of distributed computing

It turns out that your network is unreliable, it turns out that latency is not zero, transport cost is not zero, bandwidth is finite, and so on. All of those limitations need to be accounted for. But what we're seeing now is that in applications that are shifting to this microservice world, typically, those applications have never had to account for this kind of distributed nature in the past. A service mesh gives you ways to easily solve those problems without needing to change your applications.

Managing, monitoring, and controlling distributed apps

The tunables that are exposed in the service mesh’s control plane give you control that you've never had before. Things like performance-based load balancing—not just things like round robin, but load balancing schemes based on performance metrics observed in the session layer—timeouts, and retries. Retries can fall into lengthy retry loops consuming resources, creating bottlenecks, and causing secondary failures. You also get a lot of constructs to help mitigate cascading failures, create custom routing rules, set up mutual TLS, provide rich service metrics, and so on. At a high level, those are some of the basic components that you can expect in any tool calling itself a “service mesh.”

Choosing and deploying a service mesh

It's important to choose a service mesh tool that builds around an ability to be introspective. Runtime diagnostics are a really big deal in production. You have to be able to see what's happening and determine what's happening in the service mesh layer and what's actually happening in the application layer. Otherwise, there's going to be a lot of blame and sadness and tears and confusion when things go wrong. That just jeopardizes the entire production push. From the technical perspective, you should be able to select tools with really well-understood failure modes, and with great constructs around observability.

The biggest hurdle to deploying a service mesh

How do you get buy-in across your organization to support this layer? A lot of that is just a process of understanding the needs of your stakeholders and aligning with their values. William Morgan, our CEO, likes to say, "Any sufficiently advanced engineering work is indistinguishable from sales." I think that's true. You have to sell it internally. You have to educate people on the value, understand what their needs are and have a fit. And, above all, you have to be crystal clear—what real business problem is this new tool solving? Because without that kind of clarity, you're going to have a really hard time deploying to production. Because, again, this new tool will inevitably experience some kind of failure, and if you don't understand that failure well, and if you don't understand why it's strategic to your business to tolerate that kind of failure while you figure it out, that is going to be the next biggest hurdle and the biggest challenge to getting this deployed—and staying deployed—in your production environment.

This post is part of a collaboration between O'Reilly and Buoyant. See our statement of editorial independence.

Continue reading How to choose and deploy a service mesh.

Categories: Technology

Four short links: 17 July 2018

O'Reilly Radar - Tue, 2018/07/17 - 03:15

Sizing Teams, Publishing Incentives, Serverless Experience, and Configuration Languages

  1. Sizing Engineering Teams -- Teams should be six to eight during steady state. To create a new team, grow an existing team to eight to 10, and then bud into two teams of four or five. Never create empty teams. Never leave managers supporting more than eight folks.
  2. Cockygate -- in which a somewhat amusing lawsuit between Kindle Unlimited authors is pulled apart, and the incentives that drove weird behaviour are laid bare.
  3. AWS Kinesis with Lambdas: Lessons Learned -- These are our learnings from building a fully reactive serverless pipeline on AWS. See also the Hacker News comments with some other thoughtful heavy users sharing their cautionary tales.
  4. Dhall -- A configuration language guaranteed to terminate—useful for specifying Kubernetes, etc., configurations.

Continue reading Four short links: 17 July 2018.

Categories: Technology

Four short links: 16 July 2018

O'Reilly Radar - Mon, 2018/07/16 - 04:35

Automate Everything, TDD Retention, Automating Programming, and System Design

  1. Unfollowing Everybody (Anil Dash) -- Anil has a good way of dealing with overload, but that's not the only reason I list it. Note how his method requires automation. A system that can't be automated is a prison.
  2. A Longitudinal Cohort Study on the Retainment of Test-Driven Development -- The use of TDD has a statistically significant effect neither on the external quality of software products nor on the developers’ productivity. However, we observed that participants using TDD produced significantly more tests than those applying a non-TDD development process, and that the retainment of TDD is particularly noticeable in the number of tests written.
  3. What ML Means for Software Development (Lorica, Loukides) -- a subject dear to my heart. I can't wait for software development to be improved. Good software developers have always sought to automate tedious, repetitive tasks; that’s what computers are for. It should be no surprise that software development itself will increasingly be automated.
  4. Learn How to Design Large-Scale Systems -- This repo is an organized collection of resources to help you learn how to build systems at scale. It even has Anki flashcards to help you prep for the exam.

Continue reading Four short links: 16 July 2018.

Categories: Technology

Data engineering: A quick and simple definition

O'Reilly Radar - Mon, 2018/07/16 - 03:00

Get a basic overview of data engineering and then go deeper with recommended resources.

As the data space has matured, data engineering has emerged as a separate but related role that works in concert with data scientists.

Ian Buss, principal solutions architect at Cloudera, notes that data scientists focus on finding new insights from a data set, while data engineers are concerned with the production readiness of that data and all that comes with it: formats, scaling, resilience, security, and more.

Continue reading Data engineering: A quick and simple definition.

Categories: Technology
