Four short links: 24 December 2018

O'Reilly Radar - Mon, 2018/12/24 - 04:50

Learning Prolog, Data Race, Animating Photos, and Easy Flashing

  1. Solving Murder with Prolog -- if THIS had been the motivating example for Prolog, I'd have taken to it a lot sooner! I love those logic puzzle books.
  2. The Machine Learning Race is Really a Data Race (MIT Sloan Review) -- Organizations that hope to make AI a differentiator need to draw from alternative data sets—ones they may have to create themselves.
  3. Photo Wakeup: 3-D Character Animation from a Single Photo -- this is incredible work. Watch the video if nothing else.
  4. Etcher -- Flash OS images to SD cards and USB drives, safely and easily. Open source.

Categories: Technology

Four short links: 21 December 2018

O'Reilly Radar - Fri, 2018/12/21 - 06:45

Tech in China, Wisdom of Small Groups, iOS VPN, and Gameboy Supercomputer

  1. MIT TR: The China Issue -- from AI to landscaping, it's the state of big tech in China.
  2. Aggregated Knowledge From a Small Number of Debates Outperforms the Wisdom of Large Crowds -- what it says on the box. This is why I like the World Cafe Method of facilitating discussions.
  3. Wireguard for iOS -- a port of Wireguard VPN to the Apple mobile ecosystem.
  4. A Gameboy Supercomputer -- At a total of slightly over one billion frames per second, it is arguably the fastest 8-bit game console cluster in the world.

Trends in data, machine learning, and AI

O'Reilly Radar - Thu, 2018/12/20 - 06:00

The O’Reilly Data Show Podcast: Ben Lorica looks ahead at what we can expect in 2019 in the big data landscape.

For the end-of-year holiday episode of the Data Show, I turned the tables on Data Show host Ben Lorica to talk about trends in big data, machine learning, and AI, and what to look for in 2019. Lorica also showcased some highlights from our upcoming Strata Data and Artificial Intelligence conferences.

Four short links: 20 December 2018

O'Reilly Radar - Thu, 2018/12/20 - 05:10

Misinformation Research, AI UI, Facebook's Value, and Python Governance

  1. Common-Knowledge Attacks on Democracy -- We argue that scaling up computer security arguments to the level of the state, so that the entire polity is treated as an information system with associated attack surfaces and threat models, provides the best immediate way to understand these attacks and how to mitigate them. We demonstrate systematic differences between how autocracies and democracies work as information systems, because they rely on different mixes of common and contested political knowledge. Released 17 November; Bruce Schneier is co-author.
  2. Can Users Control and Understand a UI Driven by Machine Learning? -- In this article, we examine some of the challenges users encounter when interacting with machine learning algorithms on Facebook, Instagram, Google News, Netflix, and Uber Driver.
  3. Estimating the Value of Facebook by Paying Users to Stop Using It -- across all three samples, the mean bid to deactivate Facebook for a year exceeded $1,000.
  4. Python Gets a New Governance Model -- The council is imbued with "broad authority to make decisions about the project," but the goal is that it uses that authority rarely; it is meant to delegate its authority broadly. The PEP says the council should seek consensus, rather than dictate, and that it should define a standard PEP decision-making process that will (hopefully) rarely need council votes to resolve. It is, however, the "court of final appeal" for decisions affecting the language. But the council cannot change the governance PEP; that can only happen via a two-thirds vote of the core team. Python gets a constitution (aka PEP 8016).

What is neural architecture search?

O'Reilly Radar - Thu, 2018/12/20 - 05:00

An overview of NAS and a discussion on how it compares to hyperparameter optimization.

Deep learning offers the promise of bypassing the process of manual feature engineering by learning representations in conjunction with statistical models in an end-to-end fashion. However, neural network architectures themselves are typically designed by experts in a painstaking, ad hoc fashion. Neural architecture search (NAS) has been touted as the path forward for alleviating this pain by automatically identifying architectures that are superior to hand-designed ones.

But with the field moving so fast both in terms of research progress and hype, it can be hard to get answers to basic questions: What exactly is NAS and is it fundamentally different from AutoML or hyperparameter optimization? Do specialized NAS methods actually work? Aren't they prohibitively expensive to use? Should I be using specialized NAS methods? In this post, we'll answer each of these questions. Our discussion touches upon a few key points:

  • There is a false dichotomy between NAS and traditional hyperparameter optimization; in fact, NAS is a subset of hyperparameter optimization. Moreover, specialized NAS methods are not actually fully automated, as they rely on human-designed architectures as starting points.
  • While exploring and tuning different neural network architectures is of crucial importance in developing high-quality deep learning applications, in our view specialized NAS methods are not ready for primetime just yet: they introduce significant algorithmic and computational complexities compared to high-quality hyperparameter optimization algorithms (e.g., ASHA) without demonstrating improved performance on standard benchmarking tasks.
  • Specialized NAS methods have nonetheless exhibited remarkable advances in the past few years in terms of improved accuracy, reduced computational costs, and architecture size, and could eventually surpass human performance on neural architecture design.

To set the stage, let's first discuss how NAS fits within the wider umbrella of AutoML (automated machine learning).

Figure 1. AutoML ⊃ hyperparameter optimization ⊃ NAS. Image courtesy of Determined AI.

AutoML focuses on automating every aspect of the machine learning (ML) workflow to increase efficiency and democratize machine learning so that non-experts can apply machine learning to their problems with ease. While AutoML encompasses the automation of a wide range of problems associated with ETL (extract, transform, load), model training, and model deployment, the problem of hyperparameter optimization is a core focus of AutoML. This problem involves configuring the internal settings that govern the behavior of an ML model/algorithm in order to return a high-quality predictive model.

For example, ridge regression models require setting the value of a regularization term, random forest models require the user to set the maximum tree depth and minimum number of samples per leaf, and training any model with stochastic gradient descent requires setting an appropriate step size. Neural networks also require setting a multitude of hyperparameters, including (1) selecting an optimization method along with its associated set of hyperparameters; (2) setting the dropout rate and other regularization hyperparameters; and, if desired, (3) tuning parameters that control the architecture of the network (e.g., number of hidden layers, number of convolutional filters).
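
To make the sensitivity to one such hyperparameter concrete, here is a minimal sketch in plain Python. The quadratic objective and the step-size values are hypothetical, chosen only to illustrate the point:

```python
# Minimal sketch: the SGD step size is a hyperparameter that must be set
# carefully. The objective is a hypothetical quadratic, f(w) = (w - 3)^2,
# with gradient f'(w) = 2 * (w - 3).

def gradient_descent(step_size, w=0.0, steps=100):
    for _ in range(steps):
        w -= step_size * 2 * (w - 3)
    return w

# A well-chosen step size converges to the minimizer w = 3...
print(gradient_descent(0.1))             # approximately 3.0
# ...while one that is too large diverges.
print(abs(gradient_descent(1.1)) > 1e6)  # True
```

Hyperparameter optimization automates exactly this kind of trial and error across many such settings at once.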

Although the exposition on NAS might suggest it is a completely new problem, our final example above hints at a close relationship between hyperparameter optimization and NAS. While the search spaces used for NAS are generally larger and control different aspects of the neural network architecture, the underlying problem is the same as that addressed by hyperparameter optimization: find a configuration within the search space that performs well on the target task. Hence, we view NAS to be a subproblem within hyperparameter optimization.

NAS is nonetheless an exciting direction to study, as focusing on a specialized subproblem provides the opportunity to exploit additional structure to design custom tailored solutions, as is done by many specialized NAS approaches. In the next section, we will provide an overview of NAS and delve more into the similarities and differences between hyperparameter optimization and NAS.

Figure 2. NAS overview. Image courtesy of Determined AI.

Interest in NAS ballooned after Zoph et al. (2016) used reinforcement learning to design what were, at the time, state-of-the-art architectures for image recognition and language modeling. However, that work, like other first-generation specialized approaches to NAS, required a tremendous amount of computational power (hundreds of GPUs running for thousands of GPU days in aggregate), making it impractical for all but the likes of Google. More recent approaches exploit various methods of reuse to drastically reduce the computational cost, and new methods are being rapidly introduced in the research community.

We'll next dive a bit deeper into the core design decisions associated with all of these specialized NAS methods (for a detailed overview of NAS, we recommend the excellent survey by Elsken et al., 2017). The three main components are:

  1. Search space. This component describes the set of possible neural network architectures to consider. These search spaces are designed specifically for the application—e.g., a space of convolutional networks for computer vision tasks or a space of recurrent networks for language modeling tasks. Hence, NAS methods are not fully automated, as the design of these search spaces fundamentally relies on human-designed architectures as starting points. Even so, there are still many architectural decisions remaining; in fact, the number of possible architectures considered in these search spaces is often over 10^10.
  2. Optimization method. This component determines how to explore the search space in order to find a good architecture. The most basic approach here is random search, while various adaptive methods have also been introduced—e.g., reinforcement learning, evolutionary search, gradient-based optimization, and Bayesian optimization. While these adaptive approaches differ in how they determine which architectures to evaluate, they all attempt to bias the search toward architectures that are more likely to perform well. Unsurprisingly, all of these methods have counterparts that have been introduced in the context of traditional hyperparameter optimization tasks.
  3. Evaluation method. This component measures the quality of each architecture considered by the optimization method. The simplest, but most computationally expensive choice is to fully train an architecture. One can alternatively exploit partial training, similar in spirit to early-stopping methods commonly used in hyperparameter optimization like ASHA. NAS-specific evaluation methods—such as network morphism, weight-sharing, and hypernetworks—have also been introduced to exploit the structure of neural networks to provide cheaper, heuristic estimates of quality. Partial training approaches are typically an order-of-magnitude cheaper than full training, while NAS-specific evaluation methods are two to three orders of magnitude cheaper than full training.
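
The three components above can be sketched in a few lines of Python. Everything here is a toy stand-in: the search space, the search budget, and especially the evaluate function, which fakes a validation score rather than actually training a network:

```python
import random

# Toy sketch of the three NAS components. The search space, the scoring
# rule, and all numbers are hypothetical, for illustration only.

# 1. Search space: architectural choices within a human-designed skeleton.
SEARCH_SPACE = {
    "num_layers": [2, 4, 8],
    "filters": [16, 32, 64],
    "kernel_size": [3, 5],
    "skip_connections": [True, False],
}

def sample_architecture(rng):
    """2. Optimization method: random search samples uniformly at random."""
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch, budget_epochs):
    """3. Evaluation method: a stand-in for (partial) training.
    A real system would train the network for `budget_epochs` epochs and
    return validation accuracy; here we return a made-up score."""
    score = arch["num_layers"] * arch["filters"] / 100
    if arch["skip_connections"]:
        score += 1.0
    return min(score, 10.0) * (budget_epochs / 10)

def random_search(trials=20, budget_epochs=2, seed=0):
    """Run the loop: sample architectures, keep the best under the budget."""
    rng = random.Random(seed)
    return max((sample_architecture(rng) for _ in range(trials)),
               key=lambda a: evaluate(a, budget_epochs))

best = random_search()
```

Swapping in an adaptive optimization method (reinforcement learning, evolution, gradients) or a cheaper evaluation method (weight-sharing, hypernetworks) changes only the middle and last pieces; the overall loop is the same one used for traditional hyperparameter optimization.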

Notably, these are the same three requisite ingredients for traditional hyperparameter optimization methods. The research community has converged on a few canonical benchmarking data sets and tasks to evaluate the performance of different search methods, and we'll next use these benchmarks to report results on head-to-head comparisons between (1) human-designed architectures tuned via hyperparameter optimization methods, and (2) NAS-designed architectures identified via leading specialized NAS methods. (NAS focuses on the problem of identifying architectures, but nonetheless requires a secondary hyperparameter optimization step to tune the non-architecture-specific hyperparameters of the architecture it identifies. Our results show the test error after performing both steps.)

NAS models vs. human-designed models

The two most common tasks used to benchmark NAS methods are (1) designing convolutional neural network (CNN) architectures evaluated on the CIFAR-10 data set, and (2) designing recurrent neural network (RNN) architectures evaluated on the PennTree Bank (PTB) data set. We show the test error for different architectures on CIFAR-10 in the table below.

| Architecture | Source | Parameters (Millions) | Test Error | Search Method | Evaluation Method |
|---|---|---|---|---|---|
| PyramidNet + ShakeDrop | Yamada et al., 2018 | 26 | 2.31 | Human designed | - |
| NASNet-A + cutout | Zoph et al., 2017 | 3.3 | 2.65 | Reinforcement Learning | Full Train |
| AmoebaNet-B + cutout | Real et al., 2018 | 34.9 | 2.13 | Evolutionary | Full Train |
| NAONET | Luo et al., 2018 | 28.6 | 2.98 | Gradient | Partial Train |
| DARTS + cutout | H. Liu et al., 2018 | 3.4 | 2.83 | Gradient | Weight Sharing |

Table 1. Test error on CIFAR-10 for leading architectures, either human designed or found via specialized NAS methods with various search and evaluation methods. Note that all architectures were tuned via standard hyperparameter optimization methods.

For the CIFAR-10 benchmark, specialized NAS methods that use full training perform comparably to manually designed architectures; however, they are prohibitively expensive, taking more than 1,000 GPU days. Although methods that exploit partial training or other NAS-specific evaluation methods require less computation to perform the search (400 GPU days and ~1 GPU day, respectively), they are outperformed by the manually designed architecture in Table 1. Notably, several of the NAS architectures have nearly an order of magnitude fewer parameters than the human-designed model, indicating promising applications of NAS to memory- and latency-constrained settings.

The test perplexity for different architectures on the PTB data set are shown in Table 2.

| Architecture | Source | Test Perplexity | Search Method | Evaluation Method |
|---|---|---|---|---|
| LSTM with MoS | Yang et al., 2017 | 54.4 | Human designed | - |
| NASNet | Zoph et al., 2016 | 62.4 | Reinforcement Learning | Full Train |
| NAONET | Luo et al., 2018 | 56.0 | Gradient | Partial Train |
| DARTS | H. Liu et al., 2018 | 55.7 | Gradient | Weight Sharing |

Table 2. Test perplexity on PTB for leading architectures, either designed by humans or found via specialized NAS methods with various search and evaluation methods. Note that all architectures were tuned via standard hyperparameter optimization methods.

The specialized NAS results are less competitive on the PTB benchmark compared to manually designed architectures. It is surprising, however, that cheaper evaluation methods outperform full training on this benchmark; this is likely due to the additional advances that have been made in training LSTMs since the publication of Zoph et al., 2016.

Are specialized NAS methods ready for widespread adoption?

Not yet! To be clear, exploring various architectures and performing extensive hyperparameter optimization remain crucial components of any deep learning application workflow. However, in light of the existing research results (as highlighted above), we believe that while specialized NAS methods have demonstrated promising results on these two benchmarks, they are still not ready for prime time for the following reasons:

  1. Since highly tuned, manually designed architectures are competitive with computationally tractable NAS methods on CIFAR-10 and outperform specialized NAS methods on PTB, we believe resources are better spent on hyperparameter optimization of existing manually designed architectures.
  2. Most specialized NAS methods are fairly specific to a given search space and need to be retrained or retooled for each new search space. Additionally, certain approaches suffer from robustness issues and can be hard to train. These issues currently hinder the general applicability of existing specialized NAS methods to different tasks.

Four short links: 19 December 2018

O'Reilly Radar - Wed, 2018/12/19 - 08:55

Observable Notebooks, Disinformation Report, Chained Blocking, and Trivia from 2018

  1. Observable Notebooks -- JavaScript notebooks. (via Observable Notebooks and iNaturalist)
  2. Disinformation Report -- selective amplification (or pre-consumption filtering) remains one of the most interesting open challenges in infotech, and this report gives context and urgency to it. The IRA shifted a majority of its activity to Instagram in 2017; this was perhaps in response to increased scrutiny on other platforms, including media coverage of its Twitter operation. Instagram engagement outperformed Facebook. New Knowledge notes that the Russian misinformation agency was run like a digital marketing shop [...] They built their content using digital marketing best practices, even evolving page logos and typography over time. (via Renee DiResta)
  3. Twitter Block Chain -- a Chrome extension that blocks followers of the jerk, not just the jerk themselves. The power of the open web is that we can write the tools the platforms don't yet provide, however clunky. (via Hadyn Green)
  4. 52 Things I Learned in 2018 -- each comes with attribution. Three sample facts, sans attribution: (*) 35% of Rwanda’s national blood supply outside the capital city is now delivered by drone. (*) [Unicode] includes a group of ‘ghost characters’ (妛挧暃椦槞蟐袮閠駲墸壥彁) which have no known meaning. It’s believed they are errors introduced by folds and wrinkles during a paper-based 1978 Japanese government project to standardize the alphabet, but are now locked into the standard forever. (*) Cassidy Williams had a dream about a Scrabble-themed mechanical keyboard. When she woke up, she started cold-calling Hasbro to ask for permission to make it real. Eventually, she made it happen.


Deep automation in machine learning

O'Reilly Radar - Wed, 2018/12/19 - 05:00

We need to do more than automate model building with autoML; we need to automate tasks at every stage of the data pipeline.

In a previous post, we talked about applications of machine learning (ML) to software development, which included a tour through sample tools in data science and for managing data infrastructure. Since that time, Andrej Karpathy has made some more predictions about the fate of software development: he envisions a Software 2.0, in which the nature of software development has fundamentally changed. Humans no longer implement code that solves business problems; instead, they define desired behaviors and train algorithms to solve their problems. As he writes, “a neural network is a better piece of code than anything you or I can come up with in a large fraction of valuable verticals.” We won’t be writing code to optimize scheduling in a manufacturing plant; we’ll be training ML algorithms to find optimum performance based on historical data.

If humans are no longer needed to write enterprise applications, what do we do? Humans are still needed to write software, but that software is of a different type. Developers of Software 1.0 have a large body of tools to choose from: IDEs, CI/CD tools, automated testing tools, and so on. The tools for Software 2.0 are only starting to exist; one big task over the next two years is developing the IDEs for machine learning, plus other tools for data management, pipeline management, data cleaning, data provenance, and data lineage.

Karpathy’s vision is ambitious, and we don’t think enterprise software developers need to worry about their jobs any time soon. However, it is clear that the way software is developed is changing. With machine learning, the challenge isn’t writing the code; the algorithms are implemented in a number of well-known and highly optimized libraries. We don’t need to implement our own versions of long short-term memory (LSTM) or reinforcement learning; we get that from PyTorch, Ray RLlib, or some other library. However, machine learning isn’t possible without data, and our tools for working with data aren’t adequate. We have great tools for working with code: creating it, managing it, testing it, and deploying it. But they don’t address the data side, and with ML, managing the data is as important as managing the code itself. GitHub is an excellent tool for managing code, but we need to think about [code+data]. There is no GitHub for data, though we are starting to see version control projects for machine learning models, such as DVC.

It’s important to think precisely about what git does. It captures source code, and all the changes to the source code. For any codebase, it can tell you where the code came from (provenance), and all the changes that led from the original commit to the version you downloaded. It’s capable of maintaining many different branches, reflecting different custom views of the code. If someone has changed a line of code, you will see that change, and who made it. And (with some human help and pain) it can resolve conflicting changes on different branches. Those capabilities are all important for data; but good as git is for code, it isn’t adequate for data. It has trouble with data that isn’t formatted as a sequence of lines (like source code), has problems with binary data, and it chokes on huge files. And it is ill-suited for tracking transformations that change every item in a data set, such as a matrix multiplication or normalization.
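
One way tools in this space sidestep git's limitations is to commit a small pointer file rather than the data itself. Here is a minimal sketch of that idea in Python; the manifest format is invented for illustration and is not any particular tool's actual format:

```python
import hashlib
import json

# Sketch of content-addressed data versioning, in the spirit of tools like
# DVC: keep the bulk data in external storage keyed by its hash, and track
# only a small, diff-able pointer file in version control.

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def make_pointer(path: str, data: bytes) -> str:
    """Return a tiny JSON 'pointer file' that git can track easily."""
    return json.dumps(
        {"path": path, "sha256": content_hash(data), "size": len(data)},
        sort_keys=True,
    )

raw = b"station,temp_c\nA,21.4\nB,19.8\n"
pointer = make_pointer("data/readings.csv", raw)
# The pointer changes if and only if the data changes -- that is what
# gets committed, branched, and diffed.
```

This captures provenance of a file's contents, but note it still does not record the transformations that produced those contents, which is the harder half of the problem.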

We also need better tools for collecting data. Given all the talk about the explosion of data, it’s ironic that most of the data that's exploding falls on the floor and is never captured. Data management isn’t limited to issues like provenance and lineage; one of the most important things you can do with data is collect it. Given the rate at which data is created, data collection has to be automated. How do you do that without dropping data? Given that the results produced by any model will reflect the data used to create the model, how do you ensure your data collection process is fair, representative, and unbiased?

Toward a sustainable ML practice

In our forthcoming report Evolving Data Infrastructure, one aspect we studied was what European organizations were doing to build a sustainable machine learning practice: not a proof of concept or a one-time cool idea to be dropped when the next technical fad comes along, but a permanent part of the organization’s plans. It’s one thing to kick the tires briefly; it’s something else entirely to deeply build the infrastructure needed to integrate machine learning into your organization.

Building a sustainable practice means investing in the tools that allow you to work effectively over the long term. These tools enable you to build software you can rely on, not just one-off proof-of-concept hacks. They include basics like ETL (extract, transform, and load: extracting data from multiple sources, transforming it into a form that’s useful, and loading it into a datastore for analysis). It’s no surprise that companies are investing in data science platforms to run machine learning at scale, just as they invested in Hadoop a decade ago. And given that most of the work of a data scientist is cleaning the data prior to analysis, it’s no surprise that most companies are investing in tools for data preparation. These are tools we would have expected to see on the list five years ago as companies started building their data science practices.

We also see investment in new kinds of tools. Anomaly detection is well-known in the financial industry, where it’s frequently used to detect fraudulent transactions, but it can also be used to catch and fix data quality issues automatically. This isn’t surprising; if you’re collecting data from several weather stations and one of them malfunctions, you would expect to see anomalous data. A faulty weather station might stop reporting data (which might be turned into zeros, infinities, or nulls in your data stream), or it might just send readings that are a few degrees above what’s expected, or that are out of line with other stations in the area. In any case, there will be an anomaly in the input data, and it will be easier for a machine to detect that anomaly than a human. If you suddenly see unexpected patterns in your social data, that may mean adversaries are attempting to poison your data sources. Anomaly detection may have originated in finance, but it is becoming a part of every data scientist’s toolkit.
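
A minimal version of the weather-station check can be written with a robust statistic such as the median absolute deviation, which isn't skewed by the faulty reading itself. The readings and threshold below are hypothetical:

```python
from statistics import median

# Sketch of anomaly detection for data quality: flag a station whose
# reading is far from the others, using median absolute deviation (MAD)
# so the outlier does not distort the baseline it is compared against.

def anomalous_stations(readings, threshold=5.0):
    """readings: {station_id: value}. Flags values > threshold * MAD
    away from the median."""
    values = list(readings.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [sid for sid, v in readings.items()
            if abs(v - med) > threshold * mad]

readings = {"A": 18.2, "B": 18.9, "C": 17.8, "D": 18.5, "E": 42.0}
print(anomalous_stations(readings))  # -> ['E']
```

A production system would use richer models (seasonality, cross-station correlations), but the shape is the same: a machine watches the stream and flags what a human would never have time to inspect.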

Metadata analysis makes it possible to build data catalogs, which in turn allow humans to discover data that’s relevant to their projects. Democratizing access to data is a major step in the process of becoming a data-driven (or an AI-driven) company; users must be empowered to explore data and to create their own projects. That is difficult without some kind of data catalog. You can tell users they have access to all the data they need, and give them access to databases, but unless they know what data is available and how to find it, that access doesn’t mean anything. Creating that catalog by hand isn’t possible; it needs to be automated.

Data lineage

The history of data analysis has been plagued with a cavalier attitude toward data sources. That is ending; discussions of data ethics have made data scientists aware of the importance of data lineage and provenance. Both refer to the source of the data: where does the data come from, how was it gathered, and how was it modified along the way? Data provenance is increasingly a legal issue; it’s clearly important to know where data came from and how it was obtained. It’s particularly important when you’re combining data from multiple sources; we’ve often observed that data is most powerful when several sources are combined. Provenance can get very complex, particularly when results generated from one set of data are further combined with other data.

It’s important to be able to trace data lineage at a granular level, to understand the entire path from the source to the application. Data is modified all the time: it’s often been observed that most of the work in data science is cleanup or preparation. Data cleaning involves modifying the data: eliminating rows that have missing or illegal values, for example. We’re beginning to understand exactly how important it is to understand what happened during that cleanup, how data evolved from its raw state: that can be a source of error and bias. As companies ingest and use more data, and as the number of consumers of that data increases, it’s important to know the data is trustworthy. When data is modified, it’s important to track exactly how and when it was modified.
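
A toy illustration of granular lineage tracking: wrap the data so every transformation is recorded next to the result it produced. The TrackedData API here is invented for illustration; real systems record comparable graphs internally:

```python
# Sketch of granular data lineage: each transformation appends an entry
# (name, resulting row count) to a lineage log carried with the data.

class TrackedData:
    def __init__(self, rows, lineage=None):
        self.rows = rows
        self.lineage = lineage or [("source", len(rows))]

    def transform(self, name, fn):
        """Apply fn to the rows and record the step in the lineage."""
        new_rows = fn(self.rows)
        return TrackedData(new_rows, self.lineage + [(name, len(new_rows))])

raw = TrackedData([{"age": 34}, {"age": None}, {"age": 29}])
clean = (raw
         .transform("drop_missing",
                    lambda rs: [r for r in rs if r["age"] is not None])
         .transform("add_decade",
                    lambda rs: [{**r, "decade": r["age"] // 10} for r in rs]))
print(clean.lineage)
# -> [('source', 3), ('drop_missing', 2), ('add_decade', 2)]
```

The lineage log answers exactly the questions raised above: which rows were dropped during cleanup, and which derived fields were added, between the raw state and the data a model actually sees.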

The tools for tracking data provenance and lineage are limited, although products from commercial vendors such as Trifacta are starting to appear. Git and its predecessors (SVN and even RCS) can track every change to every line of code in software, maintain multiple branches of the code, and reconcile differences between branches. How do we do that for data? Furthermore, what will we do with the results? It's common to normalize data, or to transform it in some way, but such transformations can easily change every byte in the data set.

Not only do such changes pose problems, but tools like git force humans to supply explanatory comments when they commit a new version to explain why any change was made. That's not possible with an automated data pipeline. It might be possible for systems to log and "explain" the changes they make, but this assumes you have fine-grained control to force them to do so.

Such control may be possible within the scope of a single tool. For example, Jacek Laskowski describes how to extract a resilient distributed data set (RDD) lineage graph that describes a series of Spark transformations. This graph could be committed to a lineage tracking system, or even a more traditional version-control system, to document transformations that have been applied to the data. But this process only applies to a single machine learning platform: Spark. To be generally useful, every platform would need to support extracting a lineage graph, preferably in a single format and without requiring additional scripting by developers. That's a good vision for where we need to go, but we're not there yet.

Data provenance and lineage isn’t just about the quality of the results; it’s a security and compliance issue. At the Strata Data Conference in New York in 2017, danah boyd argued that social media systems were intentionally poisoned by tools that propagated low-quality content designed to sway the algorithms that determined what people watch. Malicious agents have learned to “hack the attention economy.” In "Flat Light: Data Protection for the Disoriented, from Policy to Practice," Andrew Burt and Daniel Geer argue that in the past, data accuracy was binary; data was either correct or incorrect. Now, data provenance is as important as correctness, if not more so. You can’t judge whether data is reliable if you don’t know its origin. For machine learning systems, this means we need to track source data as well as source code: the data used to train the system is as important to its behavior as the algorithms and their implementation.

We are starting to see some tools that automate data quality issues. Intuit uses the Circuit Breaker pattern to halt data pipelines when they detect anomalies in the data. Their tool tracks data lineage because it’s important to understand the inputs and outputs of every stage of the pipeline; it also tracks the status of the pipeline components themselves and the quality of the data at every stage of the pipeline (is it within expected bounds, is it of the appropriate type, etc.). Intuit, Netflix, and Stitchfix have built data lineage systems that track the origin and evolution of the data that they use in their systems.
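
The Circuit Breaker idea can be sketched as a pipeline runner that refuses to pass data downstream once a quality check fails. The stage names and checks below are hypothetical, not Intuit's implementation:

```python
# Sketch of the Circuit Breaker pattern for data pipelines: every stage
# pairs a transform with a quality check; a failed check halts the
# pipeline before bad data propagates to later stages.

class PipelineHalted(Exception):
    pass

def run_pipeline(data, stages):
    """stages: list of (name, transform, quality_check) tuples."""
    for name, transform, check in stages:
        data = transform(data)
        if not check(data):
            # Trip the breaker: stop rather than emit suspect data.
            raise PipelineHalted(f"quality check failed after stage {name!r}")
    return data

stages = [
    ("parse", lambda rows: [float(x) for x in rows], lambda d: len(d) > 0),
    ("bound", lambda d: d, lambda d: all(-60 <= t <= 60 for t in d)),
]
print(run_pipeline(["18.2", "19.0"], stages))  # -> [18.2, 19.0]
```

Feeding the same pipeline an out-of-range reading (say, "500") trips the breaker at the bounds check instead of letting the bad value flow into training data.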

Automation is more than model building

In the past year, we have seen several companies build tools to “automate machine learning,” including Google and Amazon. These tools automate the process of building models: trying different algorithms and topologies, to minimize error when the model is used on test data. But these tools just build models, and we’ve seen that machine learning requires much more. The model can’t exist without tools for data integration and ETL, data preparation, data cleaning, anomaly detection, data governance, and more. Automating model building is just one component of automating machine learning.

To be truly useful, automated machine learning has to go much deeper than model building. It’s too simple to think a machine learning project will require a single model; one project can easily require several different models, doing different things. And different aspects of the business, while superficially similar, can require different models, trained from different data sources. Consider a hotel business such as Marriott: more than 6,000 hotels, and more than $20 billion in gross revenue. Any hotel would like to predict occupancy, income, and the services they need to provide. But each hotel is a completely different business: The Times Square Marriott is dominated by large corporate conferences and New York City tourism, while the Fairfield Inn in Sebastopol is dominated by local events and wine country tourism. The customer demographics are different; but more than that, the event sources are different. The Sebastopol hotel needs to know about local weddings and wine country events; I’d expect them to use natural language processing to parse feeds from local newspapers. The Times Square hotel needs to know about Broadway openings, Yankees games, and Metro-North train schedules. This isn’t just a different model; these two businesses require completely different data pipelines. Automating the model building process is helpful, but it doesn’t go far enough.

Hotels aren’t the only business requiring more models than humans can conceivably build. Salesforce provides AI services for its clients, which number in the hundreds of thousands. Each client needs a custom model; models can’t be shared, even between clients in similar businesses. Aside from confidentiality issues, no two clients have the same customers or the same data, and small differences between clients can add up to large errors. Even with the most optimistic estimates for machine learning talent, there aren’t enough people to build that many models by hand. Salesforce’s solution is TransmogrifAI, an open source automated ML library for structured data. TransmogrifAI automates the model building process, like other Auto ML solutions, but it also automates many other tasks, including data preparation and feature validation.

Other enterprise software vendors are in the same boat: they have many customers, each of whom requires “custom models.” They cannot hire enough data scientists to support all of these customers with conventional manual workflows. Automation isn’t an option; it’s a necessity.

Automation doesn’t stop when the model is “finished”; in any real-world application, the model can never be considered “finished.” Any model’s performance will degrade over time: situations change, people change, products change, and the model may even play a role in driving that change. We expect to see new tools for automating model testing, either alerting developers when a model needs to be re-trained or starting the training process automatically. And we need to go even further: beyond simple issues of model accuracy, we need to test for fairness and ethics. Those tests can’t be automated completely, but tools can be developed to help domain experts and data scientists detect problems of fairness. For example, such a tool might generate an alert when it detects a potential problem, like a significantly higher loan rejection rate from a protected group; it might also provide tools to help a human expert analyze the problem and make a correction.
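Two of the checks described above can be reduced to a few lines of monitoring logic. The thresholds below are illustrative assumptions, not recommendations, and the rejection-rate gap is only one crude proxy for the fairness problems a human expert would need to analyze:

```python
# Sketch of automated model monitoring: flag a model for retraining when
# recent accuracy degrades below a baseline, and raise a fairness alert
# when one group's rejection rate drifts far above the overall rate.

def needs_retraining(recent_accuracy, baseline_accuracy, tolerance=0.05):
    """True when accuracy on recent traffic drops below baseline - tolerance."""
    return recent_accuracy < baseline_accuracy - tolerance

def fairness_alert(decisions, max_gap=0.10):
    """decisions: list of (group, approved) pairs, approved being 0 or 1.
    True when some group's rejection rate exceeds the overall rate by max_gap."""
    approvals = [approved for _, approved in decisions]
    overall_reject = 1 - sum(approvals) / len(approvals)
    by_group = {}
    for group, approved in decisions:
        by_group.setdefault(group, []).append(approved)
    for outcomes in by_group.values():
        reject = 1 - sum(outcomes) / len(outcomes)
        if reject - overall_reject > max_gap:
            return True
    return False
```

An alert from either check is a starting point for a human, not a verdict: the retraining signal kicks off a pipeline, while the fairness signal hands the case to a domain expert.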

Closing thoughts

The way we build software is changing. Whether or not we get to Karpathy’s Software 2.0, we’re certainly on a road headed in that direction. The future holds more machine learning, not less; developing and maintaining models will be part of the job of building software. Software developers will be spending less time writing code and more time training models.

However, the lack of data—and of tools for working with data—remains a fundamental bottleneck. Over the past 50 years, we’ve developed excellent tools for working with software. We now need to build the tools for software+data: tools to track data provenance and lineage, tools to build catalogs from metadata, tools to do fundamental operations like ETL. Companies are investing in these foundational technologies.

The next bottleneck will be model building itself; the number of models we need will always be much greater than the number of people capable of building those models by hand. Again, the solution is building tools for automating the process. We need to do more than automate model building with autoML; we also need to automate feature engineering, data preparation, and other tasks at every stage of the data pipeline. Software developers are, after all, in the business of automation. And the most important thing for software developers to automate is their own work.
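As one illustration of automating a step beyond model building, a feature-engineering pass can be as mechanical as generating derived columns and filtering them against the target. This is a toy sketch assuming small in-memory numeric columns; the 0.5 correlation cutoff is an arbitrary choice for illustration:

```python
# Sketch of automated feature engineering: mechanically generate derived
# features (squares and pairwise products), then keep only those whose
# correlation with the target clears a cutoff.

from itertools import combinations

def correlation(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def generate_features(columns):
    """columns: dict of name -> list of values. Adds squares and products."""
    out = dict(columns)
    for name, vals in columns.items():
        out[f"{name}^2"] = [v * v for v in vals]
    for (n1, v1), (n2, v2) in combinations(columns.items(), 2):
        out[f"{n1}*{n2}"] = [a * b for a, b in zip(v1, v2)]
    return out

def select_features(features, target, min_corr=0.5):
    """Keep features that correlate with the target at |r| >= min_corr."""
    return [name for name, vals in features.items()
            if abs(correlation(vals, target)) >= min_corr]
```

Real systems use far better filters than raw correlation, but the point stands: generation and selection are mechanical enough to automate.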

Related content

Continue reading Deep automation in machine learning.

Categories: Technology

10 top AWS resources on O’Reilly’s online learning platform

O'Reilly Radar - Wed, 2018/12/19 - 04:00

Our most-used AWS resources will help you stay on track in your journey to learn and apply AWS.

We dove into the data on our online learning platform to identify the most-used Amazon Web Services (AWS) resources. These are the items our platform subscribers regularly turn to as they apply AWS in their projects and organizations.

AWS Certified Solutions Architect Official Study Guide — This official study guide, written by AWS experts, covers exam concepts and provides key review on exam topics.

Amazon Web Services AWS: AWS Fundamentals — Richard Jones walks you through six hours of video instruction on AWS, covering cloud computing and available AWS services, with a guided hands-on look at using services such as EC2 (Elastic Compute Cloud), S3 (Simple Storage Service), and more.

AWS System Administration — Federico Lucifredi and Mike Ryan show developers and system administrators how to configure and manage AWS services, including EC2, CloudFormation, Elastic Load Balancing, S3, and Route 53.

Serverless Architectures on AWS: With Examples Using AWS Lambda — Peter Sbarski teaches you how to build, secure, and manage serverless architectures that can power the most demanding web and mobile apps.

Learning AWS, 2nd Edition — Amit Shah and Aurobindo Sarkar help you become well-versed with the various services that AWS provides and show you how to use AWS infrastructure to accelerate the development process.

Automation in AWS with CloudFormation, CLI, and SDKs — Richard Jones covers AWS services and tools used to automate the creation and maintenance of AWS infrastructure, including VPC, EC2, Lambda, RDS, and deploying containerized microservices with Docker.

AWS: Security Best Practices on AWS — Albert Anthony focuses on using native AWS security features and managed AWS services to help you achieve continuous security.

Amazon Web Services in Action — Michael Wittig and Andreas Wittig introduce you to computing, storing, and networking in the AWS cloud.

AWS CloudFormation Master Class — Stéphane Maarek teaches you to write complete AWS CloudFormation templates using YAML and covers all the recent CloudFormation features.

AWS Lambda in Action: Event-Driven Serverless Applications — Danilo Poccia offers an example-driven tutorial that teaches you how to build applications that use an event-driven approach on the back end.

Continue reading 10 top AWS resources on O’Reilly’s online learning platform.

Categories: Technology

Four short links: 18 December 2018

O'Reilly Radar - Tue, 2018/12/18 - 05:05

Singing AI, Content Signing, Data Rights, and Query Processing

  1. AI Voices -- marketing copy, but I can't find technical detail. The demos are worth checking out. The sprint to automated pop music generation has begun. Not just limited to Japanese, as it is also capable of producing convincing Mandarin and even English voices for songs such as Adele’s "Rolling in the Deep" and Britney Spears’ "Everytime" on their official website.
  2. Notary -- publishers can sign their content offline using keys kept highly secure. Once the publisher is ready to make the content available, they can push their signed trusted collection to a Notary server. Consumers, having acquired the publisher's public key through a secure channel, can then communicate with any Notary server or (insecure) mirror, relying only on the publisher's key to determine the validity and integrity of the received content.
  3. It's Time for a Bill of Data Rights (MIT TR) -- this essay argues that “data ownership” is a flawed, counterproductive way of thinking about data. It not only does not fix existing problems, it creates new ones. Instead, we need a framework that gives people rights to stipulate how their data is used without requiring them to take ownership of it themselves. (via Cory Doctorow)
  4. Trill -- a single-node query processor for temporal or streaming data: open source from Microsoft. Described in this blog post.

Continue reading Four short links: 18 December 2018.

Categories: Technology

Four short links: 17 December 2018

O'Reilly Radar - Mon, 2018/12/17 - 05:10

Open Source Licensing, Computer History, Serverless, and Wicked Problems

  1. Open Source Confronts Its Midlife Crisis (Bryan Cantrill) -- To be clear, the underlying problem is not the licensing, it’s that these companies don’t know how to make money—they want open source to be its own business model, and seeing that the cloud service providers have an entirely viable business model, they want a piece of the action. Also see Bryan's followup: A EULA in FOSS Clothing: You will notice that this looks nothing like any traditional source-based license—but it is exactly the kind of boilerplate that you find on EULAs, terms-of-service agreements, and other contracts that are being rammed down your throat.
  2. A Computer of One's Own -- fantastic precis of the work of significant women in computing history.
  3. Serverlessness (Tim Bray) -- Tim works in AWS's Serverless group and has been collecting what he's learned in his years building serverless infrastructure.
  4. Why We Suck at Solving Wicked Problems -- this rings true with my experience.

Continue reading Four short links: 17 December 2018.

Categories: Technology

Four short links: 14 December 2018

O'Reilly Radar - Fri, 2018/12/14 - 02:00

Satellite LoRaWAN, Bret Victor, State of AI, and Immutable Documentation

  1. Fleet -- launched satellites as backhaul for LoRaWAN base station traffic.
  2. Computing is Everywhere -- podcast episode with Bret Victor. Lots of interesting history and context to what he's up to at Dynamicland. (via Paul Ford)
  3. AI Index 2018 Report (Stanford) -- think of it as the Mary Meeker report for AI.
  4. Etsy's Experiment with Immutable Documentation -- In trying to overcome the problem of staleness, the crucial observation is that how-docs typically change faster than why-docs do. Therefore the more how-docs are mixed in with why-docs in a doc page, the more likely the page is to go stale. We’ve leveraged this observation by creating an entirely separate system to hold our how-docs.

Continue reading Four short links: 14 December 2018.

Categories: Technology

Four short links: 13 December 2018

O'Reilly Radar - Thu, 2018/12/13 - 02:00

CS Ethics, Insect IoT, Glitch Showcase, and SQL Repos

  1. Embedded Ethics -- Harvard project that integrates ethics modules into courses across the standard computer science curriculum. Those modules are straightforward, online, and open access.
  2. Living IOT: A Flying Wireless Platform on Live Insects -- We develop and deploy our platform on bumblebees which includes backscatter communication, low-power self-localization hardware, sensors, and a power source. We show that our platform is capable of sensing, backscattering data at 1 kbps when the insects are back at the hive, and localizing itself up to distances of 80 m from the access points, all within a total weight budget of 102 mg. (via BoingBoing)
  3. Looky What We Made -- showcase of Glitch apps.
  4. Git Your SQL Together -- why I recommend tracking SQL queries in git: 1. You will *always* need that query again. 2. Queries are living artifacts that change over time. 3. If it’s useful to you, it’s useful to others (and vice versa)

Continue reading Four short links: 13 December 2018.

Categories: Technology

Four short links: 12 December 2018

O'Reilly Radar - Wed, 2018/12/12 - 05:00

Render as Comic, Notebook to Production, Population Visualization, and Location Privacy

  1. Comixify -- render video as comics.
  2. How to Grow Neat Software Architecture out of Jupyter Notebooks -- everyone's coding in notebooks as a sweet step up from the basic one-command REPL loop. Here's some good advice on how to grow these projects without creating a spaghetti monster.
  3. City 3D -- This project wields data from the Global Human Settlement Layer, which uses “satellite imagery, census data, and volunteered geographic information” to create population density maps. Best visualization I've seen in a very long time.
  4. Your Apps Know Where You Were Last Night, and They're Not Keeping It Secret (NY Times) -- At least 75 companies receive anonymous, precise location data from apps whose users enable location services to get local news and weather or other information. They claim 200M mobile devices, with updates as often as every six seconds. These companies sell, use, or analyze the data to cater to advertisers, retail outlets, and even hedge funds seeking insights into consumer behavior. [...] An app may tell users that granting access to their location will help them get traffic information, but not mention that the data will be shared and sold. That disclosure is often buried in a vague privacy policy.

Continue reading Four short links: 12 December 2018.

Categories: Technology

Four short links: 11 December 2018

O'Reilly Radar - Tue, 2018/12/11 - 12:25

Can We Stop?, Everything Breaks, Edge Cloud, and Molly Guard

  1. The Seductive Diversion of Solving Bias in Artificial Intelligence -- provocative title, but the point is that the preoccupation with narrow computational puzzles distracts us from the far more important issue of the colossal asymmetry between societal cost and private gain in the rollout of automated systems. It also denies us the possibility of asking: should we be building these systems at all? The expected value of pursuing this line of thinking is pretty low because there's a vanishingly small probability that we can coordinate activity globally to prevent something bad from happening. Exhibit A: climate change.
  2. Everything Breaks (Michael Lopp) -- Humans will greatly benefit from a clear explanation of the rules of the game. The rules need to evolve in unexpected ways to account for the arrival of more humans. The only way to effectively learn what is going to break is to keep playing...and learning. See also lessons learned from scaling Stripe's engineering team.
  3. Terrarium (Fastly) -- an interesting glimpse at a possible future for web apps, where your CDN (which you need to have anyway if you're publishing anything remotely contentious or interesting) blurs with your hosting infrastructure provider. Terrarium is a multi-language deployment platform based on WebAssembly. Think of it as a playground for experimenting with edge-side WebAssembly. Being one of the first Fastly Labs projects, you can also think of it as our way of publicly experimenting with what the future of real highly performant edge computing could look like.
  4. molly-guard -- protects machines from accidental shutdowns/reboots. Etymology of the name: originally a Plexiglas cover improvised for the Big Red Switch on an IBM 4341 mainframe after a programmer's toddler daughter (named Molly) tripped it twice in one day. Later generalized to covers over stop/reset switches on disk drives and networking equipment. (via Mike Forbes)

Continue reading Four short links: 11 December 2018.

Categories: Technology

Four short links: 10 December 2018

O'Reilly Radar - Mon, 2018/12/10 - 04:55

Language Zoo, VS AI, Advertising Plus, and Minecraft Scripting

  1. The Programming Languages Zoo -- a collection of miniature programming languages that demonstrates various concepts and techniques used in programming language design and implementation.
  2. AI in Visual Studio Code -- good to see IDEs getting AI-powered features to augment coders. In some small way, Doug Engelbart would be proud.
  3. Outgrowing Advertising: Multimodal Business Models as a Product Strategy -- business models from Chinese companies that are augmenting advertising with other revenue streams.
  4. Minecraft Scripting API in Public Beta -- The Minecraft Script Engine uses the JavaScript language. Scripts can be written and bundled with Behaviour Packs to listen and respond to game events, get (and modify) data in components that entities have, and affect different parts of the game.

Continue reading Four short links: 10 December 2018.

Categories: Technology

Four short links: 7 December 2018

O'Reilly Radar - Fri, 2018/12/07 - 04:45

Broken Feedback, Fake AI, Teaching with Jupyter, and Multiplayer Code UI

  1. Why Ratings and Feedback Forms Don't Work (The Atlantic) -- Negative feedback is actually good feedback because it yields greater efficiency and performance. [...] Positive feedback, by contrast, causes the system to keep going, unchecked. Like a thermostat that registers the room as too warm and cranks up the furnace, it’s generally meant to be avoided. But today’s understanding of feedback has reversed those terms.
  2. How to Recognize Fake AI-Generated Images -- worth remembering that researchers are in a war with these kinds of heuristics because if "straight hair looks like paint," then a researcher can get a paper out of fixing that.
  3. Teaching and Learning with Jupyter -- an open book about Jupyter and its use in teaching and learning.
  4. Multiplayer -- code with friends in the same editor, execute programs in the same interpreter, interact with the same terminal, chat in the IDE, edit files and share the same system resources, and ship applications from the same interface.

Continue reading Four short links: 7 December 2018.

Categories: Technology

Assessing progress in automation technologies

O'Reilly Radar - Thu, 2018/12/06 - 10:09

When it comes to automation of existing tasks and workflows, you need not adopt an “all or nothing” attitude.

In this post, I share slides and notes from a keynote Roger Chen and I gave at the Artificial Intelligence conference in London in October 2018. We presented an overview of the state of automation technologies: we tried to highlight the state of the key building block technologies and we described how these tools might evolve in the near future.

To assess the state of adoption of machine learning (ML) and AI, we recently conducted a survey that garnered more than 11,000 respondents. As I pointed out in previous posts, we learned many companies are still in the early stages of deploying machine learning:

Companies cite “lack of data” and “lack of skilled people” as the main factors holding back adoption. In many instances, “lack of data” is literally the state of affairs: companies have yet to collect and store the data needed to train the ML models they desire. The “skills gap” is real and persistent. Developers have taken heed of this growth in demand. In our own online learning platform, we are seeing strong growth in usage of content across AI topics, including 77% growth in consumption of content pertaining to deep learning.

We are also seeing strong growth in interest in new tools and topics such as PyTorch and reinforcement learning. In the case of reinforcement learning, new tools like Ray are already spurring companies to examine alternative solutions to multi-step decision problems, where models might be hard to build using supervised learning.

Decision-makers also are investing in AI and automation technologies. A recent survey of close to 4,000 IT leaders across 84 countries found that more companies are starting to invest in AI and automation technologies:

  • The level of investment depends on the company. Companies that already consider themselves digital leaders tend to report a much higher level of investment in AI and automation.
  • Location also matters. Given the highly competitive business environment in China, it’s no surprise that companies there also tend to invest at a much higher rate. This aligns with a recent overview on AI in China delivered by Kai-Fu Lee at our AI conference in San Francisco this past September.

Progress in AI technologies has been fueled by the growth in data and improvements in compute and models. Let’s briefly examine each of these elements.

Deep learning models

Resurgence in deep learning began in 2011/2012 with record-setting models for speech recognition and computer vision. When I first began following deep learning in 2013, the community was small and tight-knit. Best practices were passed through internships in a few groups, and a lot of knowledge was shared in the form of “oral tradition.” Today, the community is much larger.

Progress in research has been made possible by the steady improvement in: (1) data sets, (2) hardware and software tools, and (3) a culture of sharing and openness through conferences and websites like arXiv. Novices and non-experts have also benefited from easy-to-use, open source libraries for machine learning.

These open source ML libraries have leveled the playing field and have made it possible for non-expert developers to build interesting applications. In fact, in 2017 we featured a couple of talented teenagers (Kavya Kopparapu and Abu Qader) at our AI conferences. They both were self-taught, and both were able to build potentially high-impact prototypes involving deep learning.

Companies have taken notice and want to build ML and AI into their systems and products. In 2015, LinkedIn ran a study and found that the U.S. had a national surplus of people with data science skills. That’s no longer the case today:

  • Demand in key metro areas in the U.S. is extremely high.
  • Cutting-edge skills like AI and machine learning will likely spread to other industries and geographies in the future.

Data

With that said, having great models isn’t sufficient. At least for now, many of the models we rely on—including deep learning and reinforcement learning—are data hungry. Since they have the potential to scale to many, many users, the largest companies in the largest countries have an advantage over the rest of us. China, in particular, has been dubbed “the Saudi Arabia of data.” Because AI research depends on having access to large data sets, we’re already seeing more cutting-edge research coming out of the large U.S. and Chinese companies. NIPS used to be a sleepy academic conference. Now it sells out within minutes, and we’re seeing more papers coming from large U.S. and Chinese companies.

The good news is that there are new tools that might help the rest of us gain access to more data. Services for generating labeled data sets are increasingly using AI technologies. The ones that rely on human labelers are beginning to use machine learning tools to help their human workers scale, improve their accuracy, and make training data more affordable. In certain domains, new tools like GANs and simulation platforms are able to provide realistic synthetic data that can be used to train machine learning models.

In addition to data generation, another important aspect is data sharing. There are also new startups building open source tools to improve data liquidity. These startups are using tools like cryptography, blockchains, and secure communication to build data networks that enable organizations to share data securely.


Compute

Machine learning researchers are constantly exploring new algorithms. In the case of deep learning, this usually means trying new neural network architectures, refining parameters, or exploring new optimization techniques. As Turing Award winner David Patterson describes it, “The appetite for training is unlimited!”

The challenge is that experiments can take a long time to complete: hours, days, or even weeks. Computation also can cost a lot of money. This means researchers cannot casually run such long and complex experiments, even if they had the patience to wait for them to finish.

We are in year seven of this renewed interest in AI and deep learning. At this stage, companies know the types of computations involved and they are beginning to see enough demand to justify building specialized hardware to accelerate those computations. Hardware companies, including our partner Intel, continue to release suites of hardware products for AI (including compute, memory, host bandwidth, and I/O bandwidth). The demand is so great that other companies—including ones that aren’t known for processors—are beginning to jump into the fray.

More help is on the way. We see a lot of new companies working on specialized hardware. You have hardware for the data center, where the task of training large models using large data sets usually takes place. We are also entering an age where billions of edge devices will be expected to perform inference tasks, like image recognition. Hardware for these edge devices needs to be energy efficient and reasonably priced.

Numerous hardware startups are targeting deep learning both in China and in the U.S. The San Francisco Bay Area, in particular, is a hotbed for experienced hardware engineers and entrepreneurs, many of whom are working on AI-related startups, and many of these startups are targeting edge devices.

Closing thoughts

We’ve talked about data, models, and compute mainly in the context of traditional performance measures: namely, optimizing machine learning or even business metrics. The reality is that there are many other considerations. For example, in certain domains (including health and finance) systems need to be explainable. Other aspects including fairness, privacy and security, and reliability and safety are also all important considerations as ML and AI get deployed more widely. This is a real concern for companies. In a recent survey, we found strong awareness and concern over these issues on the part of data scientists and data engineers.

Consider reliability and safety. While we can start building computer vision applications today, we need to remember that they can be brittle. In certain domains, we will need to understand safety implications and we will need to prioritize reliability over efficiency gains provided by automation. The founders of Mobileye described it best: the main parameter in the race for autonomous cars cannot be who will have the first car on the road.

Developing safe, explainable, fair, and secure AI applications will happen in stages. When it comes to automation of existing tasks and workflows, you need not adopt an “all or nothing” attitude. Many of these technologies can already be used for basic and partial automation of workflows.

Related content:

Continue reading Assessing progress in automation technologies.

Categories: Technology

Tools for generating deep neural networks with efficient network architectures

O'Reilly Radar - Thu, 2018/12/06 - 06:05

The O’Reilly Data Show Podcast: Alex Wong on building human-in-the-loop automation solutions for enterprise machine learning.

In this episode of the Data Show, I spoke with Alex Wong, associate professor at the University of Waterloo, and co-founder of DarwinAI, a startup that uses AI to address foundational challenges with deep learning in the enterprise. As the use of machine learning and analytics become more widespread, we’re beginning to see tools that enable data scientists and data engineers to scale and tackle many more problems and maintain more systems. This includes automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection, and hyperparameter tuning, as well as tools for data engineering and data operations.

Wong and his collaborators are building solutions for enterprises, including tools for generating efficient neural networks and for the performance analysis of networks deployed to edge devices.

Continue reading Tools for generating deep neural networks with efficient network architectures.

Categories: Technology

Four short links: 6 December 2018

O'Reilly Radar - Thu, 2018/12/06 - 05:15

Public Domain, Optimistic Sci-Fi, C64 Defrag, and Quantum Computing

  1. Re-Opening of the Public Domain (Creative Commons) -- after years of legal extension of copyright terms, 2019 will be the first year in which new materials fall into the American public domain, and Creative Commons is throwing a bash at the Internet Archive.
  2. Better Worlds (The Verge) -- starting on January 14th, we’ll be publishing Better Worlds: 10 original fiction stories, five animated adaptations, and five audio adaptations by a diverse roster of science fiction authors who take a more optimistic view of what lies ahead in ways both large and small, fantastical and everyday. Necessary! I heard a great interview with Tyler Cowen where he said, "you cannot live with pessimism, right? There’s also a notion that more optimism is a partially self-fulfilling prophecy. Believing pessimistic views might make them more likely to come about." It is a fallacy to conflate optimism with naivete.
  3. A Disk Defragmenter for the Commodore 64 -- I don't know what's more insane: watching a great 40x25 homage to the classic Windows defrag progress screen or reading the bonkers BASIC code behind it.
  4. Quantum Computing Progress and Prospects -- an introduction to the field, including the unique characteristics and constraints of the technology, and assesses the feasibility and implications of creating a functional quantum computer capable of addressing real-world problems. This report considers hardware and software requirements, quantum algorithms, drivers of advances in quantum computing and quantum devices, benchmarks associated with relevant use cases, the time and resources required, and how to assess the probability of success. Separate the hype from the reality and develop a sense of the probability of different possible evolutionary paths for the technology.

Continue reading Four short links: 6 December 2018.

Categories: Technology

Distributed systems: A quick and simple definition

O'Reilly Radar - Thu, 2018/12/06 - 04:00

Get a basic understanding of distributed systems and then go deeper with recommended resources.

The technology landscape has evolved into an always-on environment of mobile, social, and cloud applications where programs can be accessed and used across a multitude of devices.

These always-on and always-available expectations are handled by distributed systems, which manage the inevitable fluctuations and failures of complex computing behind the scenes.

“The increasing criticality of these systems means that it is necessary for these online systems to be built for redundancy, fault tolerance, and high availability,” writes Brendan Burns, distinguished engineer at Microsoft, in Designing Distributed Systems. “The confluence of these requirements has led to an order of magnitude increase in the number of distributed systems that need to be built.”

In Distributed Systems in One Lesson, developer relations leader and teacher Tim Berglund says a simple way to think about distributed systems is that they are a collection of independent computers that appears to its user as a single computer.

Virtually all modern software and applications built today are distributed systems of some sort, says Sam Newman, director at Sam Newman & Associates and author of Building Microservices. Even a monolithic application talking to a database is a distributed system, he says, “just a very simple one.”

While those simple systems can technically be considered distributed, when engineers refer to distributed systems they’re typically talking about massively complex systems made up of many moving parts communicating with one another, with all of it appearing to an end-user as a single product, says Nora Jones, a senior software engineer at Netflix.

Think anything from, well, Netflix, to an online store like Amazon, to an instant messaging platform like WhatsApp, to a customer relationship management application like Salesforce, to Google’s search application. These systems require login functionality, user profiles, recommendation engines, personalization, relational databases, object databases, content delivery networks, and numerous other components, all served up cohesively to the user.

Benefits of distributed systems

These days, it’s not so much a question of why a team would use a distributed system, but rather when they should shift in that direction and how distributed the system needs to be, experts say. 

Here are three inflection points—the need for scale, a more reliable system, and a more powerful system—when a technology team might consider using a distributed system.

Horizontal Scalability

Computing processes across a distributed system happen independently from one another, notes Berglund in Distributed Systems in One Lesson. This makes it easy to add nodes and functionality as needed. Distributed systems offer “the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago,” write Chad Carson, cofounder of Pepperdata, and Sean Suchter, director of Istio at Google, in Effective Multi-Tenant Distributed Systems.
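The mechanics of adding nodes cheaply can be illustrated with a consistent-hash ring, one common way work is routed to machines by key. This is a bare sketch (a single hash point per node, no virtual nodes or replication), not how any particular production system implements it:

```python
# Minimal consistent-hash ring: keys are routed to the first node
# clockwise from their hash, so adding a node only remaps the keys
# that fall on the new node's arc of the ring.

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self._ring, (self._hash(node), node))

    def node_for(self, key):
        """Route a key to the first node at or clockwise from its hash."""
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

The scaling property is that when a node joins, every existing key either stays where it was or moves to the new node; nothing else reshuffles.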


Reliability

Distributed systems create a reliable experience for end users because they rely on “hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer,” write Carson and Suchter. In a single-machine environment, if that machine fails, then so too does the entire system. When computation is spread across numerous machines, a failure at one node doesn’t take the whole system down, writes Cindy Sridharan, distributed systems engineer, in Distributed Systems Observability.
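One simple way this "failure at one node doesn't take the system down" behavior shows up in practice is failover across replicas. The sketch below is hypothetical (the replica functions simulate remote services), but it illustrates the idea: the caller only sees an error if every replica fails.

```python
def call_with_failover(replicas, request):
    """Try each replica in turn; a single node failure stays
    invisible to the caller as long as one replica still answers."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            last_error = exc  # this node is down; try the next one
    raise last_error  # only reached if every replica failed

# Simulated replicas: the first is down, the second is healthy.
def dead_node(req):
    raise ConnectionError("node unreachable")

def healthy_node(req):
    return f"ok: {req}"

result = call_with_failover([dead_node, healthy_node], "get user:1")
# → "ok: get user:1"
```

Real systems layer retries, timeouts, and health checks on top of this, but the core contract is the same: redundancy turns a node failure into a non-event for the user.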


Performance

In Designing Distributed Systems, Brendan Burns notes that a distributed system can handle tasks efficiently because workloads and requests are broken into pieces and spread over multiple computers. The work is completed in parallel, and the results are returned and compiled back at a central location.
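The split-work-then-compile pattern Burns describes is often called scatter-gather. A minimal sketch, with local threads standing in for separate machines (the chunking and `word_count` function are illustrative, not from Burns's book):

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(chunk: str) -> int:
    return len(chunk.split())

# The workload is broken into independent pieces.
chunks = ["the quick brown fox", "jumps over", "the lazy dog"]

# Scatter: each chunk is processed in parallel (threads here stand
# in for separate machines in a real cluster).
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(word_count, chunks))

# Gather: partial results are compiled back at a central point.
total = sum(partials)  # 4 + 2 + 3 = 9
```

The same shape, scaled up across machines, is the core of frameworks like MapReduce.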

The challenges of distributed systems

While the benefits of creating distributed systems can be great for scaling and reliability, distributed systems also introduce complexity when it comes to design, construction, and debugging. Presently, most distributed systems are one-off bespoke solutions, writes Burns in Designing Distributed Systems, making them difficult to troubleshoot when problems do arise.

Here are three of the most common challenges presented by distributed systems.


Scheduling

Because the workloads and jobs in a distributed system do not happen sequentially, there must be prioritization, note Carson and Suchter in Effective Multi-Tenant Distributed Systems:

One of the primary challenges in a distributed system is in scheduling jobs and their component processes. Computing power might be quite large, but it is always finite, and the distributed system must decide which jobs should be scheduled to run where and when, and the relative priority of those jobs. Even sophisticated distributed system schedulers have limitations that can lead to underutilization of cluster hardware, unpredictable job run times, or both.

Take Amazon, for example. Amazon technology teams need to understand which aspects of the online store need to be called upon first to create a smooth user experience. Should the search bar be called before the navigation bar? Think of the many ways both small and large that Amazon makes online shopping as useful as possible for its users.
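The prioritization step Carson and Suchter describe can be reduced to a toy sketch. The class and job names below are hypothetical; real cluster schedulers (YARN, Kubernetes) also weigh resource availability, fairness, and data locality, which is exactly where the underutilization and unpredictability they warn about creep in.

```python
import heapq

class Scheduler:
    """Minimal priority scheduler: lower number = higher priority."""

    def __init__(self):
        self._queue = []
        self._counter = 0  # tie-breaker keeps FIFO order within a priority

    def submit(self, priority: int, job: str) -> None:
        heapq.heappush(self._queue, (priority, self._counter, job))
        self._counter += 1

    def next_job(self) -> str:
        return heapq.heappop(self._queue)[2]

sched = Scheduler()
sched.submit(2, "nightly-batch-report")
sched.submit(0, "render-search-bar")      # user-facing work runs first
sched.submit(1, "refresh-recommendations")
order = [sched.next_job() for _ in range(3)]
# order == ["render-search-bar", "refresh-recommendations", "nightly-batch-report"]
```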


Latency

With such a complex interchange between hardware, software calls, and communication between those pieces over networks, latency can become a problem for users.

“The more widely distributed your system, the more latency between the constituents of your system becomes an issue,” says Newman. “As the volume of calls over the networks increases, the more you’ll start to see transient partitions and potentially have to deal with them.”

Over time, this can lead to technology teams needing to make tradeoffs around availability, consistency, and latency, Newman says.
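One common shape that availability/consistency/latency tradeoff takes is a deadline with a stale fallback: if the authoritative cross-network read misses its latency budget, serve a cached (possibly out-of-date) value instead. A minimal sketch, assuming a simulated slow service and an in-memory cache (both hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

CACHE = {"user:1": "cached (possibly stale) profile"}
pool = ThreadPoolExecutor(max_workers=2)

def authoritative_read(key: str) -> str:
    time.sleep(0.5)  # simulate a slow cross-network call
    return "fresh profile"

def read_with_deadline(key: str, deadline_s: float) -> str:
    """Trade consistency for latency: if the fresh read misses the
    deadline, return a possibly stale cached value instead."""
    future = pool.submit(authoritative_read, key)
    try:
        return future.result(timeout=deadline_s)
    except TimeoutError:
        return CACHE[key]  # stale but fast

value = read_with_deadline("user:1", deadline_s=0.05)
# → "cached (possibly stale) profile"
```

With a generous deadline the same call returns the fresh value; the tradeoff lives entirely in how tight the deadline is.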

Performance monitoring and observability

Failure is inevitable in distributed systems, says Jones. How a technology team manages and plans for failure, so a customer hardly notices it, is key. As distributed systems become more complex, gaining observability into the technology stack to understand those failures is an enormous challenge.

Carson and Suchter illustrate this challenge in Effective Multi-Tenant Distributed Systems:

Truly useful monitoring for multi-tenant distributed systems must track hardware usage metrics at a sufficient level of granularity for each interesting process on each node. Gathering, processing, and presenting this data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable fashion) and the presentation-level logic and math (to present it usefully and accurately). Even for limited, node-level metrics, traditional monitoring systems do not scale well on large clusters of hundreds to thousands of nodes.

There are several approaches companies can use to detect those failure points, such as distributed tracing, chaos engineering, incident reviews, and understanding expectations of upstream and downstream dependencies. “There’s a lot of different tactics to achieve high quality and robustness, and they all fit into the category of having as much insight into the system as possible,” Jones says.
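Distributed tracing, the first of those tactics, rests on one idea: every downstream call carries the same trace identifier, so one user request can be reassembled across services afterward. The sketch below is a toy (real systems use standards like OpenTelemetry, and the service names are invented), but it shows the propagation mechanics.

```python
import time
import uuid

TRACE_LOG = []  # stand-in for a tracing backend

def traced(service: str, trace_id: str, work):
    """Run `work` and record a span: which service did what, for
    how long, under which trace."""
    start = time.time()
    result = work()
    TRACE_LOG.append({
        "trace_id": trace_id,
        "service": service,
        "duration_s": round(time.time() - start, 4),
    })
    return result

def checkout(trace_id):
    # Downstream calls carry the same trace_id, so the whole
    # request can be stitched back together later.
    traced("inventory", trace_id, lambda: "reserved")
    traced("payments", trace_id, lambda: "charged")
    return "ok"

trace_id = uuid.uuid4().hex
traced("checkout", trace_id, lambda: checkout(trace_id))
spans = [s["service"] for s in TRACE_LOG if s["trace_id"] == trace_id]
# spans == ["inventory", "payments", "checkout"]
```

Child spans finish (and are recorded) before the parent, which is why "checkout" appears last; tracing backends use these parent/child relationships to draw the familiar waterfall view of a request.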

Learn more

Ready to go deeper into distributed systems? Check out these recommended resources from O’Reilly’s editors.

Distributed Systems Observability — Cindy Sridharan provides an overview of monitoring challenges and trade-offs that will help you choose the best observability strategy for your distributed system.

Designing Distributed Systems — Brendan Burns demonstrates how you can adapt existing software design patterns for designing and building reliable distributed applications.

The Distributed Systems Video Collection — This 12-video collection dives into best practices and the future of distributed systems.

Effective Multi-Tenant Distributed Systems — Chad Carson and Sean Suchter outline the performance challenges of running multi-tenant distributed computing environments, especially within a Hadoop context.

Distributed Systems in One Lesson — Using a series of examples taken from a fictional coffee shop business, Tim Berglund helps you explore five key areas of distributed systems.

Chaos Engineering — This report introduces you to Chaos Engineering, a method of experimenting on infrastructure that lets you expose weaknesses before they become problems.

Designing Data-Intensive Applications — Martin Kleppmann examines the pros and cons of various technologies for processing and storing data.

Continue reading Distributed systems: A quick and simple definition.

Categories: Technology
