Feed aggregator

Four short links: 31 December 2018

O'Reilly Radar - Mon, 2018/12/31 - 05:55

Schema Crawler, Open Source Bug Bounties, Essential C, and AI Poker

  1. SchemaCrawler -- Free database schema discovery and comprehension tool. Make sense of the databases you inherit.
  2. EU To Fund Bug Bounties for Open Source Projects (ZD Net) -- this is good, but insufficient. See Katie Moussouris.
  3. Essential C -- a sweet little summary of C, an even terser K&R.
  4. AI, Game Theory, and Poker (YouTube) -- a talk by Tuomas Sandholm, CMU professor and co-creator of Libratus, which is the first AI system to beat top human players at the game of Heads-Up No-Limit Texas Hold'em. From the AI Podcast.

Continue reading Four short links: 31 December 2018.

Categories: Technology

0x5F: Was 2018 the Year of Non-FOSS Licensing?

FAIF - Mon, 2018/12/31 - 05:34

Show Notes

Segment 0 (00:36)

Bradley and Karen discuss ideas for what to do with the oggcast going forward.

Segment 2 (07:49)

Categories: Free Software

Four short links: 28 December 2018

O'Reilly Radar - Fri, 2018/12/28 - 05:55

Bayes Notes, Fake Internet, Tensorflow Privacy, Sortable UUIDs

  1. Updating: A Set of Bayesian Notes -- Notes on Bayesian methods - written to supplement CS&SS/STAT 564: Bayesian Statistics for the Social Sciences.
  2. How Much of the Internet is Fake? (NY Mag) -- What’s gone from the internet, after all, isn’t “truth,” but trust: the sense that the people and things we encounter are what they represent themselves to be.
  3. TensorFlow Privacy -- Library for training machine learning models with privacy for training data.
  4. Universally Unique Lexicographically Sortable Identifiers -- 128-bit compatibility with UUID; 1.21e+24 unique ULIDs per millisecond; Lexicographically sortable!; Canonically encoded as a 26 character string, as opposed to the 36 character UUID; Uses Crockford's base32 for better efficiency and readability (5 bits per character); Case insensitive; No special characters (URL safe); Monotonic sort order (correctly detects and handles the same millisecond).
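The encoding described in item 4 can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the 48-bit timestamp plus 80-bit randomness split is taken from the ULID specification rather than from the excerpt above, the function names `encode_ulid` and `new_ulid` are made up for this example, and the same-millisecond monotonic handling is left out.

```python
import os
import time

# Crockford's base32 alphabet: digits, then letters without I, L, O, and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    """Encode a 48-bit millisecond timestamp and 80 random bits as 26 characters."""
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")
    # Timestamp goes in the high bits so that string order follows time order.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 characters x 5 bits covers the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid() -> str:
    """Build an identifier from the current time and OS-provided randomness."""
    return encode_ulid(int(time.time() * 1000), os.urandom(10))
```

Because the alphabet is in ascending ASCII order and the timestamp occupies the leading characters, plain string comparison sorts identifiers by creation time at millisecond granularity.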

Continue reading Four short links: 28 December 2018.

Categories: Technology

Four short links: 27 December 2018

O'Reilly Radar - Thu, 2018/12/27 - 05:45

Reading Minds, Year Gotchas, LSTM Conversation, and Fast Scanning

  1. Reading Rats' Minds (MIT) -- In recent years, scientists have shown that by recording the electrical activity of groups of neurons in key areas of the brain, they could read a rat’s thoughts of where it was, both after it actually ran the maze and also later when it would dream of running the maze in its sleep—a key process in consolidating its memory. In the new study, several of the scientists involved in pioneering such mind-reading methods now report they can read out those signals in real time as the rat runs the maze, with a high degree of accuracy and the ability to account for the statistical relevance of the readings almost instantly after they are made. [...] The software of the system is open source and available for fellow neuroscientists to download and use freely, Chen and Wilson say. Rats not included. The paper is open access, too.
  2. yyyy and YYYY: Why Your Year May Be Wrong (Erica Sadun) -- The presence of YYYY in the date format without its expected supporting information reduces to “start of year, go back one week, report the first day.” (I’ll explain this more in just a little bit.)
  3. Conversation with Juergen Schmidhuber -- the co-creator of long short-term memory networks (LSTMs) that are used in billions of devices today for speech recognition, translation, and much more. ... The history of science is the history of compression progress. Metalearning, self-referential programs, and more. It's a dry discussion of fiery ideas. (via hardmaru)
  4. Scanning 250 Pages/Minute -- Our system continuously observes 3D deformation of each flipped page at 500 times per second and recognizes the best moment for book image digitization. The video is hypnotic. (via Reza Zadeh)
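Item 2 above turns on the difference between a calendar year and a week-based year. As a rough illustration in Python, rather than in the date-format patterns the linked post discusses, the sketch below compares `date.year` with the ISO 8601 week-based year returned by `date.isocalendar()`. The helper name `calendar_and_week_year` is made up for this example, and the ISO 8601 week rules are an assumption; locale-specific week rules can give different answers.

```python
from datetime import date
from typing import Tuple


def calendar_and_week_year(d: date) -> Tuple[int, int]:
    """Return (calendar year, ISO 8601 week-based year) for a date."""
    # isocalendar() yields (ISO year, ISO week number, ISO weekday).
    iso_year = d.isocalendar()[0]
    return d.year, iso_year


# The two values disagree only for a few days around New Year.
for day in (date(2018, 6, 15), date(2018, 12, 31), date(2016, 1, 1)):
    print(day.isoformat(), calendar_and_week_year(day))
```

Monday, 31 December 2018 already belongs to the first ISO week of 2019, so a format that prints the week-based year reports 2019 for it.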

Continue reading Four short links: 27 December 2018.

Categories: Technology

Four short links: 26 December 2018

O'Reilly Radar - Wed, 2018/12/26 - 07:40

Evil FizzBuzz, Atari OS, Logic Guide, and Artificial Life

  1. Evil FizzBuzz (Jason Gorman) -- a really clever CI exercise for a team.
  2. EmuTOS -- open source reimplementation of the original Atari ST operating system. (via Hacker News)
  3. Teach Yourself Logic: A Study Guide -- a wonderfully chatty book that functions as an introduction to logic for mathematicians and philosophers.
  4. Lenia: Biology of Artificial Life -- a new model of artificial life called Lenia (from Latin lenis "smooth"), a two-dimensional cellular automaton with continuous space-time-state and generalized local rule. Computer simulations show that Lenia supports a great diversity of complex autonomous patterns or "lifeforms" bearing resemblance to real-world microscopic organisms. More than 400 species in 18 families have been identified, many discovered via interactive evolutionary computation. They differ from other cellular automata patterns in being geometric, metameric, fuzzy, resilient, adaptive, and rule-generic. Implementation with source.

Continue reading Four short links: 26 December 2018.

Categories: Technology

Four short links: 25 December 2018

O'Reilly Radar - Tue, 2018/12/25 - 04:00

Hardware Testing is Hard, Biological Keygen, Christmas Robots, and Open Data

  1. Maxclave (Bunnie Huang) -- you thought software testing was hard? Welcome to the world of hardware testing.
  2. Biological One-Way Functions for Secure Key Generation -- It is demonstrated that the spatiotemporal dynamics of an ensemble of living organisms such as T cells can be used for maximum entropy, high‐density, and high‐speed key generation.
  3. Christmas Robot Roundup (IEEE) -- selection of holiday greetings from various robots and robotics companies. I for one welcome our new tinsel-and-holly-clad industrial apparatus overlords.
  4. Congress Votes to Make Open Government Data the Default in the United States -- The Open, Public, Electronic, and Necessary Government Data Act (AKA the OPEN Government Data Act) is about to become law [...]. This codifies two canonical principles for democracy in the 21st century: 1. public information should be open by default to the public in a machine-readable format, where such publication doesn’t harm privacy or security. 2. federal agencies should use evidence when they make public policy. Merry Christmas, democracy; here's a small present in a bad year.

Continue reading Four short links: 25 December 2018.

Categories: Technology

Four short links: 24 December 2018

O'Reilly Radar - Mon, 2018/12/24 - 04:50

Learning Prolog, Data Race, Animating Photos, and Easy Flashing

  1. Solving Murder with Prolog -- if THIS was the motivating example for Prolog, I'd have taken to it a lot sooner! I love those logic puzzle books.
  2. The Machine Learning Race is Really a Data Race (MIT Sloan Review) -- Organizations that hope to make AI a differentiator need to draw from alternative data sets—ones they may have to create themselves.
  3. Photo Wakeup: 3-D Character Animation from a Single Photo -- this is incredible work. Watch the video if nothing else.
  4. Etcher -- Flash OS images to SD cards and USB drives, safely and easily. Open source.

Continue reading Four short links: 24 December 2018.

Categories: Technology

Four short links: 21 December 2018

O'Reilly Radar - Fri, 2018/12/21 - 06:45

Tech in China, Wisdom of Small Groups, iOS VPN, and Gameboy Supercomputer

  1. MIT TR: The China Issue -- from AI to landscaping, it's the state of big tech in China.
  2. Aggregated Knowledge From a Small Number of Debates Outperforms the Wisdom of Large Crowds -- what it says on the box. This is why I like the World Cafe Method of facilitating discussions.
  3. Wireguard for iOS -- a port of Wireguard VPN to the Apple mobile ecosystem.
  4. A Gameboy Supercomputer -- At a total of slightly over one billion frames per second, it is arguably the fastest 8-bit game console cluster in the world.

Continue reading Four short links: 21 December 2018.

Categories: Technology

Trends in data, machine learning, and AI

O'Reilly Radar - Thu, 2018/12/20 - 06:00

The O’Reilly Data Show Podcast: Ben Lorica looks ahead at what we can expect in 2019 in the big data landscape.

For the end-of-year holiday episode of the Data Show, I turned the tables on Data Show host Ben Lorica to talk about trends in big data, machine learning, and AI, and what to look for in 2019. Lorica also showcased some highlights from our upcoming Strata Data and Artificial Intelligence conferences.

Continue reading Trends in data, machine learning, and AI.

Categories: Technology

Four short links: 20 December 2018

O'Reilly Radar - Thu, 2018/12/20 - 05:10

Misinformation Research, AI UI, Facebook's Value, and Python Governance

  1. Common-Knowledge Attacks on Democracy -- We argue that scaling up computer security arguments to the level of the state, so that the entire polity is treated as an information system with associated attack surfaces and threat models, provides the best immediate way to understand these attacks and how to mitigate them. We demonstrate systematic differences between how autocracies and democracies work as information systems, because they rely on different mixes of common and contested political knowledge. Released 17 November; Bruce Schneier is co-author.
  2. Can Users Control and Understand a UI Driven by Machine Learning? -- In this article, we examine some of the challenges users encounter when interacting with machine learning algorithms on Facebook, Instagram, Google News, Netflix, and Uber Driver.
  3. Estimating the Value of Facebook by Paying Users to Stop Using It -- across all three samples, the mean bid to deactivate Facebook for a year exceeded $1,000.
  4. Python Gets a New Governance Model -- The council is imbued with "broad authority to make decisions about the project," but the goal is that it uses that authority rarely; it is meant to delegate its authority broadly. The PEP says the council should seek consensus, rather than dictate, and that it should define a standard PEP decision-making process that will (hopefully) rarely need council votes to resolve. It is, however, the "court of final appeal" for decisions affecting the language. But the council cannot change the governance PEP; that can only happen via a two-thirds vote of the core team. Python gets a constitution (aka PEP 8016).

Continue reading Four short links: 20 December 2018.

Categories: Technology

What is neural architecture search?

O'Reilly Radar - Thu, 2018/12/20 - 05:00

An overview of NAS and a discussion on how it compares to hyperparameter optimization.

Deep learning offers the promise of bypassing the process of manual feature engineering by learning representations in conjunction with statistical models in an end-to-end fashion. However, neural network architectures themselves are typically designed by experts in a painstaking, ad hoc fashion. Neural architecture search (NAS) has been touted as the path forward for alleviating this pain by automatically identifying architectures that are superior to hand-designed ones.

But with the field moving so fast both in terms of research progress and hype, it can be hard to get answers to basic questions: What exactly is NAS and is it fundamentally different from AutoML or hyperparameter optimization? Do specialized NAS methods actually work? Aren't they prohibitively expensive to use? Should I be using specialized NAS methods? In this post, we'll answer each of these questions. Our discussion touches upon a few key points:

  • There is a false dichotomy between NAS and traditional hyperparameter optimization; in fact, NAS is a subset of hyperparameter optimization. Moreover, specialized NAS methods are not actually fully automated, as they rely on human-designed architectures as starting points.
  • While exploring and tuning different neural network architectures is of crucial importance in developing high-quality deep learning applications, in our view specialized NAS methods are not ready for primetime just yet: they introduce significant algorithmic and computational complexities compared to high-quality hyperparameter optimization algorithms (e.g., ASHA) without demonstrating improved performance on standard benchmarking tasks.
  • Specialized NAS methods have nonetheless exhibited remarkable advances in the past few years in terms of improved accuracy, reduced computational costs, and architecture size, and could eventually surpass human performance on neural architecture design.

To set the stage, let's first discuss how NAS fits within the wider umbrella of AutoML (automated machine learning).

AutoML ⊃ hyperparameter optimization ⊃ NAS Figure 1. Image courtesy of Determined AI.

AutoML focuses on automating every aspect of the machine learning (ML) workflow to increase efficiency and democratize machine learning so that non-experts can apply machine learning to their problems with ease. While AutoML encompasses the automation of a wide range of problems associated with ETL (extract, transform, load), model training, and model deployment, the problem of hyperparameter optimization is a core focus of AutoML. This problem involves configuring the internal settings that govern the behavior of an ML model/algorithm in order to return a high-quality predictive model.

For example, ridge regression models require setting the value of a regularization term, random forest models require the user to set the maximum tree depth and minimum number of samples per leaf, and training any model with stochastic gradient descent requires setting an appropriate step size. Neural networks also require setting a multitude of hyperparameters, including (1) selecting an optimization method along with its associated set of hyperparameters; (2) setting the dropout rate and other regularization hyperparameters; and, if desired, (3) tuning parameters that control the architecture of the network (e.g., number of hidden layers, number of convolutional filters).

Although the exposition on NAS might suggest it is a completely new problem, our final example above hints at a close relationship between hyperparameter optimization and NAS. While the search spaces used for NAS are generally larger and control different aspects of the neural network architecture, the underlying problem is the same as that addressed by hyperparameter optimization: find a configuration within the search space that performs well on the target task. Hence, we view NAS as a subproblem within hyperparameter optimization.

NAS is nonetheless an exciting direction to study, as focusing on a specialized subproblem provides the opportunity to exploit additional structure to design custom tailored solutions, as is done by many specialized NAS approaches. In the next section, we will provide an overview of NAS and delve more into the similarities and differences between hyperparameter optimization and NAS.

NAS overview Figure 2. Image courtesy of Determined AI.

Interest in NAS ballooned after Zoph et al., 2016 used reinforcement learning to design architectures that were, at the time, state of the art for image recognition and language modeling. However, Zoph et al., 2016 and other first-generation specialized NAS approaches required a tremendous amount of computational power (e.g., hundreds of GPUs running for thousands (!) of GPU days in aggregate), making them impractical for all but companies like Google. More recent approaches exploit various methods of reuse to drastically reduce the computational cost, and new methods are being rapidly introduced in the research community.

We'll next dive a bit deeper into the core design decisions associated with all of these specialized NAS methods (for a detailed overview of NAS, we recommend the excellent survey by Elsken, et al., 2017). The three main components are:

  1. Search space. This component describes the set of possible neural network architectures to consider. These search spaces are designed specifically for the application (e.g., a space of convolutional networks for computer vision tasks or a space of recurrent networks for language modeling tasks). Hence, NAS methods are not fully automated, as the design of these search spaces fundamentally relies on human-designed architectures as starting points. Even so, there are still many architectural decisions remaining. In fact, the number of possible architectures considered in these search spaces is often over 10^10.
  2. Optimization method. This component determines how to explore the search space in order to find a good architecture. The most basic approach here is random search, while various adaptive methods have also been introduced—e.g., reinforcement learning, evolutionary search, gradient-based optimization, and Bayesian optimization. While these adaptive approaches differ in how they determine which architectures to evaluate, they all attempt to bias the search toward architectures that are more likely to perform well. Unsurprisingly, all of these methods have counterparts that have been introduced in the context of traditional hyperparameter optimization tasks.
  3. Evaluation method. This component measures the quality of each architecture considered by the optimization method. The simplest, but most computationally expensive, choice is to fully train an architecture. One can alternatively exploit partial training, similar in spirit to early-stopping methods commonly used in hyperparameter optimization, such as ASHA. NAS-specific evaluation methods (such as network morphism, weight sharing, and hypernetworks) have also been introduced to exploit the structure of neural networks to provide cheaper, heuristic estimates of quality. Partial training approaches are typically an order of magnitude cheaper than full training, while NAS-specific evaluation methods are two to three orders of magnitude cheaper than full training.
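The three components above can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the search space is tiny and hypothetical, `partial_train` is a made-up formula standing in for a real training run, and the loop is plain synchronous successive halving (the idea that ASHA builds on), not ASHA itself or any method from the papers cited here.

```python
import random

# Search space (hypothetical): each key is one architectural decision.
SEARCH_SPACE = {
    "num_layers": [2, 4, 8, 16],
    "num_filters": [16, 32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
}


def sample_architecture(rng):
    """Optimization method: uniform random sampling from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def partial_train(arch, epochs):
    """Evaluation method: toy validation error after `epochs` epochs.

    A made-up formula, not a real training run: error shrinks with more
    epochs and is lowest near 8 layers, 64 filters, and 0.25 dropout.
    """
    capacity_gap = abs(arch["num_layers"] - 8) / 8 + abs(arch["num_filters"] - 64) / 64
    regularization_gap = abs(arch["dropout"] - 0.25)
    return capacity_gap + regularization_gap + 1.0 / epochs


def successive_halving(num_candidates=16, min_epochs=1, eta=2, seed=0):
    """Train all candidates briefly, keep the best 1/eta, and repeat with more epochs."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(num_candidates)]
    epochs = min_epochs
    total_epochs = 0
    while len(candidates) > 1:
        total_epochs += len(candidates) * epochs
        ranked = sorted(candidates, key=lambda arch: partial_train(arch, epochs))
        candidates = ranked[: max(1, len(ranked) // eta)]
        epochs *= eta
    return candidates[0], total_epochs


best, cost = successive_halving()
print(best, cost)
```

With the default settings the search spends 64 candidate-epochs in total, compared with 128 for training all 16 candidates for 8 epochs each, which is the kind of saving partial training is meant to buy.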

Notably, these are the same three requisite ingredients for traditional hyperparameter optimization methods. The research community has converged on a few canonical benchmarking data sets and tasks to evaluate the performance of different search methods, and we'll next use these benchmarks to report results on head-to-head comparisons between (1) human-designed architectures tuned via hyperparameter optimization methods, and (2) NAS-designed architectures identified via leading specialized NAS methods. (NAS focuses on the problem of identifying architectures, but nonetheless requires a secondary hyperparameter optimization step to tune the non-architecture-specific hyperparameters of the architecture it identifies. Our results show the test error after performing both steps.)

NAS models vs. human-designed models

The two most common tasks used to benchmark NAS methods are (1) designing convolutional neural network (CNN) architectures evaluated on the CIFAR-10 data set, and (2) designing recurrent neural network (RNN) architectures evaluated on the Penn Treebank (PTB) data set. We show the test error for different architectures on CIFAR-10 in the table below.

Architecture            Source               Parameters (Millions)  Test Error (%)  Search Method           Evaluation Method
PyramidNet + ShakeDrop  Yamada et al., 2018  26                     2.31            Human designed          -
NASNet-A + cutout       Zoph et al., 2017    3.3                    2.65            Reinforcement Learning  Full Train
AmoebaNet-B + cutout    Real et al., 2018    34.9                   2.13            Evolutionary            Full Train
NAONET                  Luo et al., 2018     28.6                   2.98            Gradient                Partial Train
DARTS + cutout          H. Liu et al., 2018  3.4                    2.83            Gradient                Weight Sharing

Table 1. Test error on CIFAR-10 for leading architectures either designed by humans or via specialized NAS methods with various search and evaluation methods. Note that all architectures were tuned via standard hyperparameter optimization methods.

For the CIFAR-10 benchmark, specialized NAS methods that use full training perform comparably to manually designed architectures; however, they are prohibitively expensive, taking more than 1,000 GPU days. Although methods that exploit partial training or other NAS-specific evaluation methods require less computation to perform the search (400 GPU days and ~1 GPU day, respectively), they are outperformed by the manually designed architecture in Table 1. Notably, two of the NAS architectures (NASNet-A and DARTS) have nearly an order of magnitude fewer parameters than the human-designed model, indicating promising applications of NAS to memory- and latency-constrained settings.

The test perplexity for different architectures on the PTB data set is shown in Table 2.

Architecture   Source               Test Perplexity  Search Method           Evaluation Method
LSTM with MoS  Yang et al., 2017    54.4             Human designed          -
NASNet         Zoph et al., 2016    62.4             Reinforcement Learning  Full Train
NAONET         Luo et al., 2018     56.0             Gradient                Partial Train
DARTS          H. Liu et al., 2018  55.7             Gradient                Weight Sharing

Table 2. Test perplexity on PTB for leading architectures either designed by humans or via specialized NAS methods with various search and evaluation methods. Note that all architectures were tuned via standard hyperparameter optimization methods.

The specialized NAS results are less competitive on the PTB benchmark compared to manually designed architectures. It is surprising, however, that cheaper evaluation methods outperform full training on this benchmark; this is likely due to the additional advances that have been made in training LSTMs since the publication of Zoph et al., 2016.

Are specialized NAS methods ready for widespread adoption?

Not yet! To be clear, exploring various architectures and performing extensive hyperparameter optimization remain crucial components of any deep learning application workflow. However, in light of the existing research results (as highlighted above), we believe that while specialized NAS methods have demonstrated promising results on these two benchmarks, they are still not ready for prime time for the following reasons:

  1. Since highly tuned, manually designed architectures are competitive with computationally tractable NAS methods on CIFAR-10 and outperform specialized NAS methods on PTB, we believe resources are better spent on hyperparameter optimization of existing manually designed architectures.
  2. Most specialized NAS methods are fairly specific to a given search space and need to be retrained or retooled for each new search space. Additionally, certain approaches suffer from robustness issues and can be hard to train. These issues currently hinder the general applicability of existing specialized NAS methods to different tasks.


Continue reading What is neural architecture search?.

Categories: Technology

Four short links: 19 December 2018

O'Reilly Radar - Wed, 2018/12/19 - 08:55

Observable Notebooks, Disinformation Report, Chained Blocking, and Trivia from 2018

  1. Observable Notebooks -- JavaScript notebooks. (via Observable Notebooks and iNaturalist)
  2. Disinformation Report -- selective amplification (or pre-consumption filtering) remains one of the most interesting open challenges in infotech, and this report gives context and urgency to it. The IRA shifted a majority of its activity to Instagram in 2017; this was perhaps in response to increased scrutiny on other platforms, including media coverage of its Twitter operation. Instagram engagement outperformed Facebook. New Knowledge note that the Russian misinformation agency was run like a digital marketing shop [...] They built their content using digital marketing best practices, even evolving page logos and typography over time. (via Renee DiResta)
  3. Twitter Block Chain -- a Chrome extension that blocks followers of the jerk, not just the jerk themselves. The power of the open web is that we can write the tools the platforms don't yet provide, however clunky. (via Hadyn Green)
  4. 52 Things I Learned in 2018 -- each comes with attribution. Three sample facts, sans attribution: (*) 35% of Rwanda’s national blood supply outside the capital city is now delivered by drone. (*) [Unicode] includes a group of ‘ghost characters’ (妛挧暃椦槞蟐袮閠駲墸壥彁) which have no known meaning. It’s believed they are errors introduced by folds and wrinkles during a paper-based 1978 Japanese government project to standardize the alphabet, but are now locked into the standard forever. (*) Cassidy Williams had a dream about a Scrabble-themed mechanical keyboard. When she woke up, she started cold-calling Hasbro to ask for permission to make it real. Eventually, she made it happen.

Continue reading Four short links: 19 December 2018.

Categories: Technology

Deep automation in machine learning

O'Reilly Radar - Wed, 2018/12/19 - 05:00

We need to do more than automate model building with autoML; we need to automate tasks at every stage of the data pipeline.

In a previous post, we talked about applications of machine learning (ML) to software development, which included a tour through sample tools in data science and for managing data infrastructure. Since that time, Andrej Karpathy has made some more predictions about the fate of software development: he envisions a Software 2.0, in which the nature of software development has fundamentally changed. Humans no longer implement code that solves business problems; instead, they define desired behaviors and train algorithms to solve their problems. As he writes, “a neural network is a better piece of code than anything you or I can come up with in a large fraction of valuable verticals.” We won’t be writing code to optimize scheduling in a manufacturing plant; we’ll be training ML algorithms to find optimum performance based on historical data.

If humans are no longer needed to write enterprise applications, what do we do? Humans are still needed to write software, but that software is of a different type. Developers of Software 1.0 have a large body of tools to choose from: IDEs, CI/CD tools, automated testing tools, and so on. The tools for Software 2.0 are only starting to exist; one big task over the next two years is developing the IDEs for machine learning, plus other tools for data management, pipeline management, data cleaning, data provenance, and data lineage.

Karpathy’s vision is ambitious, and we don’t think enterprise software developers need to worry about their jobs any time soon. However, it is clear that the way software is developed is changing. With machine learning, the challenge isn’t writing the code; the algorithms are implemented in a number of well-known and highly optimized libraries. We don’t need to implement our own versions of long short-term memory (LSTM) or reinforcement learning; we get that from PyTorch, Ray RLlib, or some other library. However, machine learning isn’t possible without data, and our tools for working with data aren’t adequate. We have great tools for working with code: creating it, managing it, testing it, and deploying it. But they don’t address the data side, and with ML, managing the data is as important as managing the code itself. GitHub is an excellent tool for managing code, but we need to think about [code+data]. There is no GitHub for data, though we are starting to see version control projects for machine learning models, such as DVC.

It’s important to think precisely about what git does. It captures source code, and all the changes to the source code. For any codebase, it can tell you where the code came from (provenance), and all the changes that led from the original commit to the version you downloaded. It’s capable of maintaining many different branches, reflecting different custom views of the code. If someone has changed a line of code, you will see that change, and who made it. And (with some human help and pain) it can resolve conflicting changes on different branches. Those capabilities are all important for data; but good as git is for code, it isn’t adequate for data. It has trouble with data that isn’t formatted as a sequence of lines (like source code), has problems with binary data, and it chokes on huge files. And it is ill-suited for tracking transformations that change every item in a data set, such as a matrix multiplication or normalization.

We also need better tools for collecting data. Given all the talk about the explosion of data, it’s ironic that most of the data that's exploding falls on the floor and is never captured. Data management isn’t limited to issues like provenance and lineage; one of the most important things you can do with data is collect it. Given the rate at which data is created, data collection has to be automated. How do you do that without dropping data? Given that the results produced by any model will reflect the data used to create the model, how do you ensure your data collection process is fair, representative, and unbiased?

Toward a sustainable ML practice

In our forthcoming report Evolving Data Infrastructure, one aspect we studied was what European organizations were doing to build a sustainable machine learning practice: not a proof of concept or a one-time cool idea to be dropped when the next technical fad comes along, but a permanent part of the organization’s plans. It’s one thing to kick the tires briefly; it’s something else entirely to deeply build the infrastructure needed to integrate machine learning into your organization.

Building a sustainable practice means investing in the tools that allow you to work effectively over the long term. These tools enable you to build software you can rely on, not just proof-of-concept hacks that don’t need to be duplicated. These tools include basics like ETL (extract, transform and load: extracting data from multiple sources, transforming it into a form that’s useful, and loading it into a datastore for analysis). It’s no surprise that companies are investing in data science platforms to run machine learning at scale, just as they invested in Hadoop a decade ago. And given that most of the work of a data scientist is cleaning the data prior to analysis, it’s no surprise that most companies are investing in tools for data preparation. These are tools we would have expected to see on the list five years ago as companies started building their data science practices.

We also see investment in new kinds of tools. Anomaly detection is well-known in the financial industry, where it’s frequently used to detect fraudulent transactions, but it can also be used to catch and fix data quality issues automatically. This isn’t surprising; if you’re collecting data from several weather stations and one of them malfunctions, you would expect to see anomalous data. A faulty weather station might stop reporting data (which might be turned into zeros, infinities, or nulls in your data stream), or it might just send readings that are a few degrees above what’s expected, or that are out of line with other stations in the area. In any case, there will be an anomaly in the input data, and it will be easier for a machine to detect that anomaly than a human. If you suddenly see unexpected patterns in your social data, that may mean adversaries are attempting to poison your data sources. Anomaly detection may have originated in finance, but it is becoming a part of every data scientist’s toolkit.
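The weather-station example can be sketched as a small data-quality check. This is an illustrative sketch under stated assumptions, not a description of any tool mentioned here: the function name `flag_anomalous_stations`, the input shape (a dict of station name to one reading), and the use of a median-based modified z-score with a 3.5 cutoff are all choices made for this example.

```python
import math
from statistics import median


def flag_anomalous_stations(readings, threshold=3.5):
    """Return sorted names of stations whose reading is missing, non-finite, or an outlier.

    `readings` maps a station name to a temperature (float) or None.
    """
    flagged = set()
    valid = {}
    for station, value in readings.items():
        # A station that stopped reporting may show up as None, NaN, or infinity.
        if value is None or not math.isfinite(value):
            flagged.add(station)
        else:
            valid[station] = value

    if len(valid) >= 3:
        center = median(valid.values())
        # Median absolute deviation: a spread estimate that outliers barely move.
        mad = median(abs(v - center) for v in valid.values())
        if mad > 0:
            for station, value in valid.items():
                # 0.6745 rescales the MAD to be comparable to a standard deviation.
                if 0.6745 * abs(value - center) / mad > threshold:
                    flagged.add(station)
    return sorted(flagged)


readings = {
    "north": 12.1, "south": 11.8, "east": 12.4, "west": 12.0,
    "hill": 19.5, "dock": None, "pier": float("inf"),
}
print(flag_anomalous_stations(readings))
```

Comparing each station against the median of its neighbors, rather than the mean, keeps one badly wrong reading from hiding itself by dragging the baseline toward it.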

Metadata analysis makes it possible to build data catalogs, which in turn allow humans to discover data that’s relevant to their projects. Democratizing access to data is a major step in the process of becoming a data-driven (or an AI-driven) company; users must be empowered to explore data and to create their own projects. That is difficult without some kind of data catalog. You can tell users they have access to all the data they need, and give them access to databases, but unless they know what data is available and how to find it, that access doesn’t mean anything. Creating that catalog by hand isn’t possible; it needs to be automated.

Data lineage

The history of data analysis has been plagued with a cavalier attitude toward data sources. That is ending; discussions of data ethics have made data scientists aware of the importance of data lineage and provenance. Both refer to the source of the data: where does the data come from, how was it gathered, and how was it modified along the way? Data provenance is increasingly a legal issue; it’s clearly important to know where data came from and how it was obtained. It’s particularly important when you’re combining data from multiple sources; we’ve often observed that data is most powerful when several sources are combined. Provenance can get very complex, particularly when results generated from one set of data are further combined with other data.

It’s important to be able to trace data lineage at a granular level, to understand the entire path from the source to the application. Data is modified all the time: it’s often been observed that most of the work in data science is cleanup or preparation. Data cleaning involves modifying the data: eliminating rows that have missing or illegal values, for example. We’re beginning to understand exactly how important it is to understand what happened during that cleanup, how data evolved from its raw state: that can be a source of error and bias. As companies ingest and use more data, and as the number of consumers of that data increases, it’s important to know the data is trustworthy. When data is modified, it’s important to track exactly how and when it was modified.
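One way to track how and when data was modified is to have every cleaning step write a lineage record as a side effect. The following is a minimal sketch under that assumption; the record fields, the `fingerprint` helper, and the sample rows are hypothetical and do not follow any standard lineage format.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(rows):
    """Stable content hash of a list of dict rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def drop_incomplete_rows(rows, required, lineage):
    """Remove rows missing any required field and append a lineage record."""
    kept = [row for row in rows if all(row.get(key) is not None for key in required)]
    lineage.append({
        "step": "drop_incomplete_rows",
        "required_fields": list(required),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rows_in": len(rows),
        "rows_out": len(kept),
        "input_sha256": fingerprint(rows),
        "output_sha256": fingerprint(kept),
    })
    return kept


lineage = []
raw = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 45, "income": None},
]
clean = drop_incomplete_rows(raw, required=("age", "income"), lineage=lineage)
```

Here two of three rows are silently discarded, which is exactly the kind of cleanup that can introduce bias. With the record in hand, a later reader can at least see that it happened, when, and to which version of the data.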

The tools for tracking data provenance and lineage are limited, although products from commercial vendors such as Trifacta are starting to appear. Git and its predecessors (SVN and even RCS) can track every change to every line of code in software, maintain multiple branches of the code, and reconcile differences between branches. How do we do that for data? Furthermore, what will we do with the results? It's common to normalize data, or to transform it in some way, but such transformations can easily change every byte in the data set.

Such changes aren't the only problem. Tools like git expect humans to supply a comment explaining why a change was made whenever they commit a new version. That's not possible with an automated data pipeline. It might be possible for systems to log and "explain" the changes they make, but this assumes you have fine-grained control to force them to do so.

Such control may be possible within the scope of a single tool. For example, Jacek Laskowski describes how to extract a resilient distributed data set (RDD) lineage graph that describes a series of Spark transformations. This graph could be committed to a lineage tracking system, or even a more traditional version-control system, to document transformations that have been applied to the data. But this process only applies to a single machine learning platform: Spark. To be generally useful, every platform would need to support extracting a lineage graph, preferably in a single format and without requiring additional scripting by developers. That's a good vision for where we need to go, but we're not there yet.

Data provenance and lineage aren’t just about the quality of the results; they’re also a security and compliance issue. At the Strata Data Conference in New York in 2017, danah boyd argued that social media systems were intentionally poisoned by tools that propagated low-quality content designed to sway the algorithms that determined what people watch. Malicious agents have learned to “hack the attention economy.” In "Flat Light: Data Protection for the Disoriented, from Policy to Practice," Andrew Burt and Daniel Geer argue that in the past, data accuracy was binary; data was either correct or incorrect. Now, data provenance is as important as correctness, if not more so. You can’t judge whether data is reliable if you don’t know its origin. For machine learning systems, this means we need to track source data as well as source code: the data used to train the system is as important to its behavior as the algorithms and their implementation.

We are starting to see some tools that automate the detection of data quality issues. Intuit uses the Circuit Breaker pattern to halt data pipelines when they detect anomalies in the data. Their tool tracks data lineage because it’s important to understand the inputs and outputs of every stage of the pipeline; it also tracks the status of the pipeline components themselves and the quality of the data at every stage of the pipeline (is it within expected bounds, is it of the appropriate type, etc.). Intuit, Netflix, and Stitch Fix have built data lineage systems that track the origin and evolution of the data that they use in their systems.
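The general idea of a circuit breaker in a data pipeline can be sketched in a few lines of Python. This is not Intuit's implementation, whose details aren't described here; it is a generic illustration, and the class name, the checks, and the temperature bounds are all hypothetical.

```python
class CircuitOpenError(RuntimeError):
    """Raised when a quality check fails or the pipeline is already halted."""


class DataCircuitBreaker:
    def __init__(self, checks):
        self.checks = checks      # list of (name, predicate) pairs
        self.is_open = False      # an open circuit means the pipeline is halted
        self.failed_checks = []

    def guard(self, batch):
        """Return the batch if every check passes; otherwise open the circuit."""
        if self.is_open:
            raise CircuitOpenError("circuit is open; pipeline halted")
        failed = [name for name, check in self.checks if not check(batch)]
        if failed:
            self.is_open = True
            self.failed_checks = failed
            raise CircuitOpenError(f"quality checks failed: {failed}")
        return batch


def is_number(value):
    return isinstance(value, (int, float))


# Hypothetical checks: is the data of the appropriate type and within bounds?
checks = [
    ("non_empty", lambda batch: len(batch) > 0),
    ("numeric_type", lambda batch: all(is_number(x) for x in batch)),
    ("within_bounds", lambda batch: all(is_number(x) and -50 <= x <= 60 for x in batch)),
]

breaker = DataCircuitBreaker(checks)
loaded = []
for batch in [[12.1, 11.8], [12.4, 999.0], [12.0]]:
    try:
        loaded.append(breaker.guard(batch))
    except CircuitOpenError:
        break  # stop feeding downstream stages until someone investigates
```

The second batch contains an out-of-range value, so the circuit opens and the third batch is never loaded, even though it is valid. That is the point of the pattern: once the data looks wrong, nothing downstream runs until the problem is understood.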

Automation is more than model building

In the past year, we have seen several companies build tools to “automate machine learning,” including Google and Amazon. These tools automate the process of building models: trying different algorithms and topologies, to minimize error when the model is used on test data. But these tools just build models, and we’ve seen that machine learning requires much more. The model can’t exist without tools for data integration and ETL, data preparation, data cleaning, anomaly detection, data governance, and more. Automating model building is just one component of automating machine learning.

To be truly useful, automated machine learning has to go much deeper than model building. It’s too simple to think a machine learning project will require a single model; one project can easily require several different models, doing different things. And different aspects of the business, while superficially similar, can require different models, trained from different data sources. Consider a hotel business such as Marriott: more than 6,000 hotels, and more than $20 billion in gross revenue. Any hotel would like to predict occupancy, income, and the services they need to provide. But each hotel is a completely different business: The Times Square Marriott is dominated by large corporate conferences and New York City tourism, while the Fairfield Inn in Sebastopol is dominated by local events and wine country tourism. The customer demographics are different; but more than that, the event sources are different. The Sebastopol hotel needs to know about local weddings and wine country events; I’d expect them to use natural language processing to parse feeds from local newspapers. The Times Square hotel needs to know about Broadway openings, Yankees games, and Metro-North train schedules. This isn’t just a different model; these two businesses require completely different data pipelines. Automating the model building process is helpful, but it doesn’t go far enough.

Hotels aren’t the only business requiring more models than humans can conceivably build. Salesforce provides AI services for its clients, which number in the hundreds of thousands. Each client needs a custom model; models can’t be shared, even between clients in similar businesses. Aside from confidentiality issues, no two clients have the same customers or the same data, and small differences between clients can add up to large errors. Even with the most optimistic estimates for machine learning talent, there aren’t enough people to build that many models by hand. Salesforce’s solution is TransmogrifAI, an open source automated ML library for structured data. TransmogrifAI automates the model building process, like other Auto ML solutions, but it also automates many other tasks, including data preparation and feature validation.

Other enterprise software vendors are in the same boat: they have many customers, each of whom requires “custom models.” They cannot hire enough data scientists to support all of these customers with conventional manual workflows. Automation isn’t an option; it’s a necessity.

Automation doesn’t stop when the model is “finished”; in any real-world application, the model can never be considered “finished.” Any model’s performance will degrade over time: situations change, people change, products change, and the model may even play a role in driving that change. We expect to see new tools for automating model testing, either alerting developers when a model needs to be re-trained or starting the training process automatically. And we need to go even further: beyond simple issues of model accuracy, we need to test for fairness and ethics. Those tests can’t be automated completely, but tools can be developed to help domain experts and data scientists detect problems of fairness. For example, such a tool might generate an alert when it detects a potential problem, like a significantly higher loan rejection rate for a protected group; it might also provide tools to help a human expert analyze the problem and make a correction.
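A fairness alert of the kind described above could start as something very simple: compare rejection rates across groups and flag large gaps for human review. The sketch below assumes decisions arrive as (group, approved) pairs; the group labels, the function names, and the 10-point gap threshold are hypothetical, and a flagged gap is a prompt for analysis by a domain expert, not a verdict.

```python
def rejection_rates(decisions):
    """decisions: iterable of (group, approved) pairs -> {group: rejection rate}."""
    totals, rejected = {}, {}
    for group, approved in decisions:
        totals[group] = totals.get(group, 0) + 1
        if not approved:
            rejected[group] = rejected.get(group, 0) + 1
    return {group: rejected.get(group, 0) / totals[group] for group in totals}


def fairness_alerts(decisions, reference_group, max_gap=0.10):
    """Return {group: gap} for groups rejected noticeably more often than the reference."""
    rates = rejection_rates(decisions)
    baseline = rates[reference_group]
    return {
        group: rate - baseline
        for group, rate in rates.items()
        if group != reference_group and rate - baseline > max_gap
    }


# Hypothetical loan decisions: group A is rejected 20% of the time, group B 50%.
decisions = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 5 + [("B", False)] * 5
)
alerts = fairness_alerts(decisions, reference_group="A")
```

A check like this could run alongside accuracy tests every time a model is retrained, with any alert routed to a human who can look at the underlying cases.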

Closing thoughts

The way we build software is changing. Whether or not we get to Karpathy’s Software 2.0, we’re certainly on a road headed in that direction. The future holds more machine learning, not less; developing and maintaining models will be part of the job of building software. Software developers will be spending less time writing code and more time training models.

However, the lack of data—and of tools for working with data—remains a fundamental bottleneck. Over the past 50 years, we’ve developed excellent tools for working with software. We now need to build the tools for software+data: tools to track data provenance and lineage, tools to build catalogs from metadata, tools to do fundamental operations like ETL. Companies are investing in these foundational technologies.

The next bottleneck will be model building itself; the number of models we need will always be much greater than the number of people capable of building those models by hand. Again, the solution is building tools for automating the process. We need to do more than automate model building with autoML; we also need to automate feature engineering, data preparation, and other tasks at every stage of the data pipeline. Software developers are, after all, in the business of automation. And the most important thing for software developers to automate is their own work.

Continue reading Deep automation in machine learning.

Categories: Technology

10 top AWS resources on O’Reilly’s online learning platform

O'Reilly Radar - Wed, 2018/12/19 - 04:00

Our most-used AWS resources will help you stay on track in your journey to learn and apply AWS.

We dove into the data on our online learning platform to identify the most-used Amazon Web Services (AWS) resources. These are the items our platform subscribers regularly turn to as they apply AWS in their projects and organizations.

AWS Certified Solutions Architect Official Study Guide — This official study guide, written by AWS experts, covers exam concepts and provides key review on exam topics.

Amazon Web Services AWS: AWS Fundamentals — Richard Jones walks you through six hours of video instruction on AWS with coverage on cloud computing and available AWS services and provides a guided hands-on look at using services such as EC2 (Elastic Compute Cloud), S3 (Simple Storage Service), and more.

AWS System Administration — Federico Lucifredi and Mike Ryan show developers and system administrators how to configure and manage AWS services, including EC2, CloudFormation, Elastic Load Balancing, S3, and Route 53.

Serverless Architectures on AWS: With Examples Using AWS Lambda — Peter Sbarski teaches you how to build, secure, and manage serverless architectures that can power the most demanding web and mobile apps.

Learning AWS, 2nd Edition — Amit Shah and Aurobindo Sarkar help you become well-versed with the various services that AWS provides and show you how to use AWS infrastructure to accelerate the development process.

Automation in AWS with CloudFormation, CLI, and SDKs — Richard Jones covers AWS services and tools used to automate the creation and maintenance of AWS infrastructure, including VPC, EC2, Lambda, RDS, and deploying containerized microservices with Docker.

AWS: Security Best Practices on AWS — Albert Anthony focuses on using native AWS security features and managed AWS services to help you achieve continuous security.

Amazon Web Services in Action — Michael Wittig and Andreas Wittig introduce you to computing, storing, and networking in the AWS cloud.

AWS CloudFormation Master Class — Stéphane Maarek teaches you to write complete AWS CloudFormation templates using YAML and covers all the recent CloudFormation features.

AWS Lambda in Action: Event-Driven Serverless Applications — Danilo Poccia offers an example-driven tutorial that teaches you how to build applications that use an event-driven approach on the back end.

Continue reading 10 top AWS resources on O’Reilly’s online learning platform.

Categories: Technology

Four short links: 18 December 2018

O'Reilly Radar - Tue, 2018/12/18 - 05:05

Singing AI, Content Signing, Data Rights, and Query Processing

  1. AI Voices -- marketing copy, but I can't find technical detail. The demos are worth checking out. The sprint to automated pop music generation has begun. Not just limited to Japanese, as it is also capable of producing convincing Mandarin and even English voices for songs such as Adele’s "Rolling in the Deep" and Britney Spears’ "Everytime" on their official website.
  2. Notary -- publishers can sign their content offline using keys kept highly secure. Once the publisher is ready to make the content available, they can push their signed trusted collection to a Notary server. Consumers, having acquired the publisher's public key through a secure channel, can then communicate with any Notary server or (insecure) mirror, relying only on the publisher's key to determine the validity and integrity of the received content.
  3. It's Time for a Bill of Data Rights (MIT TR) -- this essay argues that “data ownership” is a flawed, counterproductive way of thinking about data. It not only does not fix existing problems, it creates new ones. Instead, we need a framework that gives people rights to stipulate how their data is used without requiring them to take ownership of it themselves. (via Cory Doctorow)
  4. Trill -- a single-node query processor for temporal or streaming data: open source from Microsoft. Described in this blog post.

Continue reading Four short links: 18 December 2018.

Categories: Technology

Four short links: 17 December 2018

O'Reilly Radar - Mon, 2018/12/17 - 05:10

Open Source Licensing, Computer History, Serverless, and Wicked Problems

  1. Open Source Confronts Its Midlife Crisis (Bryan Cantrill) -- To be clear, the underlying problem is not the licensing, it’s that these companies don’t know how to make money—they want open source to be its own business model, and seeing that the cloud service providers have an entirely viable business model, they want a piece of the action. Also see Bryan's followup: A EULA in FOSS Clothing: You will notice that this looks nothing like any traditional source-based license—but it is exactly the kind of boilerplate that you find on EULAs, terms-of-service agreements, and other contracts that are being rammed down your throat.
  2. A Computer of One's Own -- fantastic precis of the work of significant women in computing history.
  3. Serverlessness (Tim Bray) -- Tim works in AWS's Serverless group and has been collecting what he's learned in his years building serverless infrastructure.
  4. Why We Suck at Solving Wicked Problems -- this rings true with my experience.

Continue reading Four short links: 17 December 2018.

Categories: Technology

Four short links: 14 December 2018

O'Reilly Radar - Fri, 2018/12/14 - 02:00

Satellite LoRaWAN, Bret Victor, State of AI, and Immutable Documentation

  1. Fleet -- launched satellites as backhaul for LoRaWAN base station traffic.
  2. Computing is Everywhere -- podcast episode with Bret Victor. Lots of interesting history and context to what he's up to at Dynamicland. (via Paul Ford)
  3. AI Index 2018 Report (Stanford) -- think of it as the Mary Meeker report for AI.
  4. Etsy's Experiment with Immutable Documentation -- In trying to overcome the problem of staleness, the crucial observation is that how-docs typically change faster than why-docs do. Therefore the more how-docs are mixed in with why-docs in a doc page, the more likely the page is to go stale. We’ve leveraged this observation by creating an entirely separate system to hold our how-docs.

Continue reading Four short links: 14 December 2018.

Categories: Technology

Four short links: 13 December 2018

O'Reilly Radar - Thu, 2018/12/13 - 02:00

CS Ethics, Insect IoT, Glitch Showcase, and SQL Repos

  1. Embedded Ethics -- Harvard project that integrates ethics modules into courses across the standard computer science curriculum. Those modules are straightforward, online, and open access.
  2. Living IOT: A Flying Wireless Platform on Live Insects -- We develop and deploy our platform on bumblebees which includes backscatter communication, low-power self-localization hardware, sensors, and a power source. We show that our platform is capable of sensing, backscattering data at 1 kbps when the insects are back at the hive, and localizing itself up to distances of 80 m from the access points, all within a total weight budget of 102 mg. (via BoingBoing)
  3. Looky What We Made -- showcase of Glitch apps.
  4. Git Your SQL Together -- why I recommend tracking SQL queries in git: 1. You will *always* need that query again. 2. Queries are living artifacts that change over time. 3. If it’s useful to you, it’s useful to others (and vice versa)

Continue reading Four short links: 13 December 2018.

Categories: Technology

Four short links: 12 December 2018

O'Reilly Radar - Wed, 2018/12/12 - 05:00

Render as Comic, Notebook to Production, Population Visualization, and Location Privacy

  1. Comixify -- render video as comics.
  2. How to Grow Neat Software Architecture out of Jupyter Notebooks -- everyone's coding in notebooks as a sweet step up from the basic one-command REPL loop. Here's some good advice on how to grow these projects without creating a spaghetti monster.
  3. City 3D -- This project wields data from the Global Human Settlement Layer, which uses “satellite imagery, census data, and volunteered geographic information” to create population density maps. Best visualization I've seen in a very long time.
  4. Your Apps Know Where You Were Last Night, and They're Not Keeping It Secret (NY Times) -- At least 75 companies receive anonymous, precise location data from apps whose users enable location services to get local news and weather or other information. They claim 200M mobile devices, with updates as often as every six seconds. These companies sell, use, or analyze the data to cater to advertisers, retail outlets, and even hedge funds seeking insights into consumer behavior. [...] An app may tell users that granting access to their location will help them get traffic information, but not mention that the data will be shared and sold. That disclosure is often buried in a vague privacy policy.

Continue reading Four short links: 12 December 2018.

Categories: Technology

Four short links: 11 December 2018

O'Reilly Radar - Tue, 2018/12/11 - 12:25

Can We Stop?, Everything Breaks, Edge Cloud, and Molly Guard

  1. The Seductive Diversion of Solving Bias in Artificial Intelligence -- provocative title, but the point is that the preoccupation with narrow computational puzzles distracts us from the far more important issue of the colossal asymmetry between societal cost and private gain in the rollout of automated systems. It also denies us the possibility of asking: should we be building these systems at all? The expected value of pursuing this line of thinking is pretty low because there's a vanishingly small probability that we can coordinate activity globally to prevent something bad from happening. Exhibit A: climate change.
  2. Everything Breaks (Michael Lopp) -- Humans will greatly benefit from a clear explanation of the rules of the game. The rules need to evolve in unexpected ways to account for the arrival of more humans. The only way to effectively learn what is going to break is to keep playing...and learning. See also lessons learned from scaling Stripe's engineering team.
  3. Terrarium (Fastly) -- an interesting glimpse at a possible future for web apps, where your CDN (which you need to have anyway if you're publishing anything remotely contentious or interesting) blurs with your hosting infrastructure provider. Terrarium is a multi-language deployment platform based on WebAssembly. Think of it as a playground for experimenting with edge-side WebAssembly. Being one of the first Fastly Labs projects, you can also think of it as our way of publicly experimenting with what the future of real highly performant edge computing could look like.
  4. molly-guard -- protects machines from accidental shutdowns/reboots. Etymology of the name: originally a Plexiglas cover improvised for the Big Red Switch on an IBM 4341 mainframe after a programmer's toddler daughter (named Molly) tripped it twice in one day. Later generalized to covers over stop/reset switches on disk drives and networking equipment. (via Mike Forbes)

Continue reading Four short links: 11 December 2018.

Categories: Technology

