Feed aggregator

Tools for machine learning development

O'Reilly Radar - Wed, 2019/07/03 - 06:35

The O'Reilly Data Show: Ben Lorica chats with Jeff Meyerson of Software Engineering Daily about data engineering, data architecture and infrastructure, and machine learning.

In this week's episode of the Data Show, we're featuring an interview with Data Show host Ben Lorica, conducted by Jeff Meyerson for the Software Engineering Daily podcast. Their conversation centered on data engineering, data architecture and infrastructure, and machine learning (ML).

Categories: Technology

New live online training courses

O'Reilly Radar - Wed, 2019/07/03 - 04:20

Get hands-on training in TensorFlow, AI applications, critical thinking, Python, data engineering, and many other topics.

Learn new topics and refine your skills with more than 150 new live online training courses we opened up for July and August on the O'Reilly online learning platform.

AI and machine learning

Getting Started with TensorFlow.js, July 23

Building Intelligent Analytics Through Time Series Data, July 31

Natural Language Processing (NLP) from Scratch, August 5

Cloud Migration Strategy: Optimizing Future Operations with AI, August 7

Intermediate Natural Language Processing (NLP), August 12

Machine Learning for Business Analytics: A Deep Dive into Data with Python, August 19

Inside unsupervised learning: Semisupervised learning using autoencoders, August 20

TensorFlow 2.0 Essentials – What's New, August 23

A Practical Introduction to Machine Learning, August 26

Artificial Intelligence: Real-world Applications, August 26

Inside Unsupervised Learning: Generative Models and Recommender Systems, August 27

Hands-On Algorithmic Trading with Python, September 3

Artificial Intelligence: AI for Business, September 4

TensorFlow Extended: Data Validation and Transform, September 11

Blockchain

Introducing Blockchain, August 2

Business

Building Your People Network, July 8

Getting Unstuck, August 5

How to Choose Your Cloud Provider, August 7

Spotlight on Data: Data Pipelines and Power Imbalances—3 Cautionary Tales with Catherine D’Ignazio and Lauren Klein, August 19

Salary Negotiation Fundamentals, August 20

Fundamentals of Cognitive Biases, August 20

Empathy at Work, August 20

Developing Your Coaching Skills, August 21

Applying Critical Thinking, August 22

Building Your People Network, August 27

60 Minutes to Designing a Better PowerPoint Slide, August 27

60 Minutes to a Better Prototype, August 27

Introduction to Critical Thinking, August 27

Spotlight on Learning from Failure: Fixing with Sha Hwang, August 27

Managing Your Manager, August 28

Scrum Master: Good to Great, August 29

Being a Successful Team Member, September 4

Fundamentals of Learning: Learn faster and better using neuroscience, September 5

Leadership Communication Skills for Managers, September 10

Getting S.M.A.R.T about Goals, September 10

Spotlight on Innovation: Enabling Growth Through Disruption with Scott Anthony, September 11

Writing User Stories, September 11

Data science and data tools

Applied Probability Theory from Scratch, July 17

Interactive Visualization Approaches, July 25

Apache Hadoop, Spark and Big Data Foundations, August 1

Visualizing Software Architecture with the C4 Model, August 2

Data Engineering for Data Scientists, August 6

Analyzing and Visualizing Data with Microsoft Power BI, August 9

Hands-on Introduction to Apache Hadoop and Spark Programming, August 12-13

Scalable Data Science with Apache Hadoop and Spark, August 19

IoT Fundamentals, August 20-21

Algorithmic Risk Management in Trading and Investing, August 23

Business Data Analytics Using Python, August 26

Python Data Science Full Throttle with Paul Deitel: Introductory Artificial Intelligence (AI), Big Data and Cloud Case Studies, August 26

Real-time Data Foundations: Flink, August 27

Managing Enterprise Data Strategies with Hadoop, Spark, and Kafka, August 29

Design and product management

Introduction to UI & UX design, August 28

Programming

Kotlin for Android, July 11-12

SQL for Any IT Professional, July 16

Design Patterns in Java, July 29-30

Discovering Modern Java, August 2

Essentials of JVM Threading, August 2

Getting Started with Pandas, August 6

Programming with Data: Foundations of Python and Pandas, August 12

Beginner’s Guide to Writing AWS Lambda Functions in Python, August 12

Solving Java Memory Leaks, August 12

Introduction to Python Programming, August 12

Working with Dataclasses in Python 3.7, August 15

Reactive Programming with Java Completable Futures, August 15

Getting Started with Python's Pytest, August 19

Visualization in Python with Matplotlib, August 19

Python Full Throttle with Paul Deitel: A One-Day, Fast-Paced, Code-Intensive Python, August 19

Oracle Java SE Programmer I Crash Course: Pass the 1Z0-815 or 1Z0-808 Exams, August 19-21

Linux Troubleshooting: Advanced Linux Techniques, August 20

Introduction to the Bash Shell, August 21

Getting Started with Node.js, August 21

Applied Cryptography with Python, August 22

Mentoring Technologists, August 22

CSS Layout Fundamentals: From Floats to Flexbox and CSS Grid, August 22

React Hooks in Action, August 23

Getting Started with Java: From Core Concepts to Real Code in 4 Hours, August 23

Bash Shell Scripting in 4 Hours, August 23

Continuous Delivery and Tooling in Go, August 26

Mastering SELinux, August 26

Functional Programming in Java, August 26-27

Scalable Concurrency with the Java Executor Framework, August 29

SOLID Principles of Object-Oriented and Agile Design, August 30

Fraud Analytics using Python, September 3

Getting Started with Spring and Spring Boot, September 3-4

Linear Algebra with Python: Essential Math for Data Science, September 5

Python-Powered Excel, September 9

Design Patterns Boot Camp, September 9-10

Secure JavaScript with Node.js, September 12

Security

Introduction to Digital Forensics and Incident Response (DFIR), July 31

Cisco Security Certification Crash Course, August 16

Security Operation Center (SOC) Best Practices, August 19

Expert Transport Layer Security (TLS), August 20

CompTIA A+ Core 1 (220-1001) Certification Crash Course, August 21-22

Introduction to Ethical Hacking and Penetration Testing, August 22-23

CISSP Crash Course, August 27-28

CISSP Certification Practice Questions and Exam Strategies, August 28

Defensive Cybersecurity Fundamentals, August 29

Cybersecurity Offensive and Defensive Techniques in 3 Hours, August 30

Azure Security Fundamentals, September 4

Systems engineering and operations

DevOps on Google Cloud Platform (GCP), July 8

Getting Started with Microsoft Azure, July 12

Getting Started with Amazon Web Services (AWS), July 24-25

Ansible for Managing Network Devices, August 1

Software Architecture for Developers, August 1

Practical Software Design from Problem to Solution, August 1

Facebook Libra, August 1

Introducing Infrastructure as Code with Terraform, August 1

AWS CloudFormation Deep Dive, August 5-6

Rethinking REST: A hands-on guide to GraphQL and queryable APIs, August 6

Julia 1.0 Essentials, August 6

Getting Started with Serverless Architectures on Azure, August 8

Deploying Container-Based Microservices on AWS, August 12-13

AWS Access Management, August 13

Exam AZ-103: Microsoft Azure Administrator Crash Course, August 15-16

Architecture for Continuous Delivery, August 19

Getting Started with OpenStack, August 19

AWS Certified Big Data - Specialty Crash Course, August 19-20

Google Cloud Platform – Professional Cloud Developer Crash Course, August 19-20

CompTIA Network+ N10-007 Crash Course, August 19-21

Shaping and Communicating Architectural Decisions, August 20

AWS Certified Cloud Practitioner Exam Crash Course, August 20-21

Software Architecture Foundations: Characteristics and Tradeoffs, August 21

Google Cloud Platform Professional Cloud Architect Certification Crash Course, August 21-22

Red Hat RHEL 8 New Features, August 22

Introduction to Google Cloud Platform, August 22-23

Istio on Kubernetes: Enter the Service Mesh, August 27

AWS Monitoring Strategies, August 27-28

Red Hat Certified System Administrator (RHCSA) Crash Course, August 27-30

Azure Architecture: Best Practices, August 28

Web Performance in Practice, August 28

AWS Account Setup Best Practices, August 29

Getting Started with Amazon SageMaker on AWS, August 29

Jenkins 2: Beyond the Basics, September 3

Comparing Service-based Architectures, September 3

Microservice Collaboration, September 3

Introduction to Docker Compose, September 3

Chaos Engineering: Planning and Running Your First Game Day, September 3

Next-level Git: Master Your Workflow, September 4

Introduction to Knative, September 5

Reactive Spring and Spring Boot, September 9

Developing DApps with Ethereum, September 9

Building a Deployment Pipeline with Jenkins 2, September 9-10

Building Data APIs with GraphQL, September 11

Creating React Applications with GraphQL, September 12

Jenkins 2: Up and Running, September 12

Microservices Caching Strategies, September 12

Chaos Engineering: Planning, Designing, and Running Automated Chaos Experiments, September 12

Google Cloud Platform Security Fundamentals, September 12

Understanding AWS Cloud Compute Options, September 12-13

Google Cloud Certified Associate Cloud Engineer Crash Course, September 12-13

Four short links: 3 July 2019

O'Reilly Radar - Wed, 2019/07/03 - 04:00

Models, More Models, Robots.txt, and Event Sourcing

  1. On Models (Tom Stafford) -- a Twitter thread in which he lays out his work on models and their value.
  2. Why Model? -- The [article] distinguishes between explanation and prediction as modeling goals, and offers 16 reasons other than prediction to build a model. It also challenges the common assumption that scientific theories arise from and 'summarize' data, when often, theories precede and guide data collection; without theory, in other words, it is not clear what data to collect. Among other things, it also argues that the modeling enterprise enforces habits of mind essential to freedom.
  3. Robots.txt -- Google's robots.txt parser and matcher as a C++ library (compliant to C++11). Released as part of standardization work.
  4. Mistakes We Made Adopting Event Sourcing (And How We Recovered) -- a useful post for those also considering their first system built around events as the mechanism for changing state.

Four short links: 2 July 2019

O'Reilly Radar - Tue, 2019/07/02 - 05:00

Lock Convoys, AI Hardware, Lambda Observability, and AI for Science

  1. The Convoy Phenomenon (Adrian Colyer) -- locks on resources that lead to performance degradation which never recovers, a situation first described in 1979.
  2. AI is Changing the Entire Nature of Compute (ZD) -- workloads have been doubling every 3.5 months, while our post-Moore's law chip speed increases have been 3.5% per year. What that means, both authors believe, is that the design of chips, their architecture, as it's known, has to change drastically in order to get more performance out of transistors that are not of themselves producing performance benefits. The article explores some of those directions.
  3. The Annoying State of Lambda Observability -- In the current state of the world, the available strategies boil down to either: (1) Send telemetry directly to external observability tools during Lambda execution. (2) Scrape or trigger off the telemetry sent to CloudWatch and X-Ray to populate external providers. Spoiler: neither option is ideal.
  4. Accelerating Science: A Computing Research Agenda -- I found this quite challenging at first because it seemed to be "cheating" somehow. But once I viewed it as the computer augmenting the human, not replacing them, then it was more acceptable. But I can imagine that better tools for each step of the scientific journey (e.g., Expressing, reasoning with, updating scientific arguments (along with supporting assumptions, facts, observations), including languages and inference techniques for managing multiple, often conflicting arguments, assessing the plausibility of arguments, their uncertainty and provenance) will create controversy no less than the software "proof" of the four-color theorem did.

Four short links: 1 July 2019

O'Reilly Radar - Mon, 2019/07/01 - 05:30

General-Purpose Probabilistic Programming, Microsoft's Linux, Decolonizing Data, Testing Statistical Software

  1. Gen -- general-purpose probabilistic programming system with programmable inference. A Julia package described as Gen's flexible modeling and inference programming capabilities unify symbolic, neural, probabilistic, and simulation-based approaches to modeling and inference, including causal modeling, symbolic programming, deep learning, hierarchical Bayesian modeling, graphics and physics engines, and planning and reinforcement learning.
  2. WSL2 Linux Kernel -- source for the Linux kernel used in Windows Subsystem for Linux 2 (WSL2).
  3. Decolonizing Data -- Decolonizing data means that the community itself is the one determining the information they want us to gather. Why are we gathering it? Who's interpreting it? And are we interpreting it in a way that truly serves our communities? Decolonizing data is about controlling our own story and making decisions based on what is best for our people. That hasn't been done in data before, and that's what's shifting and changing.
  4. Testing Statistical Software -- In this post, I describe how I evaluate the trustworthiness of a modeling package, and in particular what I want from the test suite. If you use statistical software, this post will help you evaluate whether a package is worth using. If you write statistical software, this post will help you confirm the correctness of the code that you write.

RISELab’s AutoPandas hints at automation tech that will change the nature of software development

O'Reilly Radar - Mon, 2019/07/01 - 04:00

Neural-backed generators are a promising step toward practical program synthesis.

There's a lot of hype surrounding AI, but are companies actually beginning to use AI technologies? In a survey we released earlier this year, we found that more than 60% of respondents worked in organizations that planned to invest some of their IT budgets into AI. We also found that the level of investment depended on how much experience a company already had with AI technologies, with companies further along the maturity curve planning substantially higher investments. As far as current levels of adoption, the answer depended on the industry sector. We found that in several industries, 30% or more of respondents described their organizations as having a mature AI practice:

Figure 1. Stage of adoption of AI technologies (by industry). Image by Ben Lorica.

In which areas or domains are AI technologies being applied? As with any new technology, AI is used for a lot of R&D-related activity. But we are also beginning to see AI and machine learning gain traction in areas like customer service and IT. In a recent post, we outlined the many areas pertaining to customer experience, where AI-related technologies are beginning to make an impact. This includes things like data quality, personalization, customer service, and many other factors that impact customer experience.

Figure 2. Areas where AI is being applied (by stage of adoption). Image by Ben Lorica.

One area I’m particularly interested in is the application of AI and automation technologies in data science, data engineering, and software development. We’ve sketched some initial manifestations of “human in the loop” technologies in software development, where initial applications of machine learning are beginning to change how people build and manage software systems. Automation has also emerged as one of the hottest topics in data science and machine learning (AutoML), and teams of researchers and practitioners are actively building tools that can automate every stage of a machine learning pipeline.

For a typical data scientist, data engineer, or developer, there is an explosion of tools and APIs they now need to work with and “master.” A data scientist might need to know Python, pandas, numpy, scikit-learn, one or more deep learning frameworks, Apache Spark, and more. According to a recent blog post by Khaliq Gant, a web developer is typically expected to demonstrate competence in things like "navigating the terminal, HTML, CSS, JavaScript, cloud infrastructure, deployment strategies, databases, HTTP protocols, and that’s just the beginning." Data engineers additionally need to master several pieces of infrastructure.

How do data scientists, data engineers, and developers cope with this explosion of tools and APIs? They typically use search (Google) or post in forums (Stack Overflow, Slack, mailing lists). In both instances, it takes some baseline knowledge to frame a question and to discern which answer is the "best one" to choose. In the case of forums, there might be a significant delay before one obtains an adequate response. Those with more resources and more time to spare can avail themselves of free or paid learning resources such as books, videos, or training courses.

There are emerging automation tools that can drastically increase the efficiency and productivity of software developers. At his recent keynote at the Artificial Intelligence conference in Beijing, professor Ion Stoica, director of UC Berkeley’s RISELab, unveiled a new research project that hints at a path forward for software developers. Their initial output is AutoPandas, a program synthesis engine for Pandas, one of the most widely used data science libraries today. As described in a paper from Microsoft and the University of Washington, program synthesis is a longstanding research area in computer science:

Program synthesis is the task of automatically finding a program in the underlying programming language that satisfies the user intent expressed in the form of some specification. Since the inception of AI in the 1950s, this problem has been considered the holy grail of computer science.

An AutoPandas user simply specifies an input and output data structure (i.e., dataframes), and AutoPandas automatically synthesizes an optimal program that produces the desired output from the given input. AutoPandas relies on "program generators" that capture the API constraints to reduce the search space (the space of possible programs is immense), neural network models to predict the arguments of the API calls, and the distributed computing framework Ray to scale up the search.
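To make the specification style concrete, here is a toy sketch of input/output-driven synthesis. This is not AutoPandas's actual API: the `synthesize` function and the tiny hand-picked candidate pool are invented for illustration, and AutoPandas replaces this brute-force enumeration with generator-constrained search guided by neural models.

```python
import pandas as pd

def synthesize(df_in, df_out, candidates):
    """Return the first candidate program whose output matches the target.

    AutoPandas-style tools search an enormous space of pandas programs;
    this sketch just enumerates a small, explicit pool of candidates.
    """
    for name, fn in candidates:
        try:
            result = fn(df_in)
            # Compare values while ignoring row-index differences.
            if result.reset_index(drop=True).equals(df_out.reset_index(drop=True)):
                return name
        except Exception:
            continue  # a candidate may simply not apply to this input
    return None

# The user's "specification" is just an example input and desired output.
df_in = pd.DataFrame({"a": [3, 1, 2]})
df_out = pd.DataFrame({"a": [1, 2, 3]})

candidates = [
    ("df.drop_duplicates()", lambda df: df.drop_duplicates()),
    ("df.sort_values('a')", lambda df: df.sort_values("a")),
    ("df.head(2)", lambda df: df.head(2)),
]

print(synthesize(df_in, df_out, candidates))  # → df.sort_values('a')
```

The point of the sketch is the interface, not the search: the user never writes the program, only the example dataframes, and the engine's job is to find a program consistent with them.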

Figure 3. An AutoPandas user specifies an input and output data structure, and neural-backed generators output an optimal program. Image by Ben Lorica.

While we are still very much in the early days, neural-backed generators are an extremely promising step toward practical program synthesis. Note that while researchers at RISELab have initially focused on pandas, the techniques and tools behind AutoPandas can be applied to other APIs (e.g., numpy, TensorFlow, etc.). So, any number of popular tools used by developers, data scientists, or data engineers can benefit from some level of automation via program synthesis.

Programming tools have always changed over time (I no longer use Perl, for example), and there has always been an expectation that technologists should be able to adapt to the latest tools and methods. Continued progress in tools for program synthesis means that automation will change how data scientists, data engineers, or developers do their work. One can imagine a future where mastery of individual tools and APIs will matter less, and technologists can focus on architecture and building end-to-end systems and applications. As tools and APIs get easier to use, your employer won't care as much about what tools you know coming into a job, but they will expect you to possess "soft skills" (including skills that cannot be easily automated), domain knowledge and expertise, and the ability to think holistically.

Four short links: 28 June 2019

O'Reilly Radar - Fri, 2019/06/28 - 03:40

Heartbeat Identity, Seam Carving, Q&A Facilitation, and Secure Data in Distributed Systems

  1. The Pentagon Has a Laser That Can Identify People From a Distance By Their Heartbeat (MIT TR) -- A new device, developed for the Pentagon after U.S. Special Forces requested it, can identify people without seeing their faces: instead, it detects their unique cardiac signature with an infrared laser. While it works at 200 meters (219 yards), longer distances could be possible with a better laser. [...] It takes about 30 seconds to get a good return, so at present the device is only effective where the subject is sitting or standing.
  2. Real-world Dynamic Programming: Seam Carving -- nifty explanation of using dynamic programming (which has a reputation as a technique you learn in school, then only use to pass interviews at software companies) to implement intelligent image resizing.
  3. How to Facilitate Q&As (Eve Tuck) -- People don’t always bring their best selves to the Q&A—people can act out their own discomfort about the approach or the topic of the talk. We need to do better. I believe in heavily mediated Q&A sessions.
  4. Project Oak -- a specification and a reference implementation for the secure transfer, storage, and processing of data in distributed systems. From Google.

One simple chart: Who is interested in Spark NLP?

O'Reilly Radar - Thu, 2019/06/27 - 04:10

As we close in on its two-year anniversary, Spark NLP is proving itself a viable option for enterprise use.

In July 2016, I broached the idea for an NLP library aimed at Apache Spark users to my friend David Talby. A little over a year later, Talby and his collaborators announced the release of Spark NLP. They described the motivation behind the project in their announcement post and in this accompanying podcast that Talby and I recorded, as well as in this recent post comparing popular open source NLP libraries. [Full disclosure: I’m an advisor to Databricks, the startup founded by the team that originated Apache Spark.]

As we close in on the two-year anniversary of the project, I asked Talby where interest in the project has come from, and he graciously shared geo-demographic data of visitors to the project’s homepage:

Figure 1. Spark NLP geo-demographic data of visitors. Slide by Ben Lorica, data courtesy of David Talby.

Of the thousands of visitors to the site, 44% are from the Americas, 24% from Asia-Pacific, and 22% are based in the EMEA region.

Many of these site visitors are turning into users of the project. In our recent survey AI Adoption in the Enterprise, quite a few respondents signalled that they were giving Spark NLP a try. The project also garnered top prize—based on a tally of votes cast by Strata Data Conference attendees—in the open source category at the Strata Data awards in March.

There are many other excellent open source NLP libraries with significant numbers of users—spaCy, OpenNLP, Stanford CoreNLP, NLTK—but at the time the project started, there seemed to be an opportunity for a library that appealed to users who already had Spark clusters (and needed a scalable solution). While the project started out targeting Apache Spark users, it has evolved to provide simple APIs that get things done in a few lines of code and fully hide Spark under the hood. The library’s Python API now has the most users. Installing Spark NLP is a one-liner using pip or conda for Python, or a single package pull in Java or Scala using Maven, sbt, or spark-packages. The library’s documentation has also grown, and there are public online examples for common tasks like sentiment analysis, named entity recognition, and spell checking. Improvements in documentation, ease of use, and its production-ready implementation of key deep learning models, combined with speed, scalability, and accuracy, have made Spark NLP a viable option for enterprises needing an NLP library.
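
As a rough sketch of the one-liner installs mentioned above (package and channel names are assumptions based on the project's public distribution; consult the Spark NLP documentation for current versions and coordinates):

```shell
# Python: install Spark NLP alongside PySpark via pip
pip install spark-nlp pyspark

# or via conda (channel name assumed)
conda install -c johnsnowlabs spark-nlp
```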

For more on Spark NLP, join Talby and his fellow instructors for a three-hour tutorial, Natural language understanding at scale with Spark NLP, at the Strata Data Conference in New York City, September 23-26, 2019.

Four short links: 27 June 2019

O'Reilly Radar - Thu, 2019/06/27 - 03:55

Security Mnemonics, Evidence Might Work, Misinformation Inoculation, and Spoofing Presidential Alerts

  1. STRIDE -- mnemonic for remembering the different types of threats: Spoofing of user identity; Tampering; Repudiation; Information disclosure (privacy breach or data leak); Denial of service (DoS); Elevation of privilege. Use when you're asking yourself, "what could possibly go wrong?" There's probably a parallel "how things can be misused" mnemonic like Nazis, Anti-Vaxx, Spam, Threats, and Your Ex- Follows You.
  2. Backfire Effect is Mostly a Myth (Nieman Lab) -- some evidence that giving people evidence that shows they're wrong can change their mind. Perhaps you no longer have to be careful to whom you show this story. Full Fact research manager Amy Sippett reviewed seven studies that have explored the backfire effect and found that “cases where backfire effects were found tended to be particularly contentious topics, or where the factual claim being asked about was ambiguous.” The studies where a backfire effect was not found also tended to be larger than the studies where it was found. Full Fact cautions that most of the research on the backfire effect has been done in the U.S., and “we still need more evidence to understand how fact-checking content can be most effective.”
  3. Bad News -- a browser game by Cambridge University researchers that seems to inoculate users against misinformation. We conducted a large-scale evaluation of the game with N = 15,000 participants in a pre-post gameplay design. We provide initial evidence that people’s ability to spot and resist misinformation improves after gameplay, irrespective of education, age, political ideology, and cognitive style. (via Cambridge University)
  4. Spoofing Presidential Alerts -- Their research showed that four low cost USRP or bladeRF TX capable software defined radios (SDR) with 1 watt output power each, combined with open source LTE base station software could be used to send a fake Presidential Alert to a stadium of 50,000 people (note that this was only simulated—real-world tests were performed responsibly in a controlled environment). The attack works by creating a fake and malicious LTE cell tower on the SDR that nearby cell phones connect to. Once connected an alert can easily be crafted and sent to all connected phones. There is no way to verify that an alert is legitimate. The article itself is paywalled, though Sci-Hub knows how to reach it.

Four short links: 26 June 2019

O'Reilly Radar - Wed, 2019/06/26 - 04:15

Ethics and OKRs, Rewriting Binaries, Diversity of Implementation, and Uber's Metrics Systems

  1. Ethical Principles and OKRs -- Your KPIs can’t conflict with your principles if you don’t have principles. (So, start by defining your principles, then consider your principles before optimizing a KPI, monitor user experience to see if you're compromising your principles, and repeat.) (via Peter Skomoroch)
  2. RetroWrite -- Retrofitting compiler passes through binary rewriting. Paper. The ideal solution for binary security analysis would be a static rewriter that can intelligently add the required instrumentation as if it were inserted at compile time. Such instrumentation requires an analysis to statically disambiguate between references and scalars, a problem known to be undecidable in the general case. We show that recovering this information is possible in practice for the most common class of software and libraries: 64-bit, position independent code. (via Mathias Payer)
  3. Re: A libc in LLVM -- very thoughtful post from a libc maintainer about the risks if Google implements an LLVM libc. Avoiding monoculture preserves the motivation for consensus-based standards processes rather than single-party control (see also: Chrome and what it's done to the web) and the motivation for people writing software to write to the standards rather than to a particular implementation.
  4. M3 and M3DB -- M3, a metrics platform, and M3DB, a distributed time series database, were developed at Uber out of necessity. After using what was available as open source and finding we were unable to use them at our scale due to issues with their reliability, cost, and operationally intensive nature, we built our own metrics platform piece by piece. We used our experience to help us build a native distributed time series database, a highly dynamic and performant aggregation service, query engine, and other supporting infrastructure.

AI and machine learning will require retraining your entire organization

O'Reilly Radar - Wed, 2019/06/26 - 04:00

To successfully integrate AI and machine learning technologies, companies need to take a more holistic approach toward training their workforce.

In our recent surveys AI Adoption in the Enterprise and Machine Learning Adoption in the Enterprise, we found growing interest in AI technologies among companies across a variety of industries and geographic locations. Our findings align with other surveys and studies—in fact, a recent study by the World Intellectual Property Organization (WIPO) found that the surge in research in AI and machine learning (ML) has been accompanied by even stronger growth in AI-related patent applications. Patents are one sign that companies are beginning to take these technologies very seriously.

Figure 1. A 2019 WIPO Study shows an over six-fold increase in AI patent publications from 2006 to 2017. Image source: Ben Lorica.

When we asked what held back their adoption of AI technologies, respondents cited a few reasons, including some that pertained to culture, organization, and skills:

  • [23%] Company culture does not yet recognize needs for AI
  • [18%] Lack of skilled people / difficulty hiring the required roles
  • [17%] Difficulties in identifying appropriate business use cases

Implementing and incorporating AI and machine learning technologies will require retraining across an organization, not just technical teams. Recall that the rise of big data and data science necessitated a certain amount of retraining across an entire organization: technologists and analysts needed to familiarize themselves with new tools and architectures, but business experts and managers also needed to reorient their workflows to adjust to data-driven processes and data-intensive systems. AI and machine learning will require a similar holistic approach to training. Here are a few reasons why:

  • As noted from our survey, identifying appropriate business use cases remains an ongoing challenge. Domain experts and business owners need to develop an understanding of these technologies in order to be able to highlight areas where they are likely to make an impact within a company.
  • Members of an organization will need to understand—even at a high level—the current state of AI and ML technologies so they know the strengths and limitations of these new tools. For instance, in the case of robotic process automation (RPA), it’s really the people closest to tasks (“bottom up”) who can best identify areas where it is most suitable.
  • AI and machine learning depend on data (usually labeled training data for machine learning models), and in many instances, a certain amount of domain knowledge will be needed to assemble high-quality data.
  • Machine learning and AI involve end-to-end pipelines, so development/testing/integration will often cut across technical roles and technical teams.
  • AI and machine learning applications and solutions often interact with (or augment) users and domain experts, so UX/design remains critical.
  • Security, privacy, ethics, and other risk and compliance issues will increasingly require that companies set up cross-functional teams when they build AI and machine learning systems and products.

At our upcoming Artificial Intelligence conferences in San Jose and London, we have assembled a roster of two-day training sessions, tutorials, and presentations to help individuals (across job roles and functions) sharpen their skills and understanding of AI and machine learning. We return to San Jose with a two-day Business Summit designed specifically for executives, business leaders, and strategists. This Business Summit includes a popular two-day training—AI for Managers—and tutorials—Bringing AI into the enterprise and Design Thinking for AI—along with 12 executive briefings designed to provide in-depth overviews of important topics in AI. We are also debuting a new half-day tutorial that will be taught by Ira Cohen (Product management in the Machine Learning era), which, given the growing importance of AI and ML, is one that every manager should consider attending.

We will also have our usual strong slate of technical training, tutorials, and talks. Here are some two-day training sessions and tutorials that I am excited about:

AI and ML are going to impact and permeate most aspects of a company’s operations, products, and services. To succeed in implementing and incorporating AI and machine learning technologies, companies need to take a more holistic approach toward retraining their workforces. This will be an ongoing endeavor as research results continue to be translated into practical systems that companies can use. Individuals will need to continue to learn new skills as technologies continue to evolve and because many areas of AI and ML are increasingly becoming democratized.

Related training and tutorial links:

Continue reading AI and machine learning will require retraining your entire organization.

Categories: Technology

Four short links: 25 June 2019

O'Reilly Radar - Tue, 2019/06/25 - 03:50

Analog Deep Learning, Low-Trust Internet, Media Literacy, and Psych Experiments

  1. The Next Generation of Deep Learning: Analog Computing (IEEE) -- Further progress in compute efficiency for deep learning training can be made by exploiting the more random and approximate nature of deep learning work flows. In the digital space that means to trade off numerical precision for accuracy at the benefit of compute efficiency. It also opens the possibility to revisit analog computing, which is intrinsically noisy, to execute the matrix operations for deep learning in constant time on arrays of nonvolatile memories. (Paywalled paper)
  2. The Internet is Increasingly a Low-Trust Society (Wired) -- Zeynep Tufecki nails it. Social scientists distinguish high-trust societies (ones where you can expect most interactions to work) from low-trust societies (ones where you have to be on your guard at all times). People break rules in high-trust societies, of course, but laws, regulations, and norms help to keep most abuses in check; if you have to go to court, you expect a reasonable process. In low-trust societies, you never know. You expect to be cheated, often without recourse. You expect things not to be what they seem and for promises to be broken, and you don’t expect a reasonable and transparent process for recourse. It’s harder for markets to function and economies to develop in low-trust societies. It’s harder to find or extend credit, and it’s risky to pay in advance.
  3. Be Internet Awesome -- Google's media literacy materials. Be Internet Awesome is like an instruction manual for making smart decisions online. Kids today need a guide to the internet and media just as they need instruction on other topics. We need help teaching them about credible sources, the power of words and images, and more importantly, how to be smart and savvy when seeing different media while browsing the web. All of these resources are not only available for classrooms, but also free and easily accessible for families as well. They’re in both English and in Spanish, along with eight other languages. (via Google Blog)
  4. PsyToolkit -- create and run cognitive psychological experiments in your browser.

Continue reading Four short links: 25 June 2019.

Categories: Technology

Four short links: 24 June 2019

O'Reilly Radar - Mon, 2019/06/24 - 03:45

Wacky Timestamps, Computers and Spies, Surveillance Capitalism, and Twitter Adventures

  1. NTFS Timestamps -- a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601 (UTC). WTAF?
  2. Computers Changed Spycraft (Foreign Policy) -- so much has changed—eg., dead letter drops: It is easy for Russian counterintelligence to track the movements of every mobile phone in Moscow, so if the Canadian is carrying her device, observers can match her movements with any location that looks like a potential site for a dead drop. They could then look at any other phone signal that pings in the same location in the same time window. If the visitor turns out to be a Russian government official, he or she will have some explaining to do.
  3. Netflix Records All of your Bandersnatch Choices, GDPR Request Reveals (Verge) -- that's some next-level meta.
  4. Being Beyoncé’s Assistant for the Day (Twitter) -- a choose-your-own-adventure implemented in Twitter. GENIUS!

Continue reading Four short links: 24 June 2019.

Categories: Technology

Four short links: 21 June 2019

O'Reilly Radar - Fri, 2019/06/21 - 03:45

Private Computation, Robot Framework, 3D Objects, and Self-Supervised Learning

  1. Private Join and Compute (Google) -- This functionality allows two users, each holding an input file, to privately compute the sum of associated values for records that have common identifiers. (via Wired)
  2. PyRobot -- from CMU and Facebook. PyRobot is a framework and ecosystem that enables AI researchers and students to get up and running with a robot in just a few hours, without specialized knowledge of the hardware or of details such as device drivers, control, and planning.
  3. PartNet -- a consistent, large-scale data set of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. Our data set consists of 573,585 part instances over 26,671 3D models covering 24 object categories. This data set enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, and others. (via IEEE Spectrum)
  4. Self-Supervised Learning (Andrew Zisserman) -- 122 slides, very readable, about learning from images, from video, and from video with sound.

Continue reading Four short links: 21 June 2019.

Categories: Technology

Four short links: 20 June 2019

O'Reilly Radar - Thu, 2019/06/20 - 14:55

Model Governance, Content Moderators, Interactive Fiction, and End-User Probabilistic Programming

  1. Model Governance and Model Operations -- models built or tuned for specific applications (in reality, this means models + data) will need to be managed and protected.
  2. Bodies in Seats -- the story of Facebook's 30,000 content moderators: contractors, low pay (as little as $28,800 a year), and a lot of PTSD for everyone. "Nobody’s prepared to see a little girl have her organs taken out while she’s still alive and screaming.” Moderators were told they had to watch at least 15 to 30 seconds of each video.
  3. Dialog -- a domain-specific language for creating works of interactive fiction. Inspired by Inform and Prolog, they say.
  4. End-User Probabilistic Programming -- We examine the sources of uncertainty actually encountered by spreadsheet users, and their coping mechanisms, via an interview study. We examine spreadsheet-based interfaces and technology to help reason under uncertainty, via probabilistic and other means. We show how uncertain values can propagate uncertainty through spreadsheets, and how sheet-defined functions can be applied to handle uncertainty. Hence, we draw conclusions about the promise and limitations of probabilistic programming for end-users.

Continue reading Four short links: 20 June 2019.

Categories: Technology

Enabling end-to-end machine learning pipelines in real-world applications

O'Reilly Radar - Thu, 2019/06/20 - 04:50

The O’Reilly Data Show Podcast: Nick Pentreath on overcoming challenges in productionizing machine learning models.

In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines.

Continue reading Enabling end-to-end machine learning pipelines in real-world applications.

Categories: Technology

Four short links: 19 June 2019

O'Reilly Radar - Wed, 2019/06/19 - 04:05

Voice2Face, DIY Minivac, Cloud Metrics, and Envoy for Mobile

  1. Speech2Face: Learning the Face Behind a Voice -- complete with an interesting ethics discussion up front. I wonder where this was intended to go: after all, it can't perfectly reconstruct faces, so what you get is a stereotype based on the voice. Meh.
  2. Minivac 601 Replica (Instructables) -- Created by information theory pioneer Claude Shannon as an educational toy for teaching digital circuits, the Minivac 601 Digital Computer Kit was billed as an electromechanical digital computer system.
  3. Nines Are Not Enough: Meaningful Metrics for Clouds -- We show that this problem shares some similarities with the challenges of applying statistics to make decisions based on sampled data. We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems.
  4. Announcing Envoy Mobile (Lyft Engineering) -- as Simon Willison said: Lyft’s Envoy proxy / service mesh has been widely adopted across the industry as a server-side component for adding smart routing and observability to the network calls made between services in microservice architectures. “The reality is that three 9s at the server-side edge is meaningless if the user of a mobile application is only able to complete the desired product flows a fraction of the time”—so Lyft is building a C++ embedded library companion to Envoy which is designed to be shipped as part of iOS and Android client applications. “Envoy Mobile in conjunction with Envoy in the data center will provide the ability to reason about the entire distributed system network, not just the server-side portion.” Their decision to release an early working prototype and then conduct ongoing development entirely in the open is interesting, too.

Continue reading Four short links: 19 June 2019.

Categories: Technology

What are model governance and model operations?

O'Reilly Radar - Wed, 2019/06/19 - 04:00

A look at the landscape of tools for building and deploying robust, production-ready machine learning models.

Our surveys over the past couple of years have shown growing interest in machine learning (ML) among organizations from diverse industries. A few factors are contributing to this strong interest in implementing ML in products and services. First, the machine learning community has conducted groundbreaking research in many areas of interest to companies, and much of this research has been conducted out in the open via preprints and conference presentations. We are also beginning to see researchers share sample code written in popular open source libraries, and some even share pre-trained models. Organizations now also have more use cases and case studies from which to draw inspiration—no matter what industry or domain you are interested in, chances are there are many interesting ML applications you can learn from. Finally, modeling tools are improving, and automation is beginning to allow new users to tackle problems that used to be the province of experts.

With the shift toward the implementation of machine learning, it’s natural to expect improvement in tools targeted at helping companies with ML. In previous posts, we’ve outlined the foundational technologies needed to sustain machine learning within an organization, and there are early signs that tools for model development and model governance are beginning to gain users.

Figure 1. A collection of tools that focus primarily on aspects of model development, governance, and operations. Source: Ben Lorica.

Model development

One sure sign that companies are getting serious about machine learning is the growing popularity of tools designed specifically for managing the ML model development lifecycle, such as MLflow. Why aren’t traditional software tools sufficient? In a previous post, we noted some key attributes that distinguish a machine learning project:

  • Unlike traditional software where the goal is to meet a functional specification, in ML the goal is to optimize a metric.
  • Quality depends not just on code, but also on data, tuning, regular updates, and retraining.
  • Those involved with ML usually want to experiment with new libraries, algorithms, and data sources—and thus, one must be able to put those new components into production.

The growth in adoption of tools like MLflow indicates that new tools are in fact very much needed. These ML development tools are designed specifically to help teams of developers, machine learning engineers, and data scientists collaborate on, manage, and reproduce ML experiments. Many tools in this category let users systematically conduct modeling experiments (e.g., hyperparameter tuning, NAS) while emphasizing the ease with which one can manage, track, and reproduce such experiments.
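To make the idea concrete, here is a minimal sketch of what experiment-tracking tools automate: recording each run's parameters and metrics, and making runs comparable and reproducible. The class and method names below are illustrative, not the API of MLflow or any other tool mentioned here.

```python
import hashlib
import json

class ExperimentTracker:
    """Toy in-memory stand-in for an ML experiment tracker."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, code_version):
        # Hash the full configuration so identical runs can be detected
        # and any run can be reproduced from its recorded inputs.
        config = json.dumps({"params": params, "code": code_version}, sort_keys=True)
        run_id = hashlib.sha256(config.encode()).hexdigest()[:12]
        self.runs.append({"id": run_id, "params": params, "metrics": metrics})
        return run_id

    def best_run(self, metric):
        # Compare runs on a single recorded metric.
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.1, "depth": 4}, {"auc": 0.81}, code_version="abc123")
tracker.log_run({"lr": 0.01, "depth": 8}, {"auc": 0.86}, code_version="abc123")
print(tracker.best_run("auc")["params"])  # {'lr': 0.01, 'depth': 8}
```

Real trackers add artifact storage, UIs for comparing runs, and integrations with training libraries, but the core bookkeeping is this simple.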

Model governance

We are also beginning to come across companies that acknowledge the need for model governance tools and capabilities. Just as companies have long treated data as assets, as ML becomes more central to an organization’s operations, models will be treated as important assets. More precisely, models built or tuned for specific applications (in reality, this means models + data) will need to be managed and protected:

  • A database for authorization and security: who has read/write access to certain models
  • A catalog or a database that lists models, including when they were tested, trained, and deployed
  • A catalog of validation data sets and the accuracy measurements of stored models
  • Versioning (of models, feature vectors, data) and the ability to roll out, roll back, or have multiple live versions
  • Metadata and artifacts needed for a full audit trail
  • Who approved and pushed the model out to production, who is able to monitor its performance and receive alerts, and who is responsible for it
  • A dashboard that provides custom views for all principals (operations, ML engineers, data scientists, business owners)
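Several of the items above (a model catalog, versioning, roll-out/roll-back, and an audit trail) can be sketched in a few dozen lines. This is a deliberately simplified illustration, not a production design; real governance systems add access control, validation-set lineage, and dashboards.

```python
from datetime import datetime, timezone

class ModelRegistry:
    """Toy model catalog with versioning, promotion, rollback, and an audit trail."""

    def __init__(self):
        self.catalog = {}   # model name -> list of version entries
        self.audit_log = []  # (name, version, action, actor) tuples

    def register(self, name, accuracy, approved_by):
        versions = self.catalog.setdefault(name, [])
        entry = {
            "version": len(versions) + 1,
            "accuracy": accuracy,
            "approved_by": approved_by,
            "registered_at": datetime.now(timezone.utc).isoformat(),
            "live": False,
        }
        versions.append(entry)
        self.audit_log.append((name, entry["version"], "registered", approved_by))
        return entry["version"]

    def promote(self, name, version, actor):
        # Exactly one version of a model is live at a time.
        for entry in self.catalog[name]:
            entry["live"] = entry["version"] == version
        self.audit_log.append((name, version, "promoted", actor))

    def rollback(self, name, actor):
        # Revert to the previous version, keeping the full audit trail.
        live = next(e["version"] for e in self.catalog[name] if e["live"])
        self.promote(name, live - 1, actor)
        self.audit_log.append((name, live - 1, "rollback", actor))

registry = ModelRegistry()
registry.register("churn", accuracy=0.88, approved_by="alice")
registry.register("churn", accuracy=0.91, approved_by="bob")
registry.promote("churn", 2, actor="ops")
registry.rollback("churn", actor="ops")  # version 1 is live again, all actions logged
```

The point of the sketch is that governance is mostly disciplined metadata management: every action is attributable, reversible, and queryable.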

Model operations, testing, and monitoring

As machine learning proliferates in products and services, we need a set of roles, best practices, and tools to deploy, manage, test, and monitor ML in real-world production settings. There are some initial tools aimed at model operations and testing—mainly for deploying and monitoring ML models—but it’s clear we are still in the early stages for solutions in these areas.

There are three common issues that diminish the value of ML models once they’re in production. The first is concept drift: the accuracy of models in production degrades over time as the real world changes, creating a growing disparity between the data the models were trained on and the data they now score. The second is locality: when deploying models to new geographic locations, user demographics, or business customers, it’s often not the case that pre-trained models work at the expected level of accuracy. Measuring online accuracy per customer / geography / demographic group is important both to monitor bias and to ensure accuracy for a growing customer base. The third is data quality: since ML models are more sensitive to the semantics of incoming data, changes in data distribution that are often missed by traditional data quality tools wreak havoc on models’ accuracy.
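One widely used way to detect the distribution shift behind concept drift is the population stability index (PSI), which compares a feature's training-time histogram against live traffic. The implementation and the 0.2 alert threshold below are a common rule of thumb, not a universal standard; tune both for your data.

```python
import math

def population_stability_index(expected, actual, bins=4):
    """PSI between a training-time distribution and live traffic.
    Rule of thumb: PSI > 0.2 signals meaningful drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = float("inf")  # catch live values above the training max

    def histogram(values):
        counts = [0] * bins
        for v in values:
            for i in range(bins):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-4) for c in counts]

    p, q = histogram(expected), histogram(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

train = [0.1 * i for i in range(100)]           # training distribution
live_ok = [0.1 * i + 0.05 for i in range(100)]  # similar live traffic
live_drift = [0.2 * i for i in range(100)]      # shifted live traffic
print(population_stability_index(train, live_ok) < 0.2)     # True: no alert
print(population_stability_index(train, live_drift) > 0.2)  # True: drift alert
```

In a production monitor, a check like this would run per feature and per customer segment, feeding the alerting and dashboards discussed above.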

Beyond the need to monitor that your current deployed models operate as intended, another challenge is knowing that a newly proposed model actually delivers better performance in production. Some early systems allow for the comparison of an “incumbent model” against “challenger models,” including having challengers in “dark launch” or “offline” mode (this means challenger models are evaluated on production traffic but haven’t been deployed to production). Other noteworthy items include:

  • Tools for continuous integration and continuous testing of models. A model is not “correct” if it returns a valid value—it has to meet an accuracy bar. There needs to be a way to validate this against a given metric and validation set before deploying a model.
  • Online measurement of the accuracy of each model (what’s the accuracy that users are experiencing “in the field”?). Related to this is the need to monitor bias, locality effects, and related risks. For example, scores often need to be broken down by demographics (are men and women getting similar accuracy?) or locales (are German and Spanish users getting similar accuracy?).
  • The ability to manage the quality of service for model inference to different customers, including rate limiting, request size limiting, metering, bot detection, and IP geo-fencing.
  • Ability to scale (and auto-scale), secure, monitor, and troubleshoot live models. Scaling has two dimensions—the size of the traffic hitting the models and the number of models that need to be evaluated.
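The "dark launch" pattern for challenger models is easy to sketch: both models score every request, but only the incumbent's answer is served, so the challenger accumulates an online accuracy record at zero user-facing risk. The models and data below are hypothetical stand-ins.

```python
import random

def shadow_evaluate(incumbent, challenger, traffic, labels):
    """Score a challenger on live traffic without serving its predictions."""
    served, inc_correct, ch_correct = [], 0, 0
    for x, y in zip(traffic, labels):
        inc_pred = incumbent(x)
        ch_pred = challenger(x)   # logged for evaluation, never shown to the user
        served.append(inc_pred)   # users only ever see the incumbent's answer
        inc_correct += inc_pred == y
        ch_correct += ch_pred == y
    n = len(traffic)
    return served, inc_correct / n, ch_correct / n

# Hypothetical models: classify whether a number is "large" (thresholds differ).
def incumbent(x):
    return x > 60

def challenger(x):
    return x > 50

random.seed(0)
traffic = [random.randint(0, 100) for _ in range(1000)]
labels = [x > 50 for x in traffic]  # ground truth arrives later (e.g., user feedback)

served, inc_acc, ch_acc = shadow_evaluate(incumbent, challenger, traffic, labels)
print(ch_acc > inc_acc)  # the challenger wins on live traffic, safely measured offline
```

Once the challenger's online accuracy clears the incumbent's (overall and per segment), promotion becomes a data-backed decision rather than a leap of faith.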

Model operations and testing is very much still a nascent field where systematic checklists are just beginning to be assembled. An overview from a 2017 paper from Google lets us gauge how much tooling is still needed for model operations and testing. This paper came with a 28-item checklist that detailed things that need to be accounted for in order to have a reliable, production-grade machine learning system:

  • Features and data: seven items that include checks for privacy controls, feature validation, exploring the necessity and cost of a feature, and other data-related tests.
  • Tests for model development: seven sanity checks, including checking whether a simpler model will suffice, model performance on critical data slices (e.g., region, age, recency, frequency, etc.), the impact of model staleness, and other important considerations.
  • Infrastructure tests: a suite of seven considerations, including the reproducibility of model training, the ease with which models can be rolled back, integration tests on end-to-end model pipelines, and model tests via a canary process.
  • Monitoring: the authors list a series of seven items to ensure models are working as expected. This includes tests for model staleness, performance metrics (training, inference, throughput), validating that training and serving code generate similar values, and other essential items.

New roles

Discussions around machine learning tend to revolve around the work of data scientists and model building experts. This is beginning to change now that many companies are entering the implementation phase for their ML initiatives. Machine learning engineers, data engineers, developers, and domain experts are critical to the success of ML projects. At the moment, few (if any) teams have checklists as extensive as the one detailed in the 2017 paper from Google. The task of building real-world production-grade ML models still requires stitching together tools and teams that cut across many functional areas. However, as tools for model governance and model operations and testing begin to get refined and become more widely available, it’s likely that specialists (an “ML ops team”) will be tasked to use such tools. Automation will also be an important component, as these tools will need to enable organizations to build, manage, and monitor many more machine learning models.

Figure 2. Demand for tools for managing ML in the enterprise. Source: Ben Lorica, using data from a Twitter poll.

We are beginning to see specialized tools that allow teams to manage the ML model development lifecycle. Tools like MLflow are being used to track and manage machine learning experiments (mainly offline, using test data). There are also new tools that cover aspects of governance, production deployment, serving, and monitoring, but at the moment they tend to focus on single ML libraries (TFX) or modeling tools (SAS Model Manager). The reality is, enterprises will want flexibility in the libraries, modeling tools, and environments they use. Fortunately, startups and companies are beginning to build comprehensive tools for enabling ML in the enterprise.

Related content:

Continue reading What are model governance and model operations?.

Categories: Technology

Four short links: 18 June 2019

O'Reilly Radar - Tue, 2019/06/18 - 04:50

JavaScript Spreadsheets, Pessimism, Privacy Policies, and AI Ethics

  1. jExcel -- a lightweight vanilla JavaScript plugin to create amazing web-based interactive tables and spreadsheets compatible with Excel or any other spreadsheet software. You can create an online spreadsheet table from a JS array, JSON, CSV, or XLSX files. You can copy from Excel and paste straight to your jExcel spreadsheet and vice versa. It is very easy to integrate any third-party JavaScript plugins to create your own custom columns, custom editors, and customize any feature into your application.
  2. Why Are We So Pessimistic? (Brookings) -- The belief or perception that things are much worse than they really are is widespread, and I believe it comes with significant detrimental impacts on societies.
  3. We Read 150 Privacy Policies. They Were an Incomprehensible Disaster (NYT) -- Only Immanuel Kant’s famously difficult “Critique of Pure Reason” registers a more challenging readability score than Facebook’s privacy policy.
  4. Perspectives and Approaches in AI Ethics: East Asia -- Each country’s perspectives on and approaches to AI and robots on the tool-partner spectrum are evaluated by examining its policy, academic thought, local practices, and popular culture. This analysis places South Korea in the tool range, China in the middle of the spectrum, and Japan in the partner range.

Continue reading Four short links: 18 June 2019.

Categories: Technology

The quest for high-quality data

O'Reilly Radar - Tue, 2019/06/18 - 04:00

Machine learning solutions for data integration, cleaning, and data generation are beginning to emerge.

“AI starts with ‘good’ data” is a statement that receives wide agreement from data scientists, analysts, and business owners. There has been a significant increase in our ability to build complex AI models for predictions, classifications, and various analytics tasks, and there’s an abundance of (fairly easy-to-use) tools that allow data scientists and analysts to provision complex models within days. As model building becomes easier, the problem of high-quality data becomes more evident than ever. A recent O’Reilly survey found that those with mature AI practices (as measured by how long they’ve had models in production) cited “Lack of data or data quality issues” as the main bottleneck holding back further adoption of AI technologies.

Even with advances in building robust models, the reality is that noisy data and incomplete data remain the biggest hurdles to effective end-to-end solutions. The problem is even more magnified in the case of structured enterprise data. These data sets are often siloed, incomplete, and extremely sparse. Moreover, the domain knowledge, which often is not encoded in the data (nor fully documented), is an integral part of this data (see this article from Forbes). If you also add scale to the sparsity and the need for domain knowledge, you have the perfect storm of data quality issues.

In this post, we shed some light on various efforts toward generating data for machine learning (ML) models. In general, there are two main lines of work toward that goal: (1) clean the data you have, and (2) generate more data to help train needed models. Both directions have seen new advances in using ML models effectively, building on multiple new results from academia.

Data integration and cleaning

One of the biggest pitfalls in dealing with data quality is to treat all data problems the same. Academic research has been more deliberate in describing the different classes of data quality problems. We see two main classes of problems, which have varying degrees of complexity, and often mandate different approaches and tools to solve them. Since they consume a significant amount of time spent on most data science projects, we highlight these two main classes of data quality problems in this post:

  1. Data unification and integration
  2. Error detection and automatic repairing/imputation

Data unification and integration

Even with the rise of open source tools for large-scale ingestion, messaging, queuing, and stream processing, siloed data sets trapped behind the bars of various business units remain the normal state of affairs in any large enterprise. Data unification or integration refers to the set of activities that bring this data together into one unified data context. Schema matching and mapping, record linkage and deduplication, and various mastering activities are the types of tasks a data integration solution performs. Advances in ML offer a scalable and efficient way to replace legacy top-down, rule-based systems, which often result in massive costs and very low success in today’s big data settings. Bottom-up solutions with human-guided ML pipelines (such as Tamr, Paxata, or Informatica—full disclosure: Ihab Ilyas is co-founder of Tamr) show how to leverage the available rules and human expertise to train scalable integration models that work on thousands of sources and large volumes of data. We discussed some of the challenges and enablers in using ML for this class of problems in an earlier post.

The class of data unification problems has its own characteristics in terms of solution complexity: (1) the problem is often quadratic in the size of the input (since we need to compare everything to everything else), and (2) the core ML task is fairly well understood: determining whether two “things” are the same. These characteristics have a considerable impact on the design of the solution. For example, a complex sophisticated model for finding duplicates or matching schema is the least of our worries if we cannot even enumerate all possible pairs that need to be checked. Effective solutions for data unification problems tend to be a serious engineering effort to: (1) prune the space of possible candidates; (2) interact effectively with experts to provide training data and validate the machine decisions; and (3) keep rich lineage and provenance to track decisions back for auditing, revising, or reusing for future use cases. Due to the nature of the ML task (mainly Boolean classification here), and the richness of structure, most successful models tend to be the good old “shallow” models, such as random forest, with the help of simple language models (to help with string data). See this article on data integration status for details.
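The standard trick for taming the quadratic blow-up is blocking: only records sharing a cheap key are compared. Here is a minimal sketch; the blocking key, similarity function, and 0.3 threshold are illustrative toy choices, not what Tamr or any commercial pipeline actually uses (those learn many such features with a model like a random forest).

```python
from itertools import combinations

records = [
    {"id": 1, "name": "Acme Corp", "city": "Boston"},
    {"id": 2, "name": "ACME Corporation", "city": "Boston"},
    {"id": 3, "name": "Globex Inc", "city": "Springfield"},
    {"id": 4, "name": "Globex Incorporated", "city": "Springfield"},
    {"id": 5, "name": "Initech", "city": "Austin"},
]

def blocking_key(rec):
    # Cheap key: first 3 letters of the name plus the city. Only records
    # sharing a key are compared, pruning the quadratic candidate space.
    return rec["name"][:3].lower() + "|" + rec["city"].lower()

def similarity(a, b):
    # Jaccard overlap on lowercased name tokens.
    ta = set(a["name"].lower().split())
    tb = set(b["name"].lower().split())
    return len(ta & tb) / len(ta | tb)

blocks = {}
for rec in records:
    blocks.setdefault(blocking_key(rec), []).append(rec)

candidates = [pair for block in blocks.values() for pair in combinations(block, 2)]
matches = [(a["id"], b["id"]) for a, b in candidates if similarity(a, b) >= 0.3]

print(len(candidates))  # 2 pairs compared, instead of all 10 possible pairs
print(matches)          # [(1, 2), (3, 4)]
```

On five records the savings are modest; on millions of records across thousands of sources, blocking is the difference between feasible and impossible.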

Error detection, repairing and value imputation

Siloed or integrated data is often noisy, missing, and sometimes even has contradicting facts. Data cleaning is the class of data quality efforts that focuses on spotting and (hopefully) repairing such errors. Like data integration, data cleaning has often been carried out through labor-intensive manual work or ad hoc, rule-based point solutions. However, this class has different complexities and characteristics that affect the design of the solution: the core ML task is often far more complex than a matching task, and requires building models that understand “how data was generated” and “how errors were introduced” to be able to reverse that process to spot and repair errors.

While data cleaning has long been a research topic in academia, it often has been looked at as a theoretical logic problem. This probably explains why none of the solutions have been adopted in industry. The good news is that researchers from academia recently managed to leverage that large body of work and combine it with the power of scalable statistical inference for data cleaning. The open source HoloClean probabilistic cleaning framework is currently the state-of-the-art system for ML-based automatic error detection and repair. HoloClean adopts the well-known “noisy channel” model to explain how data was generated and how it was “polluted.” It then leverages all known domain knowledge (such as available rules), statistical information in the data, and available trusted sources to build complex data generation and error models. The models are then used to spot errors and suggest the “most probable” values to replace.

Paying attention to scale is a requirement cleaning and integration have in common: building such complex models involves “featurizing” the whole data set via a series of operations—for example, to compute violations of rules, count co-occurrences, or build language models. Hence, an ML cleaning solution needs to be innovative in how it avoids the complexity of these operations. HoloClean, for example, uses techniques to prune the domain of database cells and applies judicious relaxations to the underlying model to achieve the required scalability. Older research tools struggled with how to handle the various types of errors, and how to combine the heterogeneous quality input (e.g., business and quality rules, policies, statistical signals in the data, etc.). The HoloClean framework advances the state of the art in two fundamental ways: (1) combining the logical rules and the statistical distribution of the data into one coherent probabilistic model; and (2) scaling the learning and inference process via a series of system and model optimizations, which allowed it to be deployed in census organizations and large commercial enterprises.
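To see the flavor of combining a logical rule with statistics, consider a crude sketch: a functional dependency (zip code determines city) flags conflicting cells, and the most frequent co-occurring value is proposed as the "most probable" repair. This majority vote is a drastic simplification of HoloClean's probabilistic inference, included only to illustrate the idea; the data is hypothetical.

```python
from collections import Counter

# Toy table with a functional dependency: zip -> city.
rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Boston"},     # violates zip -> city: likely an error
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
]

def detect_and_repair(rows):
    """Flag cells that conflict with the zip -> city rule and propose the
    statistically most likely repair for each flagged cell."""
    by_zip = {}
    for r in rows:
        by_zip.setdefault(r["zip"], Counter())[r["city"]] += 1
    repairs = []
    for i, r in enumerate(rows):
        best, _ = by_zip[r["zip"]].most_common(1)[0]
        if r["city"] != best:
            repairs.append((i, r["city"], best))  # (row, observed, proposed)
    return repairs

print(detect_and_repair(rows))  # [(2, 'Boston', 'Cambridge')]
```

A real system weighs many signals at once (rules, co-occurrence statistics, trusted reference data) in one probabilistic model, rather than trusting any single rule's majority.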

Data programming

Increasing the quality of the available data via either unification or cleaning, or both, is definitely an important and a promising way forward to leverage enterprise data assets. However, the quest for more data is not over, for two main reasons:

  1. ML models for cleaning and unification often need training data and examples of possible errors or matching records. Depending completely on human labeling for these examples is simply a non-starter; as ML models get more complex and the underlying data sources get larger, the need for more data increases to a scale that human experts alone cannot supply.
  2. Even if we boosted the quality of the available data via unification and cleaning, it still might not be enough to power the even more complex analytics and prediction models (often built as deep learning models).

An important paradigm for solving both of these problems is data programming. In a nutshell, data programming techniques provide ways to “manufacture” data that we can feed to various learning and prediction tasks (even to ML data quality solutions). In practical terms, “data programming” unifies a class of techniques for the programmatic creation of training data sets. In this category of tools, frameworks like Snorkel allow developers and data scientists to focus on writing labeling functions that programmatically label data, and then model the noise in those labels to effectively train high-quality models. While the value of data programming for training high-quality analytics models is clear, we find it interesting how it is used internally in the ML models for data unification and cleaning mentioned earlier in this post. For example, tools like Tamr leverage legacy rules written by customers to generate a large amount of programmatically labeled data to power their matching ML pipelines. In a recent paper, the HoloClean project showed how to use “data augmentation” to generate many examples of possible errors (from a small seed) to power its automatic error detection model.
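The labeling-function idea can be sketched in a few lines. The record-matching heuristics and field names below are invented for illustration, and the naive majority vote stands in for the noise-aware label model that frameworks like Snorkel actually learn; the point is only that heuristics written as code, each allowed to abstain, can be combined into weak training labels at scale.

```python
from collections import Counter

ABSTAIN, NO_MATCH, MATCH = -1, 0, 1

# Each labeling function encodes one heuristic and may abstain.
def lf_same_email(pair):
    a, b = pair
    return MATCH if a["email"] == b["email"] else ABSTAIN

def lf_name_overlap(pair):
    a, b = pair
    shared = set(a["name"].lower().split()) & set(b["name"].lower().split())
    return MATCH if shared else NO_MATCH

def lf_different_domain(pair):
    a, b = pair
    domain = lambda r: r["email"].split("@")[-1]
    return NO_MATCH if domain(a) != domain(b) else ABSTAIN

LFS = [lf_same_email, lf_name_overlap, lf_different_domain]

def weak_label(pair):
    """Combine non-abstaining votes; a learned label model would weight them instead."""
    votes = [v for v in (lf(pair) for lf in LFS) if v != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

pair = ({"name": "Ada Lovelace", "email": "ada@example.com"},
        {"name": "A. Lovelace", "email": "ada@example.com"})
label = weak_label(pair)  # two heuristics vote MATCH, none disagree
```

Because labeling functions are just code, a legacy matching rule (as in the Tamr example above) can be dropped in as one more voter with no manual labeling effort.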

Market validation

The landscape of solutions we presented here for the quest for high-quality data has already been well validated in the market.

  • ML solutions for data unification such as Tamr and Informatica have been deployed at a large number of Fortune-1000 enterprises.
  • Automatic data cleaning solutions such as HoloClean have already been deployed by multiple financial services firms and by the census bureaus of various countries.
  • As the growing list of Snorkel users suggests, data programming solutions are beginning to change the way data scientists provision ML models.

As we get more mature in understanding the differences between the various problems of integration, cleaning, and automatic data generation, we will see real improvement in handling the valuable data assets in the enterprise.

Machine learning applications rely on three main components: models, data, and compute. A lot of articles are written about new breakthrough models, many of which are created by researchers who publish not only papers, but code written in popular open source libraries. In addition, recent advances in automated machine learning have resulted in many tools that can (partially) automate model selection and hyperparameter tuning. Thus, many cutting-edge models are now available to data scientists. Similarly, cloud platforms have made compute and hardware more accessible to developers.

Models are increasingly becoming commodities. As we noted in the survey results above, the reality is that a lack of high-quality training data remains the main bottleneck in most machine learning projects. We believe that machine learning engineers and data scientists will continue to spend most of their time creating and refining training data. Fortunately, help is on the way: as we’ve described in this post, we are finally beginning to see a class of technologies aimed squarely at the need for quality training data.

Continue reading The quest for high-quality data.


