Feed aggregator

Black box: How AI will amplify the best and worst of humanity

O'Reilly Radar - Thu, 2018/09/13 - 13:00

Jacob Ward reveals the relationship between the unconscious habits of our minds and the way that AI is poised to amplify them, alter them, maybe even reprogram them.

Continue reading Black box: How AI will amplify the best and worst of humanity.

Categories: Technology

Why you need customer development

O'Reilly Radar - Thu, 2018/09/13 - 04:40

We should invest at least as much time in understanding our customers as we do in optimizing our product development process.

It is humbling to see how bad experts are at estimating the value of features (us included). Despite our best efforts and pruning of ideas, most fail to show value when evaluated in controlled experiments.

The literature is filled with reports that success rates of ideas in the software industry are below 50%. Our experience at Microsoft is no different: only about a third of ideas improve the metrics they were designed to improve.

—Ronny Kohavi, Partner Architect at Microsoft

Nature hath given man one tongue but two ears, that we may hear from others twice as much as we speak.

—Epictetus

Customers are what make a product successful.

Without customers willing to buy, it doesn’t matter how good or innovative or beautiful or reasonably priced a product is: it will fail.

It makes no sense, then, that we spend most of our time and effort optimizing our product development process. What about customer development? Shouldn’t we invest at least as much time in understanding our customers, their needs and pain points, and how to deliver solutions to them?

Customer development is an approach for doing just that.

It’s a way to reduce your business risks by challenging your assumptions about who your customers are, what they need, and why and how they buy.

By applying the scientific method to learning about your customers, you can help confirm that you’re on track to a business model that works and a product that people want to buy.

Sounds great in theory, right?

But theory is useless if you can’t put it into practice. That’s why I’ve written this book—because I’ve worked with, mentored, and spoken to hundreds of companies who love the lean ideas and principles but struggle to make them work.

The First Challenge Is Inside the Building

Customer development is a big change for most organizations.

To many people, customer development sounds like saying, “Hey! You know that expertise that we’ve amassed over decades of experience, dozens of products, and millions of customers? Let’s shelve it and start from scratch.”

Of course that’s not what we’re saying. But as a pragmatist, I recognize that it’s difficult to correct a mistaken first impression. If your team doesn’t understand what customer development is and how it enhances (rather than replaces) your competencies, it’ll be far more difficult to get started.

Customer development is admittedly the new kid on the block. Everyone knows about the role of product development, marketing, customer support, and even user research in an organization. But customer development? You’re likely to encounter some skepticism.

Unless your team has been exposed to lean startup conferences or Steve Blank’s work, you may find yourself having to sell customer development to your organization before you can really get started.

This chapter takes a step back, explaining what customer development is (and isn’t), why you need it, and who can do it. It also offers responses to some common objections.

What Is Customer Development?

So let’s back up a minute and talk about definitions. What is customer development? What does it replace? What does it not replace?

The term customer development is meant to parallel product development. While everyone has a product development methodology, almost no one has a customer development methodology. And the truth is, if you don’t learn what customers really want, you’re at a very high risk of building something that no one wants to buy.

Customer development is a hypothesis-driven approach to understanding:1

  • Who your customers are

  • What problems and needs they have

  • How they are currently behaving

  • Which solutions customers will give you money for (even if the product is not built or completed yet)

  • How to provide solutions in a way that works with how your customers decide, procure, buy, and use

You probably have ideas or intuitions about all of these. Let’s identify what those really are: guesses. Let’s make it sound a bit better and call them hypotheses. Those hypotheses may be around forming a new company, building a new product, or even adding new features or capabilities to an existing product.

Everything you do in customer development is centered around testing hypotheses.

What Is Lean Customer Development?

You may have heard of customer development. So what’s the difference between “customer development” and “lean customer development”?

I call my approach to customer development “lean customer development.” I’m using “lean” as a synonym for pragmatic, approachable, and fast.

Lean customer development takes the heart of Steve Blank’s ideas and renders them into a simple process that works for both startups and established companies. It’s what I write about on my blog, speak about at tech events, and teach when I mentor companies.

Lean customer development can be done by anyone who speaks with customers or prospects. It works whether you’re a startup founder with no product and no customers, or at an established company with numerous products and customers. Now that I’ve explained my perspective on lean customer development, from here on out, I’m going to talk simply about customer development.

In my experience across multiple companies and in mentoring startups, every hour spent on customer development has saved 5, 10, or even more hours of writing, coding, and design (Figure 1-1). That doesn’t even include the harder-to-measure costs such as opportunity cost, snowballing code complexity, and eroding team morale from working hard on features that no one ends up using.

Figure 1-1. Talking to customers saves time and money

Customer development starts with a shift in mind-set. Instead of assuming that your ideas and intuitions are correct and embarking on product development, you will be actively trying to poke holes in your ideas, to prove yourself wrong, and to invalidate your hypotheses.

Every hypothesis you invalidate through conversations with prospective customers prevents you from wasting time building a product no one will buy.

Lean customer development is done in five steps:

  • Forming a hypothesis

  • Finding potential customers to talk to

  • Asking the right questions

  • Making sense of the answers

  • Figuring out what to build to keep learning

If your hypothesis is wrong or even partially wrong, you want to find out fast. If you can’t find customers, you modify your hypothesis. If customers contradict your assumptions, you modify your hypothesis. Those course corrections will lead to validating an idea that you know customers want and are willing to pay for.

What Customer Development Is Not

There are as many misunderstandings about what customer development isn’t as about what it is. Let’s clear those decks right now.

Customer Development Is Not Just for Startups

When The Lean Startup was published in 2011, many companies were slow to embrace the ideas it introduced. “We’re not a startup,” they replied.

Although Eric Ries uses the word “startup” in the title of his book and Steve Blank wrote specifically about customer development as it pertains to startups, startups are not the only companies that benefit from customer development. Startups certainly have a higher degree of uncertainty than mature companies; they are still searching for a business model, a distribution strategy, a customer base.

But larger, more mature companies also can’t assume that their models will remain static. Markets and technology change. In addition, larger companies often find it difficult to shift attention and resources away from profitable lines of business in order to explore new markets and areas of innovation—leaving them ripe for disruption. (Kodak, which I write about later in this book, enjoyed over 100 years of success before missing the boat on digital imaging and declaring bankruptcy in 2012.)

Customer development, with its focus on small-batch learning and validation, can promote internal innovation. Intuit, for example, has launched multiple products using customer development—including SnapTax and Fasal. General Electric is using lean principles. So are Toyota, the New York Department of Education, and the White House’s Presidential Innovation Fellows program.

Much of the content in this book is applicable for readers from early-stage startups, massive established companies, and anything in between. When a section is more useful for one audience than the other, I have called that out.

Customer Development Is Not Product Development

Product development answers the question “When (and what) can they buy?”

Customer development answers the question “Will they buy it?”

Product development is the process of building a new product or service and (one hopes) bringing it to market. Start with a concept, define the requirements, build the requirements, test the near-finished product, refine it, and launch it.

How you develop a product varies tremendously based on the methodology your organization follows (Waterfall, Agile, Scrum, etc.). What all product development methodologies have in common is the desired outcome: a completed product for customers to buy.

But what if the product you build is not a product that customers will buy? Is “product” the biggest risk your team faces? What about market risk? As Marc Andreessen said, “Market matters most. And neither a stellar team nor a fantastic product will redeem a bad market.”2

With customer development, you are building your customer base while you’re building a product or service that solves their specific problems. Customer development doesn’t replace product development; it’s a second process that you do in parallel with product development.

If you’ve done customer development alongside product development, you don’t need to wait until your product is launched to know whether customers will buy. You’ll know, because you will already have beta customers, evangelists, and paying customers.

Customer development and product development are two independent activities, and both are necessary to maximize your company’s chances for success.

Customer Development Does Not Replace Product Management

Some folks object, “Well, what’s left for product managers to do?”

Customer development does not replace product vision. Talking to your customers does not mean asking them what they want and writing it all down. Product management requires a disciplined approach to gathering information from a variety of sources, deciding which pieces to act upon, and figuring out how to prioritize them.

Customer development simply adds two components: a commitment to stating and challenging your hypotheses and a commitment to learning deeply about your customers’ problems and needs.

Customer development does not provide all the answers. Although it can replace many of your assumptions with actual information, it still requires a disciplined product manager to decide which pieces of information to act upon, how to prioritize them, and how to take what you’ve learned and turn it into a feature, product, or company.

Customer Development Is Not User Research

Your company may be conducting user research already. That doesn’t mean you’re practicing customer development.

Customer development does borrow from many of the techniques that have served user researchers well for decades. But the context, the practitioners, and the timing are very different.

User researchers often describe their work as “advocating for the user.” It is, unfortunately, still viewed in many companies as optional, something you should do because it delights customers.

Customer development is “advocating for the business.” It’s not something that you should do because it makes customers happy. It’s something you must do to build a sustainable business where people open their wallets and pay for your product or service.

Most new products (and companies) fail. The odds are against you. Around 75% of venture-backed startups fail.3 Anywhere from 40% to 90% of new products fail to gain significant market adoption.4

But surely, we think, we will be the exception. We like to think of building products as an art—something guided by our creativity, intuition, and intellect. We all know that there are good product managers (and designers and engineers and strategists) and mediocre ones. Maybe that’s what makes the difference between a failed product and a success?

Unfortunately not.

Universally, we’re just not very good at building products and companies solely based on creativity, intuition, and intellect. It’s not just a startup problem, either: in 1937, the companies that made up the S&P 500 had an average life expectancy of 75 years; recently that number has dropped to just 15 years.5

On a smaller scale, we’re not as good as we think we are, either. Most of our ideas don’t increase value for customers or companies—Microsoft estimates that only around one-third of their ideas improve the metrics they are intended to improve. Amazon tests every feature and fewer than 50% work; Yammer’s numbers are roughly the same. Netflix and Intuit don’t claim any higher proportion of successes.6

The truth is that it doesn’t matter how much companies research, how well they plan, how much money they spend, or how smart their employees are: the odds that they’ll avoid big mistakes are worse than a flip of a coin.

Not Just Software

I may be citing a lot of software companies, but the benefits of risk reduction and course correction are even greater for other businesses. Lines of code are far cheaper and faster to change than manufacturing setups, supplier contracts, and compliance approvals.

There’s limited opportunity to regain trust in a service that disappointed your customer, and no opportunity to alter a physical product once it’s in a customer’s hands.

For the makers of KRAVE jerky, it was critical to understand how customers defined a premium snack food (no nitrates, no artificial ingredients) before committing to a recipe and starting mass production.

For Romotive, a company that makes smartphone robots for learning, it was critical to understand the environments that their robots would be moving in. “The robot has to have good mobility and traction on carpets, hardwood floors, or over grates. Also, kids drop things! A lot of what we’ve learned about how these robots will live has influenced our hardware decisions,” says marketer Charles Liu.

How Do We Improve Our Odds?

In part, we improve our odds by embracing the idea that building products is a systematic, repeatable process. There are tools that you can use, regardless of your company’s size, maturity, or industry, to help increase your chances of success. Customer development is one of those tools.

By practicing customer development as a parallel process in conjunction with product development, you can greatly maximize your learning and reduce your risks.

If you’ve read The Lean Startup, you’ll recognize the diagram on the left side of Figure 1-2 as the Build-Measure-Learn feedback loop. It’s meant to describe how your organization should be continuously learning and adapting based on the new information you get from measuring results and learning from customers. The diagram on the right side, the Think-Make-Check loop, is a variation coined by LUXr CEO Janice Fraser.

Figure 1-2. The Build-Measure-Learn feedback loop Ries described in The Lean Startup (left) and the Think-Make-Check cycle that Janice Fraser describes in her thinking on lean UX (right)

What’s the difference? Just the starting point. You don’t need to start with the Build phase—in fact, doing so is often an expensive way to experiment.

Customer development is an important part of the Think phase. It allows you to explore and iterate during the cheapest phase of development—before any code is written or mockups are created. Customer development gives you the necessary information to build the best possible first guess, which you will then validate.

I’ve talked about learning more and reducing risk—those are valuable gains, but they don’t feel very tangible. What else will you gain from practicing customer development?

  • You’ll get a richer picture of your customer and your competition (not just companies and products but established habits and routines)

  • You’ll uncover new opportunities for differentiation

  • You’ll reduce the amount of product you need to build

Yes, that’s right: you’ll almost certainly end up writing less code! This is a consistent benefit I’ve heard from development teams: the ability to make their minimum viable product (MVP) even smaller. By talking to customers, you’ll frequently find that customers really want only two of the five features you think you need (and they may want one more you hadn’t thought of).

Everything You Know Is Wrong

Well, not quite. But as humans, we are subject to a lot of cognitive biases: our brains take shortcuts that prevent us from seeing the world the way it truly is.

We tend to operate as though we are usually right, and we interpret neutral or ambiguous evidence as supporting our beliefs rather than challenging them. This is called confirmation bias, and it’s responsible for a huge percentage of product failures.

Confirmation bias is our innate tendency to pay more attention to information that confirms our beliefs.7 We’re more likely to ignore or downplay facts that contradict our beliefs, or interpret subjective information in a way that favors what we want to believe.

We don’t do this because we’re bad or egotistical people; it’s what our brains naturally want to do. Unfortunately, it leads us to subtly sabotage ourselves: to ignore the person whose feedback contradicts one of our main assumptions or to dismiss a person as a dumb user because he doesn’t understand or value our product.

Overcoming cognitive biases is difficult. What helps is simply writing things down. By objectively documenting our assumptions, as well as the input we get from customers, it is easier to spot the discrepancies and notice when the evidence is proving us wrong.

In 2009, I was lucky enough to join a startup called KISSmetrics, which had Eric Ries as an advisor. KISSmetrics had previously built two unsuccessful versions of a web analytics product. For both versions, the company had spent many months in development, only to launch and realize that their product wasn’t solving a problem that customers needed to solve.

KISSmetrics CEO Hiten Shah hired me to help them build the third version of their product in accordance with lean startup principles. This time, they wanted to build a version that would allow the team to get the maximum amount of validated learning about customers with the least amount of effort. My first task: figure out what should be in that MVP.

I spent the first month of that job on the phone, on IM, and drinking coffee with people. I was shocked to find that:

  • So many people were willing to talk to a total stranger who didn’t even have a product

  • The features that most people requested were far more ambitious than their current behaviors and tool usage

  • We’d be able to cut our product scope in half for our initial beta

The third version of KISSmetrics was built in a month.8 It was missing tons of features and included a lot of code that made our CTO cringe. But it was enough to provide value to customers and enough for us to glean valuable insights that shaped the future direction of the product.

Answering Common Objections

I’ll assume that, having read this far, you’re convinced of the value of customer development. But how can you respond to people who are not so convinced? Table 1-1 offers tactics for responding to common objections.

Table 1-1. Responding to common objections. Each objection below is followed by a suggested response.

Objection: If we talk about our future product ideas, what’s to stop someone from stealing them and launching them before we do?

Response: First of all, we’re not telling people our product idea. That would bias what we hear from them. We’re talking to people who have a problem we hope we can solve. We’re talking to them about their problem, and how they’ve tried to solve it so far.

Objection: What if they figure out what our idea is and then steal it?

Response: It’s extremely unlikely that anyone we talk to is in a position to act upon our ideas.

But even if someone were, a great idea is nothing without great execution. By talking to customers and understanding their needs and what makes them buy, we’ll be more likely to release a superior product.

Objection: What if we get bad press coverage because of this?

Response, for startups: We’re not at a place yet where anyone wants to give us press coverage of any kind.

Response, for enterprises: We’re talking to an extremely small sample size, and we will set expectations appropriately. If it makes you feel more comfortable, we can ask prospective customers to sign nondisclosure agreements (NDAs). But this hasn’t been a problem for GE, Intuit, or Microsoft...so it’s unlikely to be a problem for us.

Objection: How will we find people to talk to? We don’t have a product or customers yet.

Response: We’ll have to figure this out once we have a product, won’t we? Come on, we know we’re solving a specific problem for a specific kind of person—we just need to figure out where those people are online or in the real world.

Objection: What if this damages our relationships with existing customers?

Response: Customer development is actually an opportunity to build stronger relationships with some of our customers. We’ll choose those most likely to be receptive, and we’ll set expectations appropriately.

Objection: If we do customer development, what’s left for the product manager to do?

Response: Customer development doesn’t mean asking customers what they want and building exactly that!

It’s a process for gathering information, and it will require a skilled product manager to prioritize that information and figure out what and how we respond to it. Customer development is just another tool to help our product managers do their jobs more effectively.

Objection: We already do market research and usability testing. How is this different?

Response: Customer development gives us information on how individual customers behave and buy.

We don’t get that from market research—it’s more high-level, covering aggregate populations. We don’t get that from usability testing—that just tells us whether someone can use our product, not whether they would buy it. Market research and usability testing may still be valuable, but they serve different purposes from customer development.

Customer development is the best low-effort way to confirm our assumptions about who our customer is, what he needs, and what he’ll buy.

Objection: How can we justify taking time away from building our product?

Response: If a few hours of customer development helps us discover that even one of our assumptions is flawed, that’s likely to save us weeks of coding and design time.

Plus, doing customer development doesn’t mean we can’t make progress on the product. We can—and should—do both in parallel.

Objection: Shouldn’t we let product managers, engineers, and designers focus on what they’re good at: building the product?

Response: If the team wants our product to be successful, they should understand the problem the product is trying to solve!

But I understand that not everyone wants to spend all day talking to customers. We can involve folks in a very lightweight way so that they have a half-hour or an hour’s exposure to customers without killing their productivity.

Let’s Make This Work

In the next nine chapters, I’ll show you exactly how to do customer development. I’ll cover specific exercises, tools and templates, sample questions, and methods that you can immediately put into practice. I’ll also provide some necessary background in behavioral economics and social psychology research—not because I love theory, but because understanding why a technique works will help you adapt it to suit your needs and the needs of your organization.

You don’t need experience in market research or user research or even in talking to customers at all—all you need is an open mind and a willingness to challenge your ideas to make them stronger.

Next Step: Get Started

As I mentioned at the beginning of this chapter, everything you do in customer development is centered around testing hypotheses. Now it’s time to start forming those hypotheses. In the next chapter, you’ll jump into exercises that help you identify your assumptions, the problem you’re solving, and who your customer is.

Key Takeaways
  • Every hour spent on customer development saves 5, 10, or even more hours of writing, coding, and design.

  • Your goal is to invalidate your assumptions about what customers want, so that you can focus on building what they will actually buy.

  • Customer development works for companies of all sizes, not just startups.

  • Customer development doesn’t replace product development. You are building your customer base while you’re building a product or service that solves their specific problems.

  • Customer development informs product management, which then decides what to build and how to prioritize features.

  • You have to work to disprove your assumptions. Cognitive bias causes you to naturally see what you want to see (what confirms your assumptions) and tune out what you don’t want to see (what invalidates your assumptions).

1. If you’ve read Steve Blank’s The Four Steps to the Epiphany, you’ll recognize that this is not his original definition of customer development. Blank defined the four steps as customer discovery, customer validation, customer creation, and company building.

But The Four Steps was written explicitly for startups, and Blank is very clear that “a startup is not a small version of a big company.” Having worked for over a decade in startups and now being a part of Microsoft, I completely agree. They are very different beasts!

Since customer development works for both startups and larger enterprise companies, I’ve proposed a broader definition that works for companies of any size, at any stage of maturity.

2. http://web.archive.org/web/20070701074943/http://blog.pmarca.com/2007/06/the-pmarca-gu-2.html

3. You’ll hear varying numbers. The National Venture Capital Association, for instance, estimates that only 25% to 30% of venture-backed startups fail completely. But the discrepancy may be due to different definitions of failure. Harvard Business School senior lecturer Shikhar Ghosh estimates that 30% to 40% of high-potential startups end up liquidating all assets—a failure by any definition. But if a startup failure is defined as not delivering the projected return on investment, then 95% of VC companies are failures, Ghosh said (http://www.inc.com/john-mcdermott/report-3-out-of-4-venture-backed-start-ups-fail.html).

4. The number varies across product categories. Highly innovative products fare even worse. For more information, see http://www.cob.unt.edu/slides/paswan/MKTG4320/freepdfgrab.pdf.

5. “What went wrong? [startup guru John Hagel III] argued that American companies and their leaders were essentially not prepared for a move away from a corporate model of ‘knowledge stocks’—developing a proprietary product breakthrough and then defending that innovative advantage against rival companies for as long as possible—and toward a more open and collaborative business model that he called ‘knowledge flows.’ The problem, he said, is that because of the increasingly global nature of business competition, the value of a major proprietary breakthrough or invention erodes in value much more quickly than in the mid-20th century” (http://knowledge.wharton.upenn.edu/article.cfm?articleid=2523).

6. Numbers from a Microsoft ThinkWeek paper (http://ai.stanford.edu/~ronnyk/ExPThinkWeek2009Public.pdf).

7. http://en.wikipedia.org/wiki/Confirmation_bias

8. KISSmetrics CEO Hiten Shah talked about the failed first two versions of KISSmetrics at the first Startup Lessons Learned conference (now called The Lean Startup conference): http://www.slideshare.net/hnshah/kissmetrics-case-study-about-pivots.

Continue reading Why you need customer development.

Categories: Technology

Using machine learning to improve dialog flow in conversational applications

O'Reilly Radar - Thu, 2018/09/13 - 04:35

The O’Reilly Data Show Podcast: Alan Nichol on building a suite of open source tools for chatbot developers.

In this episode of the Data Show, I spoke with Alan Nichol, co-founder and CTO of Rasa, a startup that builds open source tools to help developers and product teams build conversational applications. About 18 months ago, there was tremendous excitement and hype surrounding chatbots, and while things have quieted lately, companies and developers continue to refine and define tools for building conversational applications. We spoke about the current state of chatbots, specifically about the types of applications developers are building today and how he sees conversational applications evolving in the near future.

As I described in a recent post, workflow automation will happen in stages. With that in mind, chatbots and intelligent assistants are bound to improve as underlying algorithms, technologies, and training data get better.

Continue reading Using machine learning to improve dialog flow in conversational applications.

Categories: Technology

Four short links: 13 September 2018

O'Reilly Radar - Thu, 2018/09/13 - 04:10

ML Inspection, Enterprise Architecture, n=1 Study, and EU Shoots Own Foot

  1. What-If Tool -- allows users to analyze a machine learning model without the need for writing any further code. Given pointers to a TensorFlow model and a data set, the What-If Tool offers an interactive visual interface for exploring model results. Useful for non-coders to examine, evaluate, and debug machine learning systems.
  2. Archi Modeling Toolkit -- MIT-licensed software for enterprise architects to create their models and diagrams. (via JB Sarrodie)
  3. Does Knuckle-Cracking Lead to Rheumatoid Arthritis of the Fingers? -- n=1 50-year study. (via Nicholas A. Christakis)
  4. EU Approves Link Tax and Upload Filters (Verge) -- The fallout from this decision will be far-reaching and take a long time to settle. The directive itself still faces a final vote in January 2019 (although, experts say it’s unlikely it will be rejected). Cory explains and EFF vows to fight on. A major threat to the open and uncaptureable internet.

Continue reading Four short links: 13 September 2018.

Categories: Technology

Security Meeting Topic for Thursday 9/20

PLUG - Wed, 2018/09/12 - 16:22
Description:
Fear, Uncertainty, Doubt, and Disinformation exist around the modern computer processor. Revelations brought about by leaks have left the technology community concerned about what is being integrated into their hardware and what kinds of vulnerabilities exist. This discussion will introduce concepts such as Active Management Technology, Trusted Execution Environments, and Out-Of-Band system access.

About Aaron:
Aaron is an experienced Linux user with several years of teaching experience. He works in the industry as a software developer while also providing consultancy on cybersecurity-related topics. His discussions are AZ POST certified for training credit for law enforcement, and he prides himself on providing quality educational material that is relevant and topical. He has a master’s degree in Intelligence Analysis with a focus in Cyber Security, is a lifelong learner, and stays up to date with the ever-changing field of cybersecurity.

Topic for the Thursday 9/13 meeting

PLUG - Wed, 2018/09/12 - 16:11

This month der.hans will present "ActivityPub: Decentralized Social Networking" for us.

Description:
ActivityPub is a decentralized social networking protocol for allowing various services to interact.
For instance, a long-form blog engine can use an external microblogging service for comments.
ActivityPub is based on the ActivityStreams 2.0 data format. It provides a client-to-server API as well as a server-to-server API for federation between servers.

Attendees will learn about:

  • Why decentralized social networking is useful
  • The history of ActivityPub
  • The history of decentralized social networking
  • Overview of some activities available through ActivityPub
  • List of services implementing ActivityPub
  • Now what?

Biography:
der.hans is a Free Software community veteran, presenter and author. He is the founder of the Free Software Stammtisch, BoF organizer for the Southern California Linux Expo (SCaLE) and chairman of the Phoenix Linux User Group (PLUG).
As a technology and entrepreneurial veteran, he has held roles including director of engineering, engineering manager, IS manager, system administrator, community college instructor, developer, and DBA.
He presents regularly at large community-led conferences (SCaLE, SeaGL, LibrePlanet, LFNW, Tübix) and many local groups. 

The answer to life, the universe, and everything: But can you get that into production?

O'Reilly Radar - Wed, 2018/09/12 - 13:00

Ted Dunning discusses how new tools can change the way production systems work.

Continue reading The answer to life, the universe, and everything: But can you get that into production?.

Categories: Technology

Preserving privacy and security in machine learning

O'Reilly Radar - Wed, 2018/09/12 - 13:00

Ben Lorica offers an overview of recent tools for building privacy-preserving and secure machine learning products and services.

Continue reading Preserving privacy and security in machine learning.

Categories: Technology

Leveraging the best of the past to power a better future

O'Reilly Radar - Wed, 2018/09/12 - 13:00

Drew Paroski and Aatif Din share how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility.

Continue reading Leveraging the best of the past to power a better future.

Categories: Technology

Highlights from the Strata Data Conference in New York 2018

O'Reilly Radar - Wed, 2018/09/12 - 13:00

Watch highlights from expert talks covering data science, machine learning, algorithmic accountability, and more.

People from across the data world came together in New York for the Strata Data Conference. Below you'll find links to highlights from the event.

Preserving privacy and security in machine learning

Ben Lorica offers an overview of recent tools for building privacy-preserving and secure machine learning products and services.

The future of data warehousing

Executives from Cloudera and PNC Bank look at the challenges posed by data-hungry organizations.

The answer to life, the universe, and everything: But can you get that into production?

Ted Dunning discusses how new tools can change the way production systems work.

Von Neumann to deep learning: Data revolutionizing the future

Jeffrey Wecker offers a deep dive on data in financial services, with perspectives on data science, alternative data, the importance of data centricity, and the future of machine learning and AI.

AI, ML, and the IoT will destroy the data center and the cloud (just not in the way you think)

DD Dasgupta covers the edge-cloud continuum, explaining how the roles of data centers and cloud infrastructure are redefined through the mainstream adoption of AI, ML, and IoT technologies.

The missing piece

Cassie Kozyrkov explores why businesses fail at machine learning despite its tremendous potential and excitement.

Leveraging the best of the past to power a better future

Drew Paroski and Aatif Din share how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility.

The power of Ethereum

Joseph Lubin explains how Ethereum can help with new innovations like cryptocurrencies, automated and self-executing legal agreements, and self-sovereign identity.

Sound design and the future of experience

Amber Case covers methods product designers and managers can use to improve interactions through an understanding of sound design.

Wait ... pizza is a vegetable? Decoding regulations using machine learning

Dinesh Nirmal explains how AI is helping supply school lunch and keep ahead of regulations.

Practical ML today and tomorrow

Hilary Mason explores the current state of AI and ML and what’s coming next in applied ML.

Derive value from analytics and AI at scale

Ziya Ma discusses how recent innovations from Intel in high-capacity persistent memory and open source software are accelerating production-scale deployments.

Quantifying forgiveness

Julia Angwin discusses what she's learned about forgiveness from her series of articles on algorithmic accountability and the lessons we all need to learn for the coming AI future.

Smarter cities through Geotab with BigQuery ML and geospatial analytics

Chad Jennings explains how Geotab's smart city application helps city planners understand traffic and predict locations of unsafe driving.

Brain-based human-machine interfaces: New developments, legal and ethical issues, and potential uses

Amanda Pustilnik highlights potential applications of data from new technologies that capture brain-based processes.

The data imperative

Ben Sharma shares how the best organizations immunize themselves against the plague of static data and rigid process.

Black box: How AI will amplify the best and worst of humanity

Jacob Ward reveals the relationship between the unconscious habits of our minds and the way that AI is poised to amplify them, alter them, maybe even reprogram them.

Continue reading Highlights from the Strata Data Conference in New York 2018.

Categories: Technology

The missing piece

O'Reilly Radar - Wed, 2018/09/12 - 13:00

Cassie Kozyrkov explores why businesses fail at machine learning despite its tremendous potential and excitement.

Continue reading The missing piece.

Categories: Technology

AI, ML, and the IoT will destroy the data center and the cloud (just not in the way you think)

O'Reilly Radar - Wed, 2018/09/12 - 13:00

DD Dasgupta explores the edge-cloud continuum, explaining how the roles of data centers and cloud infrastructure are redefined through the mainstream adoption of AI, ML, and IoT technologies.

Continue reading AI, ML, and the IoT will destroy the data center and the cloud (just not in the way you think).

Categories: Technology

The future of data warehousing

O'Reilly Radar - Wed, 2018/09/12 - 13:00

Executives from Cloudera and PNC Bank look at the challenges posed by data-hungry organizations.

Continue reading The future of data warehousing.

Categories: Technology

Four short links: 12 September 2018

O'Reilly Radar - Wed, 2018/09/12 - 03:25

Millibytes, Webpage Bloat, Neuromorphic Computing, and UX Dark Patterns

  1. Measuring Information in Millibytes -- a cute conceit. Therefore, the information given by one passing test run [in our 1-in-90 failure scenario] is just a little over one millibyte.
  2. The Developer Experience Bait-and-Switch (Alex Russell) -- a pointed observation about bloat: If one views the web as a way to address a fixed market of existing, wealthy web users, then it’s reasonable to bias toward richness and lower production costs. If, on the other hand, our primary challenge is in growing the web along with the growth of computing overall, the ability to reasonably access content bumps up in priority.
  3. Brainchip Launches Spiking Neural Network Hardware -- Brainchip’s claim is that while a convolutional approach is more akin to modeling the neuron as a large filter with weights, the iterative linear algebra matrix multiplication on data within an activation layer and associated memory and MAC units yields a power-hungrier chip. Instead of this convolutional approach, an SNN models the neuron function with synapses and neurons with spikes between the neurons. The networks learn through reinforcement and inhibition of these spikes (repeating spikes are reinforcement).
  4. The Dark (Patterns) Side of UX Design -- We assembled a corpus of examples of practitioner-identified dark patterns and performed a content analysis to determine the ethical concerns contained in these examples. This analysis revealed a wide range of ethical issues raised by practitioners that were frequently conflated under the umbrella term of dark patterns, while also underscoring a shared concern that UX designers could easily become complicit in manipulative or unreasonably persuasive practices. We conclude with implications for the education and practice of UX designers, and a proposal for broadening research on the ethics of user experience.

Continue reading Four short links: 12 September 2018.

Categories: Technology

The real story on container, cloud, and data adoption

O'Reilly Radar - Tue, 2018/09/11 - 05:40

Poll results reveal where and why organizations choose to use containers, cloud platforms, and data pipelines.

Mesosphere conducted a poll of approximately 1,000 IT professionals to understand where they are on their container, cloud, and data adoption. Above all, the poll shows that companies are investing heavily in migrating to containers, running those containers in the cloud, and improving their data pipelines.

From outward appearances, these three pieces don’t appear related, but they go hand-in-hand. Organizations start by looking at how to improve their data pipelines. The ops team asks how they’re going to monitor hundreds of processes running on dozens of machines. Then someone says that you can monitor and separate those processes using containers. Finally, the manager looks at their hardware budget for the year and asks how much all of this new hardware will cost. The team tells the manager about the glorious land of the cloud where there aren’t upfront costs. Even better, costs can fluctuate with dynamic and usage-based allocation of resources. Everyone rejoices, pandemonium ensues, and then everyone realizes they’re nerds.

With paradisaic conditions like these, why would anyone not be using containers, cloud, and data pipelines? The results of the poll help us see why.

Data strategies

About half the people who responded said they’re either the decision maker or an influencer in the decision. By looking at the results, we’ll see why they’re making these technology choices. One driver for improved operations stems from the need for a better data pipeline and data strategy. This includes using tools that have been around for a while, such as MySQL (58%) and PostgreSQL (2%). We’re also seeing large upticks in adoption of newer technologies such as MongoDB (45%), Apache Kafka (41%), Hadoop (35%), and Apache Spark (33%).

Let’s take Kafka as an example of a technology pushing the need for better operational efficiency. With Kafka, you’ll want to run producers, consumers, and producers/consumers. At scale, there can be hundreds or thousands of these processes running. How do we know all 100 or 1,000 of those processes are alive and healthy? How do we dynamically make sure the right number of processes is running? Without the right processes and monitoring in place, your new data pipelines could become an albatross for your operations team.

I’ve taught Kafka extensively. This operational complexity is one of the first problems that companies need to solve for Kafka or any other technology that’s facilitating their data strategy.
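To make that concern concrete, here is a minimal sketch of one way to spot a dead or struggling consumer: compare a consumer group’s committed offsets against the end of the topic’s log and watch for growing lag. This is illustrative only; it assumes the kafka-python client, and the broker address, topic, and group names are placeholders rather than anything from the poll.

# Minimal consumer-lag check (a sketch, not a production monitor).
# Assumes the kafka-python package; broker, topic, and group names are placeholders.
from kafka import KafkaConsumer, TopicPartition

def consumer_lag(bootstrap_servers, topic, group_id):
    """Return {partition: lag}, where lag = end offset minus committed offset."""
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        enable_auto_commit=False,  # we only read offsets, never commit
    )
    partitions = [TopicPartition(topic, p)
                  for p in (consumer.partitions_for_topic(topic) or [])]
    end_offsets = consumer.end_offsets(partitions)
    lag = {}
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag[tp.partition] = end_offsets[tp] - committed
    consumer.close()
    return lag

if __name__ == "__main__":
    # Hypothetical cluster, topic, and consumer group.
    print(consumer_lag("localhost:9092", "clickstream", "clickstream-etl"))

A steadily growing lag on a partition hints that the consumer assigned to it has died or fallen behind; in practice you would feed numbers like these into whatever alerting system the operations team already runs.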

Containerization

When asked about their top initiatives this year, 61% of respondents said they are modernizing their infrastructure. The actual technologies and changes involved in “modernizing infrastructure” are wide-ranging, so the poll dove deeper into the sorts of projects people are undertaking: 59% said they’re trying to improve developer productivity, and 55% said they are embarking on a containerization project. There are the usual greenfield projects, with 18% saying they’re only going to containerize new applications; another 13% are going back to containerize old applications, and 34% plan to do both.

These organizations will face some challenges. It isn’t going to be easy to go back and update those old applications, as 47% of respondents indicated. As with many technology trends, the wetware (the people) lags behind, and 45% said keeping up with old applications is an issue.

Cloud

So, where do we run all of these new Kafka brokers and containers? The respondents’ companies had a wide range of annual revenues, but 44% had less than $1 billion in revenue. That size of company usually isn’t eager to expand their data center footprint. In fact, many are trying to reduce or eliminate their data centers. This is underscored by 51% of respondents saying they’re migrating to the cloud.

These moves prove difficult, too. Many applications running in the enterprise were never envisioned as running anywhere outside the organization's data center. Developers are grappling with the task of trying to run these legacy applications in the cloud, and 50% responded that they’re facing this issue.

Speaking of the data center, there are organizations that have heard of the cloud and would like to use it. But 45% of respondents weren’t “all in” on the cloud; those folks said they can’t replicate in the cloud the levels of warm fuzzies, security, and compliance that their current data center gives them.

I was having a conversation about the cloud with a friend of mine. We were discussing whether there will be a point in the future where everything is in the cloud. His thought was that the cloud providers will have such economies of scale that no one could be persuaded to run their own data center. I contended that there won’t ever be a time when everything is in the cloud: no cost difference will be large enough to keep a security or compliance person from arguing that any savings would be wiped away by the first security incident.

That’s mirrored by 28% of respondents saying they’re going to use a hybrid cloud: they’ll use cloud resources when it makes sense and on-premises resources when it makes sense. Still another 17% have a “cloud first” approach where everything goes into the cloud.

Going deeper: Next steps

If you’re reading this post to compare notes on what you should be doing, there are a few good pointers. Companies are modernizing and improving their data pipelines.

You might be reading this to double check that you’re on the right track and what to watch out for on the way. These efforts are well worth your time. My clients are experiencing great value in their data strategy and infrastructure modernization efforts.

If you’re interested in exploring container orchestration further, look no further than our free Kubernetes: Up and Running excerpt to get started.

This post is part of a collaboration between O'Reilly and Mesosphere. See our statement of editorial independence.

Continue reading The real story on container, cloud, and data adoption.

Categories: Technology

Progress for big data in Kubernetes

O'Reilly Radar - Tue, 2018/09/11 - 05:35

It has become much more feasible to run high-performance data platforms directly inside Kubernetes.

Kubernetes is really cool because managing services as flocks of little containers is a powerful way to make computing happen. We can get away from the idea that the computer runs the program and move toward the idea that a service happens because a lot of little bits of computing just happen. This idea is crucial to making reliable services that don’t require a ton of heroism to stand up or keep running.

But there is a dark side here. Containers want to be agile because that is the point of containers in the first place. We want containers because we want to make computing more like a gas made up of indistinguishable atoms instead of like a few billiard balls with colors and numbers on their sides. Stopping or restarting containers should be cheap so we can push flocks of containers around easily and upgrade processes incrementally. If ever a container becomes heavy enough that we start thinking about that specific container, the whole metaphor kind of dissolves.

So that metaphor depends on containers being lightweight. Or, at least, they have to be lightweight compared to the job they are doing. That doesn’t work out well if you have a lot of state in a few containers. The problem is that data lasts a long time and takes a long time to move. The life cycle of data is very different from the life cycle of applications. Upgrading an application is a common occurrence, but data has to live across multiple such upgrades.

Previous solutions

This all has typically meant that we had two possible options. One was to build a lot of state-handling services that each consisted of a few containers, each housing a fair bit of data. That doesn’t turn out well because these state-handling services cause problems. They don’t move easily, but because each service contains just a few containers, statistical variations in load create havoc for neighboring containers creating a need to move them. Because of poor multitenancy, managing state in containers often leads to yet another bespoke state management service for every few operational services. This is a problem because the load imposed by the services attached to each one of these stateful services is small, but the minimum number of containers required to safely manage state is typically five or more. I have heard stories of running 5,000 Kafka brokers distributed across hundreds of clusters, or hundreds of database services, or dozens of HDFS clusters. The twin problems here are that the cost of managing this cluster sprawl scales very poorly and the utilization of these machines is very poor since the load in each of these cases could typically be supported by a few dozen nodes.

The other major option has been to keep the state out of containers and put it onto a service entirely outside of Kubernetes. That can lead to grief in a few ways. First off, if your data is on a specialized storage appliance of some kind that lives in your data center, you have a boat anchor that is going to make it hard to move into the cloud. Even worse, none of the major cloud services will give you the same sort of storage, so your code isn’t portable any more. Each cloud provider has their own idiosyncratic version of storage, typically in the form of something designed to store large immutable blobs, which is only good for a few kinds of archival use. If you want anything else, you will find yourself locked into an array of the cloud provider’s specialized services.

These difficulties people are facing with containers and state have actually been very good for us at my day job because we build a system that provides a software-defined storage layer that can make a pretty good cloud-neutral distributed data platform. That’s great to have because you can use that storage platform to build a data fabric that extends from your on-premises systems into multiple cloud systems to get access to data at a performance level and with an API that you want. This data fabric can share the storage loads of a large number of applications and thus raise utilization dramatically. One particular win with this kind of design is that by putting all or most of your state in a single platform, you get a statistical leveling out of the loads imposed by the storage system, which makes managing the overall system much, much easier.

But that isn’t what we would like in the long run, either. Having stuff managed by Kubernetes and stuff that is outside Kubernetes is philosophically and practically unpleasant. It would be just so much better if we could run an advanced storage platform inside Kubernetes so that everything we do could be managed uniformly. What we need is something that would scale to handle the state required by most or all of the services we are running in all the different forms that we need, but still run in containers managed by Kubernetes. By having one service act as a data platform that handles all or most of the data for the rest of the service, we still face the problem that the containers that make up this service will have mass due to the data they are managing. If this service handles storage for many services, though, we can put the law of large numbers to work to our advantage and thus improve the manageability.

Recent advances in Kubernetes

This vision of having a high-performance storage platform that runs inside Kubernetes is becoming a very real possibility due to recent advances in how Kubernetes works. The basic idea of attaching containers to storage volumes has been around since Kubernetes 1.2, but until recently, this was mostly useful to attach to external storage systems. More recently, however, it has become much more feasible to run high-performance data platforms directly inside Kubernetes itself.

There are a few key problems to be solved to do this. These include:

  • It is common that there need to be some machine-level (as opposed to pod-level) bits that run exactly once per host OS instance. This is particularly true for code that accesses physical resources, such as the low-level code that implements the data platform itself, but you probably will need something like this for once-per-host client-side driver mechanisms as well. DaemonSets solve both of these problems by allowing you to define pods of containers that are guaranteed to run on every node. As you add nodes to your Kubernetes cluster, all applicable DaemonSets will be started on these new nodes if they meet the specifications you set. Thus, if you add a node with storage resources, the low-level device interface code will run and will handle talking to the disks. As you add nodes that will have storage clients (potentially every node in the cluster), shared client-side code will run. (A minimal sketch of defining such a per-node DaemonSet appears after this list.)
  • You need to be able to use unmodified Docker containers, even though there may be some driver code required. For instance, you might want to run an official PostgreSQL database image completely unchanged. Sidecar containers are ideal for this, along with the Flexvolume plugin. Until Kubernetes 1.8, it was typical to put vendor-specific drivers directly into the Kubernetes source code. Since then, the standard practice is to use the container storage interface (CSI) or Flexvolume plugins to deliver these drivers. The basic idea is that your pod will contain an extra container that talks to the client-side DaemonSet and causes the appropriate data to be mounted and exposed to the application containers in your pod.
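To make the first bullet concrete, here is a minimal sketch of registering a per-node storage agent as a DaemonSet using the official Kubernetes Python client. This is illustrative only: the image name, labels, and namespace are placeholders rather than any real storage vendor's driver, and a real agent would also need volume mounts for the host's devices.

# A sketch of creating a DaemonSet so one storage-agent pod runs on every node.
# Assumes the official `kubernetes` Python client; all names and the image are placeholders.
from kubernetes import client, config

def deploy_storage_agent(namespace="kube-system"):
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    container = client.V1Container(
        name="storage-node-agent",
        image="example.com/storage/node-agent:1.0",  # placeholder image
        security_context=client.V1SecurityContext(privileged=True),  # raw device access
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "storage-node-agent"}),
        spec=client.V1PodSpec(containers=[container], host_network=True),
    )
    spec = client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "storage-node-agent"}),
        template=template,
    )
    daemon_set = client.V1DaemonSet(
        api_version="apps/v1",
        kind="DaemonSet",
        metadata=client.V1ObjectMeta(name="storage-node-agent"),
        spec=spec,
    )
    client.AppsV1Api().create_namespaced_daemon_set(namespace=namespace, body=daemon_set)

if __name__ == "__main__":
    deploy_storage_agent()

Because the DaemonSet controller schedules one matching pod per node, adding a node with storage hardware automatically brings up the low-level device code there, which is the behavior the first bullet describes.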

This basic idea of pods running on two different nodes is illustrated in Figure 1. The low-level storage code is running on both nodes, while the shared client code is only running on a single node where an application is also running.

Figure 1. An example of how pods interact to provide access to a shared data platform in a Kubernetes system. Here, we have two nodes, both running storage services. One node also runs a shared client-access process that is used by an application pod to access data in the data platform formed by the low-level storage services. Figure by Ted Dunning.

The exciting thing about this architecture is that it permits independent evolution on several levels. The storage code can evolve independently of Kubernetes, the application container can access the data platform with no changes whatsoever, and any or all pieces of the system can be upgraded on the fly in a running production cluster.

One subtle point is that having a shared client-side daemon allows for more efficient access to network and storage services without necessarily imposing an extra copy of the data between the application and the disk or network.

The implications for big data

Ultimately, what matters is what you can do with a system like this, and that is where these new capabilities really shine. The goal is to build a system that is conceptually like what we see in Figure 2. Here, Kubernetes is in charge of managing and deploying applications, and these applications persist data to a data platform. To simplify the system, we want Kubernetes to manage the execution of the data platform itself, and we want the data platform to manage the complete life cycle of the application data. In fact, you could say that we want the data platform to be “like Kubernetes, but for data.”

Figure 2. In our ideal vision, applications would be able to communicate state by persisting data in any desired form, file, stream, or table via a data platform that manages this data in much the way Kubernetes manages applications using containers. Figure by Ted Dunning.

The ability to host a data platform in this way has been part of Kubernetes’ vision from nearly the beginning, but it is only now becoming possible to build out a system like this with all of the performance, security, and features needed to fully support a multitenant data platform.

Big data systems have always stressed storage systems. The original rationale for HDFS and for higher-performance follow-ons like MapR FS has always been that big data applications need much more performance than dedicated storage appliances can deliver. That is truer than ever now that modern hardware makes it possible to support enormous throughput. As it becomes possible to support this kind of performance entirely within Kubernetes, we are going to see enormous benefits, such as the ability to simultaneously run multiple versions of big data tool sets like Spark that interface transparently with machine learning systems like TensorFlow or H2O.

Future outlook

As exciting as these recent developments are, more is on the way. Most of the work so far in Kubernetes involves file system APIs, but files are not the only way we want to persist data. For some applications, tables are much better; between services, message streams are often the right thing to use. Modern data platforms can support all of these persistence modes at the same time in a fully integrated way, but the Portable Operating System Interface (POSIX) API isn't enough to reach them, so file-oriented volume plugins like Flexvolume aren't sufficient on their own. There isn't, however, any real reason that Flexvolumes can't provide alternative API access to persistence beyond simple files.
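
To illustrate why a file API alone falls short, here is a hedged sketch of the two styles of persistence as seen from inside an application pod: writing state to a POSIX path that a volume plugin has mounted, versus publishing the same state to a message stream. The mount path, topic name, and broker address are hypothetical, and kafka-python stands in for whatever streaming API a given data platform actually exposes.

```python
# A hedged sketch contrasting two persistence modes from inside an application pod.
# "/mnt/data", the "events" topic, and "broker:9092" are hypothetical placeholders;
# kafka-python (pip install kafka-python) is a stand-in for any streaming API.
import json
from pathlib import Path

from kafka import KafkaProducer


def persist_as_file(record: dict) -> None:
    """File persistence: whatever a volume plugin mounts at a POSIX path just works."""
    out = Path("/mnt/data/orders.jsonl")
    with out.open("a") as f:
        f.write(json.dumps(record) + "\n")


def persist_as_stream(producer: KafkaProducer, record: dict) -> None:
    """Stream persistence: requires its own API; there is no POSIX path to mount."""
    producer.send("events", json.dumps(record).encode("utf-8"))


if __name__ == "__main__":
    record = {"order_id": 42, "status": "shipped"}
    persist_as_file(record)
    producer = KafkaProducer(bootstrap_servers="broker:9092")
    persist_as_stream(producer, record)
    producer.flush()
```

The first function works with any storage that can present a file system through a volume mount; the second shows why tables and streams need a richer interface than a mount point, which is exactly the gap the mechanisms discussed here could grow to fill.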

It is reasonable to expect that, before long, we will see ways of supporting more than just files via the same basic mechanisms we have discussed here. But even before that happens, it is already an amazing time to be working on big data under Kubernetes.

Continue reading Progress for big data in Kubernetes.

Categories: Technology

Four short links: 11 September 2018

O'Reilly Radar - Tue, 2018/09/11 - 05:10

Serverless, Predicting Personality, Broken Design, and Hamming Lectures

  1. Serverless Cold Start War -- hard numbers on the cold start time on different function-as-a-service providers.
  2. Eye Movements During Everyday Behavior Predict Personality Traits -- Using a state-of-the-art machine learning method and a rich set of features encoding different eye movement characteristics, we were able to reliably predict four of the big five personality traits (neuroticism, extraversion, agreeableness, conscientiousness) as well as perceptual curiosity, only from eye movements.
  3. Broken Product Design (We Make Money Not Art) -- Not only did he ask them to fabricate items that would be unusable but he also requested that each worker had full license to decide what the error, flaw, and glitch in the final product would be. Hutchison ended up with a collection of dysfunctional objects and prints of online exchanges with baffled factory managers.
  4. Learning to Learn (Richard Hamming) -- watch lectures in computer architecture, engineering, data, measurement, and quantum mechanics from a legend. (via Star Simpson)

Continue reading Four short links: 11 September 2018.

Categories: Technology

The ethics of data flow

O'Reilly Radar - Tue, 2018/09/11 - 04:00

If we’re going to think about the ethics of data and how it’s used, then we have to take into account how data flows.

Data, even “big data,” doesn’t stay in the same place: it wants to move. There’s a long history of language about moving data: we have had dataflow architectures, there's a great blog on visualization titled FlowingData, and Amazon Web Services has a service for moving data by the (literal) truckload. Although the scale and speed at which data moves has changed over the years, we’ve recognized the importance of flowing data ever since the earliest years of computing. If we’re going to think about the ethics of data and how it’s used, then, we can’t just think about the content of the data, or even its scale: we have to take into account how data flows.

In Privacy in Context, Helen Nissenbaum connects data’s mobility to privacy and ethics. For Nissenbaum, the important issue isn’t what data should be private or public, but how data and information flow: what happens to your data, and how it is used. Information flows are central to our expectations of privacy, and respecting those expectations is at the heart of data ethics. We give up our data all the time. It’s next to impossible to live in modern society without giving up data: we use credit cards to pay for groceries, we make reservations at restaurants, we fill prescriptions at pharmacies. And we usually have some sort of expectation that our data will be used. But those expectations include expectations about how the data will be used: who will have access to it, for what reason, and for what purposes.

Problems arise when those expectations are violated. As Nissenbaum writes, "What people care most about is not simply restricting the flow of information but ensuring that it flows appropriately." The infamous Target case, in which Target outed a pregnant teenager by sending ad circulars to her home, is a great example. We all buy things, and when we buy things, we know that data is used—to send bills and to manage inventory, if nothing else. In this case, the surprise was that Target used this customer's purchase history to identify her as pregnant, and send circulars advertising products for pregnant women and new mothers to her house. The problem isn't the collection of data, or even its use; the problem is that the advertising comes from, and produces, a different and unexpected data flow. The data that’s flowing isn’t just the feed to the marketing contractor. That ad circular, pushed into a mailbox (and read by the girl’s father) is another data flow, and one that’s not expected. To be even more precise: the problem isn’t even putting an ad circular in a mailbox, but that this data flow isn’t well defined. Once the circular goes in the mailbox, anyone can read it.

Facebook’s ongoing problems with the Cambridge Analytica case aren’t problems of data theft or intrusion; they’re problems of unexpected data flows. Customers who played the game "This is Your Digital Life" didn’t expect their data to be used in political marketing—to say nothing of their friends’ data, which was exposed even though those friends never played. Facebook asked Cambridge Analytica to delete the data back in 2015, but apparently did nothing to determine whether the data was actually deleted, or shared further. Once data has started flowing, it is very difficult to stop it.

Data flows can be very complex. danah boyd, in the second chapter of It’s Complicated: The Social Lives of Networked Teens, describes the multiple contexts that teenagers use on social media, and their strategies for communicating within their groups in a public medium: in particular, their use of coded messages that are designed to be misunderstood by parents or others not in their group. They are creating strategies to control information flows that appear to be out of their control. Teens can’t prevent parents from seeing their Facebook feeds, but they can use a coded language to prevent their parents from understanding what they’re really saying.

Everyone who works with data knows that data becomes much more powerful when it is combined with data from other sources. Data that seems innocuous, like a grocery store purchase history, can be combined with geographic data, medical data, and other kinds of data to characterize users and their behavior with great precision. Knowing whether a person purchases cigarettes can be of great interest to an insurance company, as can knowing whether a cardiac patient is buying bacon. Increasing the police presence in some neighborhood areas inevitably leads to more arrests in those neighborhoods, creating the appearance of more crime. Data flows have complex topologies: multiple inputs, outputs, and feedback loops. The question isn’t just where your data goes and how it will be shared; it’s also what incoming data will be mixed with your data.

Nissenbaum argues that we shouldn’t be asking about absolute notions of what data should or shouldn’t be “private,” but about where the data can travel, our expectations about that travel, and what happens when data reaches its destination. That makes a lot of intuitive sense. A pharmacy or a grocery store collects a lot of data just to do business: again, it has to do billing, it has to manage stock. It has some control over how that data is remixed, shared, and commoditized. But it doesn't have control over how its partners ultimately use the data. It might be able to control what mailers its advertising agencies send out—but who's going to raise a red flag about an innocent circular advertising baby products? It can't control what an insurance company, or even a government agency, might do with that data: deny medical benefits? Send a social worker? In many cases, consumers won't even know that their privacy has been violated, let alone how or why; they'll just know that something has happened.

As developers, how can we understand and manage data flows according to our users' expectations? That's a complex question, in part because our desires and expectations as developers are different from our users’, and we can’t assume that users understand how their data might be put to work. Furthermore, enumerating and evaluating all possible flows, together with the consequences of those flows, is certainly NP-hard.

But we can start asking the difficult questions, recognizing that we’re neither omniscient nor infallible. The problem facing us isn’t that mistakes will be made, because they certainly will; the problem is that more mistakes will be made, and more damage will be done, if we don’t start taking responsibility for data flows. What might that responsibility mean?

Principles for ethical data handling (and human experimentation in general) always stress "informed consent"; Nissenbaum’s discussion about context suggests that informed consent is less about usage than about data flow. The right question isn't, "can our partners make you offers about products you may be interested in?" but, "may we share your purchase data with other businesses?" (If so, what businesses?) Or perhaps, “may we combine your purchase data with other demographic data to predict your future purchases?” (If so, what other demographic data?)

One way to prevent unexpected data flows is to delete the data before it has a chance to go anywhere. Deleted data is hard to abuse. A decade ago, data developers were saying "Save everything. Storage is cheap." We now understand that's naive. If data is collected for a purpose, it might be necessary to delete it when it has served its purpose—for example, most libraries delete records of the books a user has checked out after the books have been returned. Deleted data can’t be stolen, inadvertently shared, or demanded by a legal warrant. “Save everything” invites troublesome data flows.

But data deletion is easier said than done. The difficulty, as Facebook found out with Cambridge Analytica, is that asking someone to delete data doesn’t mean they will actually delete it. It isn’t easy to prove that data has been deleted; we don’t have auditing tools that are up to the task. In many cases, it’s not even clear what “deletion” means: does it mean that the data is removed from backups? Backups from which data is removed after-the-fact aren’t really backups; can they be trusted to restore the system to a known state? Reliable backups are an important (and infrequently discussed) part of ethical data handling, but they are also a path through which data can escape and continue to flow in the wild.

And deletion doesn’t always work in the users’ favor. Deleting data prematurely makes it difficult for a customer to appeal a decision; redress assumes we can reconstruct what happened to find an appropriate solution. Historically, it’s almost certainly true that more data has been deleted to preserve entrenched power than to preserve individual privacy. The ability to "undelete" is powerful, and shouldn't be underestimated. Data should be deleted as soon as it’s no longer needed, but no sooner—and determining when data really is no longer needed isn’t a trivial problem.

These aren’t problems to be solved in a short article. However, they are problems that we in the data community need to recognize and face. They won’t go away; they will become more serious and urgent as time goes on. How does data flow? What dams and levees can we create that will prevent data from flowing in unexpected or unwanted ways? And once we create those levees, what will happen when they break? That will inevitably be one of the most important stories of the next year.

Continue reading The ethics of data flow.

Categories: Technology

Four short links: 10 September 2018

O'Reilly Radar - Mon, 2018/09/10 - 04:25

Optoelectronics, Checked C, MagicScroll, Quantum AWS

  1. The Largest Cognitive Systems Will be Optoelectronic -- Electrons and photons offer complementary strengths for information processing. Photons are excellent for communication, while electrons are superior for computation and memory. Cognition requires distributed computation to be communicated across the system for information integration. We present reasoning from neuroscience, network theory, and device physics supporting the conjecture that large-scale cognitive systems will benefit from electronic devices performing synaptic, dendritic, and neuronal information processing operating in conjunction with photonic communication.
  2. Checked C -- This paper presents Checked C, an extension to C designed to support spatial safety, implemented in Clang and LLVM. Checked C’s design is distinguished by its focus on backward-compatibility, incremental conversion, developer control, and enabling highly performant code. Like past approaches to a safer C, Checked C employs a form of checked pointer whose accesses can be statically or dynamically verified. Performance evaluation on a set of standard benchmark programs shows overheads to be relatively low. More interestingly, Checked C introduces the notions of a checked region and bounds-safe interfaces. Here's the source.
  3. MagicScroll: A Rollable Display Device with Flexible Screen Real Estate and Gestural Input -- a rollable tablet with two concatenated flexible multitouch displays, actuated scrollwheels, and gestural input. When rolled up, MagicScroll can be used as a rolodex, smartphone, expressive messaging interface, or gestural controller. When extended, it provides full access to its 7.5-inch high-resolution multitouch display, providing the display functionality of a tablet device.
  4. Rigetti Launches Quantum Cloud Services (FastCompany) -- AWS-style cloud platform with a fast connection to 128-qubit computing. Grabbing land ahead of quantum computing actually being useful.

Continue reading Four short links: 10 September 2018.

Categories: Technology

Machine learning in the cloud

O'Reilly Radar - Fri, 2018/09/07 - 13:00

Hagay Lupesko explores key trends in machine learning, the importance of designing models for scale, and the impact that machine learning innovation has had on startups and enterprises alike.

Continue reading Machine learning in the cloud.

Categories: Technology
