
Four short links: 15 April 2019

O'Reilly Radar - Mon, 2019/04/15 - 04:15

Making a Group, Robot Arms, Human Contact, and a Personal Archive

  1. You Should Organize a Study Group/Book Club/Online Group/Event! Tips on How to Do It (Stephanie Hurlburt) -- good advice on how to get people together.
  2. Berkeley Open Arms -- Berkeley Open Arms manufactures the BLUE robot arm that was developed at UC Berkeley's Robot Learning Lab. Paper (arXiv link).
  3. Human Contact is a Luxury Good (NYT) -- Life for anyone but the very rich—the physical experience of learning, living, and dying—is increasingly mediated by screens. Not only are screens themselves cheap to make, but they also make things cheaper. [...] The rich do not live like this. The rich have grown afraid of screens. They want their children to play with blocks, and tech-free private schools are booming. Humans are more expensive, and rich people are willing and able to pay for them. Conspicuous human interaction—living without a phone for a day, quitting social networks and not answering email—has become a status symbol.
  4. ArchiveBox -- The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more.


Categories: Technology

Strata San Francisco, 2019: Opportunities and Risks

O'Reilly Radar - Fri, 2019/04/12 - 04:00

Balancing risk and reward is a necessary tension we'll need to understand as we continue our journey into the age of data.

The Strata Data Conference in San Francisco was filled with speakers talking about opportunity. But those opportunities were balanced against risks—risks that loom large as we discover more powerful ways to apply data using machine learning and artificial intelligence. It's a necessary tension we'll need to understand as we continue on the journey into the age of data.

Cloudera's merger with Hortonworks demonstrates some of the opportunities. They "drank their own champagne" (a metaphor preferable to eating dog food) by using machine learning to merge the two companies: clustering similar customers, predicting sales opportunities, and integrating the two teams.

In his keynote, program co-chair Ben Lorica gave excellent advice for organizations that are just starting on the road to machine learning: companies that have been successful with machine learning have either built on existing data products or services, or used machine learning to modernize existing applications. Companies that attempt to make a leap into the void, working with data and services they don’t understand well, will have a rough time. Machine learning grows out of your current data practices. It may be revolutionary, but if you haven’t prepared for the revolution by developing your data sources, learning how to clean your data, preparing for data governance, and more, you’ll inevitably fall behind. Fortunately, there are tools—both open source and commercial—to help in all these areas.

Some of the most important opportunities are for democratizing data: not just making data accessible, but making it usable by everyone in the organization, even those without programming skills. Jeremy Howard's session showed how a subject expert with no prior programming knowledge can make an AI application. Howard told me about a dermatologist who has built an application that classifies burns. (He also recommended against watching the demo before lunch.) Efforts like this are key to building AI systems that create a better world. Emergency responders need tools that assist them in the field, tools that can be built into their phones, and let them make decisions without waiting for an MD.

According to Mike Olson, the most important thing we've learned from cloud computing is that "easy seriously matters." Easy doesn't just mean you can pay for computing with your credit card, or add and subtract servers at a moment's notice. And it doesn't just mean providing good tools for analytics. Easy applies to every aspect of computing, particularly self-service data. Easy means making tools for building data pipelines that don't care where the data is physically located (in a data center or the cloud), and that understand regulations governing that data and how it is used, and that make data accessible without requiring programming skills. These are tools that can be used by anyone, not just engineers and data analysts: managers, executives, and sales and marketing folks.

Moving data and computing to the cloud remains a tremendous opportunity. We're still in the early days of cloud computing: many companies that could move their data to the cloud haven't yet done so. Jordan Tigani of Google talked about the many opportunities the cloud represents, starting with decoupling data storage from computation, reducing administrative overhead, building real-time pipelines, eliminating silos, and enabling access for all users. All these benefits flow naturally from moving data to the cloud and relying on the scale of infrastructure that only cloud providers give you.

What about the risks? Several speakers, including Peter Singer and David Sanger, talked about the dangers of an increasingly militarized network. Peter Singer said: "There is no silver bullet. There will continue to be marketing, politics, wars, all taking place online. We need new strategies for dealing with it." These dangers increase as our tools become more powerful; Singer said that we can look forward to "deep fakes" (fake videos), and Elizabeth Svoboda discussed how neuroscience is already used to construct political messages that trigger fear responses.

We also heard about progress toward meeting these challenges. Shafi Goldwasser challenged developers to create “Safe ML”: machine learning that can’t be abused. Machine learning needs to ensure privacy, both of the training data and the model, and needs to be fair and invulnerable to tampering. The tools we need to create Safe ML have been under development among cryptographers for the past 30 years, well before modern machine learning became practical. The challenge facing machine learning developers is taking these tools—federated learning, multiparty cryptography, homomorphic encryption, and differential privacy—and putting them to use. Her points were echoed in several other sessions throughout the conference.
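One of those tools can be made concrete in a few lines. Below is a toy sketch of the Laplace mechanism that underlies differential privacy, the last item on Goldwasser's list. Everything in it is illustrative: the function names, the sample ages, and the epsilon value are invented, and a production system would use a vetted library rather than hand-rolled noise.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sample from Laplace(0, scale).
    u = random.random() - 0.5          # uniform in [-0.5, 0.5)
    if u == -0.5:                      # avoid log(0); probability ~2**-53
        u = 0.0
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(values, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1 (adding or removing one record
    # changes the count by at most 1), so Laplace(1/epsilon) noise
    # suffices for epsilon-differential privacy.
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical query: how many patients in this dataset are 40 or older?
ages = [23, 35, 41, 29, 52, 61, 19, 44]
print(round(private_count(ages, lambda a: a >= 40, epsilon=0.5), 2))
```

The point of the sketch is the trade-off: smaller epsilon means stronger privacy but noisier answers, which is exactly the kind of design decision Safe ML asks developers to make explicit.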

At the ethics summit, participants discussed the many problems in building software systems ethically. There are clearly dangers here: hardly a day goes by without news of data abuse. But perhaps the most interesting discussion was whether ethics is a zero-sum game or a business opportunity. Does treating customers fairly and respecting their individuality and their privacy represent an opportunity? There are a lot of things you can say about Amazon's business practices, but almost nobody criticizes the ease with which you can return merchandise. What other opportunities are there? Many customers have become cynical, and expect to be treated badly; too few companies have thought seriously about using data to make their customers' lives better. That may be changing.

These themes were echoed in the Future of the Firm track, which focused on rethinking the corporation for the digital era. The future isn’t just about “implementing AI,” but about building organizations that work better: that support their employees’ training needs, that listen to their employees on ethical issues, that take a human-centered approach to AI. The future of the firm is about taking advantage of data—but it’s about taking advantage of data to build a better future for customers, employees, and investors.

Putting data to work is an opportunity; we’ve been making that point since the first Strata conference. The risks of a hostile, militarized network are real. But the opportunities—for corporations, for employees, for customers—are far greater.



Four short links: 12 April 2019

O'Reilly Radar - Fri, 2019/04/12 - 04:00

Automating Statistical Analysis, Chinese AI, Data Sovereignty, and Open vs. Government Licensing

  1. Tea: A High-level Language and Runtime System for Automating Statistical Analysis -- In Tea, users express their study design, any parametric assumptions, and their hypotheses. Tea compiles these high-level specifications into a constraint satisfaction problem that determines the set of valid statistical tests, and then executes them to test the hypothesis. Open source.
  2. Chinese AI -- the things that you probably don't realize about Chinese AI, such as the language gap disadvantaging Western researchers. (via BoingBoing)
  3. It's Time to Think about Jurisdictional Data Sovereignty (Kris Constable) -- not something that Americans think about, but which the rest of the world is chewing on.
  4. The Curious Case of Public Sans (Matthew Butterick) -- Public Sans is a derivative of Libre Franklin, which requires derivatives to be released under the SIL Open Font License (OFL). But a work of a U.S. government employee or agency is in the public domain. Oof.



PLUG Security Meeting 18th - Piratebox and active shooter situations

PLUG - Thu, 2019/04/11 - 08:56
This month Aaron Jones has a number of topics he will be covering at the PLUG Security meeting.

Part 1: In light of recent events in New Zealand, we will discuss active shooter situations, terrorism, and physical security. We will cover the OODA loop, mental toughness, and steps you can take to prepare for emergency situations. Topics will include threat avoidance, the Marine Corps Combat Hunter Program concept of 'Left of Bang', and basic combat lifesaver tools and techniques.

Part 2: PirateBox is a DIY anonymous offline file-sharing device that you can create at home. The PirateBox is a simple way to get started with mesh networking and an exciting way to provide decentralized communication in your local area. We will discuss how PirateBox and similar projects can improve your ability to communicate in a grid-down situation, assist with file sharing at parties or events, and even work as a fun image board.

About Aaron:
Aaron is an experienced Linux user with several years of teaching experience. He works in the industry as a software developer while also providing consultancy on cyber security topics. His discussions are AZ POST certified for law enforcement training credit, and he prides himself on providing quality educational material that is relevant and topical. He has a master's degree in Intelligence Analysis with a focus in Cyber Security, is a lifelong learner, and stays up to date with the ever-changing field of cyber security.

What data scientists and data engineers can do with current generation serverless technologies

O'Reilly Radar - Thu, 2019/04/11 - 04:30

The O’Reilly Data Show Podcast: Avner Braverman on what’s missing from serverless today and what users should expect in the near future.

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.



Four short links: 9 April 2019

O'Reilly Radar - Tue, 2019/04/09 - 04:20

From Chrome to Edge, Old Web, Public Sans, and The Feedback Fallacy

  1. What Microsoft Removed from Chrome to make Edge (The Verge) -- Microsoft has removed or replaced more than 50 of Google’s services that come as part of Chromium, including things like ad blocking, Google Now, Google Cloud Messaging, and Chrome OS-related services.
  2. It Seems that Google is Forgetting the Old Web -- it seems more correct to say that Google forgets stuff that is more than 10 years old. If this is the case, Google will remember and index a smaller part of the web every year. Google may do so simply because it would be impossible to do more, for economical and/or technological constraints, which sooner or later would also hit its competitors. But this only makes bigger the problem of what to remember, what to forget, and, above all, how and who should remember and forget.
  3. Public Sans -- Open source. A strong, neutral typeface for text or display. From USWDS.
  4. The Feedback Fallacy (HBR) -- identifies three theories underpinning coworker feedback, and shows how they're all wrong. What these three theories have in common is self-centeredness: they take our own expertise and what we are sure is our colleagues’ inexpertise as givens; they assume that my way is necessarily your way. But as it turns out, in extrapolating from what creates our own performance to what might create performance in others, we overreach. Research reveals that none of these theories is true. Gives advice on how to give feedback more effectively, too. At best, this fetish with feedback is good only for correcting mistakes—in the rare cases where the right steps are known and can be evaluated objectively. And at worst, it’s toxic.



Why a data scientist is not a data engineer

O'Reilly Radar - Tue, 2019/04/09 - 04:00

Or, why science and engineering are still different disciplines.

"A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him."

–Gordon Lindsay Glegg, The Design of Design (1969)

A few months ago, I wrote about the differences between data engineers and data scientists. I talked about their skills and common starting points.

An interesting thing happened: the data scientists started pushing back, arguing that they are, in fact, as skilled as data engineers at data engineering. That was interesting because the data engineers didn’t push back saying they’re data scientists.

So, I’ve spent the past few months gathering data and observing the behaviors of data scientists in their natural habitat. This post will offer more information about why a data scientist is not a data engineer.

Why does this even matter?

Some people complained that this data scientist versus data engineer question is a mere focus on titles. “Titles shouldn’t hold people back from learning or doing new things,” they argued. I agree; learn as much as you can. Just know that your learning may only scratch the surface of what’s necessary to put something into production, and that gap leads to failure with big data projects.

It’s also feeding into the management level at companies. They’re hiring data scientists expecting them to be data engineers.

I’ve heard this same story from a number of companies, and they all play out the same way: a company decides that data science is the way to get VC money, tons of ROI, mad street cred in its business circle, or some other reason. This decision happens at the C-level or VP-level. Let’s call this C-level person Alice.

The company goes on an exhaustive search to find the best data scientist ever. Let’s call this data scientist Bob.

It’s Bob’s first day. Alice comes up to Bob and excitedly tells him about all the projects she has in mind.

“That’s great. Where are these data pipelines and where is your Spark cluster?” Bob asks.

Alice responds, “That’s what we’re expecting you to do. We hired you to do data science.”

“I don’t know how to do any of that,” says Bob.

Alice looks at him quizzically, “But you’re a data scientist. Right? This is what you do.”

“No, I use the data pipelines and data products that are already created.”

Alice goes back to her office to figure out what happened. She stares at overly simplistic diagrams like the one shown in Figure 1 and can’t figure out why Bob can’t do the simple big data tasks.

Figure 1. Overly simplistic Venn diagram of data scientists and data engineers. Illustration by Jesse Anderson, used with permission.

The limelight

There are two questions that come out of these interactions:

  • Why doesn’t management understand that data scientists aren’t data engineers?
  • Why do some data scientists think they’re data engineers?

I’ll start with the management side. Later on, we’ll talk about the data scientists themselves.

Let’s face it. Data engineering is not in the limelight. It isn’t being proclaimed as the best job of the 21st century. It isn’t getting all of the media buzz. Conferences aren’t telling CxOs about the virtues of data engineering. If you only take a cursory look at the messaging, it’s all about data science and hiring data scientists.

This is starting to change. We have conferences on data engineering. There is a gradual recognition of the need for data engineering. I’m hoping pieces like this one shed light on this necessity. I’m hoping my body of work will educate organizations on this critical need.

Recognition and appreciation

Even when organizations have data science and data engineering teams, there is still a lack of appreciation for the work that went into the data engineering side.

You even see this lack of credit during conference talks. The data scientist is talking about what they’ve created. I can see the extensive data engineering that went into their model, but it’s never called out during the talk. I don’t expect the talk to cover it in detail, but it would be nice to acknowledge the work that went into enabling their creation. Management and beginners to data science perceive that everything was possible with the data scientist’s skill set.

How to get appreciation

Lately, I’ve been getting questions from data engineers on how to get into their company’s limelight. They’re feeling that when a data scientist goes to show their latest creation, they’re either taking all of the credit or they’re given all of the credit by the management. Their basic question is: “How can I get the data scientists to stop taking credit for something that was both of our work?”

That’s a valid question from what I’m seeing at companies. Management doesn’t realize (and it isn’t socialized) the data engineering work that goes into all things data science. If you’re reading this and you’re thinking:

  • My data scientists are data engineers
  • My data scientists are creating really complicated data pipelines
  • Jesse must not know what he’s talking about

…you probably have a data engineer in the background who isn’t getting any limelight.

Similar to when data scientists quit without a data engineer, data engineers who don’t get recognition and appreciation will quit. Don’t kid yourself; there’s an equally hot job market for qualified data engineers as there is for data scientists.

Data science only happens with a little help from our friends

Figure 2. Even the Italians knew the importance of data engineers in the 1400s. Image from the Met Museum, public domain.

You might have heard about the myth of Atlas. He was punished by having to hold up the world/sky/celestial spheres. The earth only exists in its current form because Atlas holds it up.

In a similar way, data engineers hold up the world of data science. There isn’t much thought or credit that goes to the person holding up the world, but there should be. All levels of an organization should understand that data science is only enabled through the work of the data engineering team.

Data scientists aren’t data engineers

That brings us to why data scientists think they’re data engineers.

A few caveats to head off comments before we continue:

  • I think data scientists are really smart, and I enjoy working with them.
  • I wonder if this intelligence leads to a high-IQ version of the Dunning-Kruger effect.
  • Some of the best data engineers I’ve known have been data scientists, though this number is very small.
  • There is a consistent overestimation when assessing our own skills.

Figure 3. Empirical diagram of data scientists’ perceived data engineering skills versus their actual skills. Illustration by Jesse Anderson, used with permission.

In talking to data scientists about their data engineering skills, I’ve found their self-assessments vary wildly. It’s an interesting social experiment in biases. Most data scientists overassessed their own data engineering abilities. Some gave an accurate assessment, but none of them gave a lower assessment than their actual ability.

There are two things missing from this diagram:

  • What is the skill level of data engineers?
  • What is the skill level needed for a moderately complicated data pipeline?

Figure 4. Empirical diagram of data scientists’ and data engineers’ data engineering skills with the skill needed to create a moderately complicated data pipeline. Illustration by Jesse Anderson, used with permission.

From this figure, you can start to see the differences in the required data engineering abilities. In fact, I’m being more generous with the number of data scientists able to create a moderately complicated data pipeline. The reality may be that data scientists should be half of what the diagram shows.

Overall, it shows the approximate portions of the two groups who can and cannot create data pipelines. Yes, some data engineers can’t create a moderately complicated data pipeline. Conversely, most data scientists can’t, either. This comes back to the business issue at hand: organizations are giving their big data projects to individuals who lack the ability to succeed with the project.

What’s a moderately complicated data pipeline?

A moderately complicated data pipeline is one step above the bare minimum to create a data pipeline. An example of a bare minimum is processing text files stored in HDFS/S3 with Spark. An example of a moderately complicated data pipeline is to start optimizing your storage with a correctly used NoSQL database that uses a binary format like Avro.
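As a concrete baseline, here is what the "bare minimum" shape looks like as a stdlib-only Python sketch: read raw text records, apply a cleaning rule, write structured output. The field names and the filter are hypothetical; a production pipeline would swap the local files for HDFS/S3, the loop for Spark, the JSON lines for a binary format like Avro, and add the correctly used NoSQL storage layer that makes the pipeline "moderately complicated."

```python
import csv
import json
import tempfile
from pathlib import Path

def run_pipeline(src: Path, dest: Path) -> int:
    """Bare-minimum batch pipeline: parse CSV, drop bad rows, emit JSON lines."""
    written = 0
    with src.open() as fin, dest.open("w") as fout:
        for row in csv.DictReader(fin):
            if not row.get("user_id"):      # hypothetical cleaning rule
                continue
            record = {"user_id": row["user_id"],
                      "event": row.get("event", "unknown")}
            fout.write(json.dumps(record) + "\n")
            written += 1
    return written

# End-to-end run against a synthetic input file.
with tempfile.TemporaryDirectory() as d:
    src = Path(d) / "events.csv"
    dest = Path(d) / "events.jsonl"
    src.write_text("user_id,event\n42,click\n,view\n7,purchase\n")
    print(run_pipeline(src, dest))  # 2: the row with no user_id is dropped
```

Everything above fits in one screen; the step up to "moderately complicated" is where the single file becomes distributed storage, the schema becomes binary and versioned, and the error handling, monitoring, and backfill logic appear.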

I think data scientists assume their simple data pipeline is what data engineering is. The reality is that they’re talking about hello-world levels, while a far more complicated data pipeline is required. In the past, a data engineer performed the really difficult data engineering behind the scenes, and the data scientists didn’t have to deal with it.

You might think, “Good, so 20% of my data scientists can actually do this. I don't need a data engineer after all.” First, remember this chart is being charitable in showing data scientists’ abilities. Remember that moderately complicated is still a pretty low bar. I need to create another diagram to show how few data scientists can handle the next step up in complexity. This is where the percentage drops to 1% or less of data scientists.

Why aren’t data scientists data engineers?

Sometimes I prefer to look at the reflected manifestations of problems. Here are a few examples of the manifest problems behind data scientists’ lack of the data engineering skill set.

University and other courses

Data science is the hot new course for universities and online programs. There are all sorts of offerings out there, but virtually all of them have the same problem: they either have no data engineering coverage at all or just one data engineering class.

When I see a new university’s data science curriculum announced, I take a look at it. Sometimes, I’ll be asked to comment on a university’s proposed data science curriculum. I give the same feedback: “Are you expecting expert programmers? Because there isn’t any coverage of the programming or systems required to even consume a data pipeline that’s been created.”

The course outlines generally focus on the statistics and math required. This reflects what companies and academics think data science should look like. The real world looks rather different. The poor students are left to fend for themselves for the rest of these non-trivial learnings.

We can take a step back and look at this academically by looking at course requirements for a master’s degree in distributed systems. Obviously, a data scientist doesn’t need this level of depth, but it helps show what’s missing and the big holes in a data scientist’s skill set. There are some major deficiencies.

Data engineering != Spark

A common misconception from data scientists—and management—is that data engineering is just writing some Spark code to process a file. Spark is a good solution for batch compute, but it isn’t the only technology you’ll need. A big data solution will require 10-30 different technologies all working together.

This sort of thinking lies at the heart of big data failures. Management thinks they have a new silver bullet to kill all of their big data problems. The reality is far more complicated than that.

When I mentor an organization on big data, I check for this misconception at all layers of the organization. If it does exist, I make sure I talk about all of the technologies they’ll need. This removes the misconception that there’s an easy button in big data and there’s a single technology to solve all of it.

Where is the code from?

Sometimes data scientists will tell me how easy data engineering is. I’ll get them to tell me how and why they think that: “I can get all the code I need from StackOverflow or Reddit. If I need to create something from scratch, I can copy someone’s design from a conference talk or a whitepaper.”

To the non-engineer, this might seem OK. To the engineer, this sets off major alarm bells. Legal issues aside, this isn’t engineering. There are very few cookie-cutter problems in big data. Everything after “hello world” has complexity that needs a data engineer, because there isn’t a cookie-cutter approach to dealing with it. Copying your design from a whitepaper could lead to a poorly performing design, or worse.

I’ve dealt with a few data science teams who’ve tried this monkey-see-monkey-do approach. It doesn’t work well. This is due to big data’s spike in complexity and the extreme focus on use cases. The data science team will often drop the project as it exceeds their data engineering abilities.

Put simply, there's a big difference between “I can copy code from stackoverflow” or “I can modify something that's already been written” and “I can create this system from scratch.”

Personally, I’m worried that data science teams are going to be sources of massive technical debt that squelches big data productivity in organizations. By the time it’s found out, the technical debt may be so high that it’s infeasible to correct.

What’s the longest their code has been in production?

A core difference for data scientists is their depth, which shows in two ways. What’s the longest their code has been in production—or has it ever been in production? What is the longest, largest, or most complicated program they have ever written?

This isn’t about gamesmanship or who’s better; it’s about whether they know what happens when you put something in production and how to maintain code. Writing a 20-line program is comparatively easy. Writing 1,000 lines of code that’s maintainable and coherent is another situation altogether. People who’ve never written more than 20 lines don’t understand the miles of difference in maintainability. All of those complaints about Java’s verbosity, and the reasons programming best practices need to be used, come into focus with large software projects.

Moving fast and breaking things works well when evaluating and discovering data. Code that goes into production requires a different, more rigorous level of discipline. It’s for reasons like these that most data scientists’ code gets rewritten before it goes into production.

When they design a distributed system

One way to know the difference between data scientists and data engineers is to see what happens when they write their own distributed systems. A data scientist will write one that is very math focused but performs terribly. A software engineer with a specialization in writing distributed systems will create one that performs well and is distributed (but seriously don’t write your own). I’ll share a few stories of my interactions with organizations where data scientists created a distributed system.

A business unit that was made up of data scientists at my customer’s company created a distributed system. I was sent in to talk to them and get an understanding of why they created their own system and what it could do. They were doing (distributed) image processing.

I started out by asking them why they created their own distributed system. They responded that it wasn’t possible to distribute the algorithm. To validate their findings, they contracted another data scientist with a specialty in image processing. The data scientist contractor confirmed that it wasn’t possible to distribute the algorithm.

In the two hours I spent with the team, it was clear that the algorithm could be distributed on a general-purpose compute engine, like Spark. It was also clear that the distributed system they wrote wouldn’t scale and had serious design flaws. By having another data scientist validate their findings instead of a qualified data engineer, they had another novice programmer validate their novice findings.

At another company, run by mathematicians, they told me about the distributed system they wrote. It was written so that math problems could be run on other computers. A few things were clear after talking to them. They could have used a general-purpose compute engine and been better off. The way they were distributing and running jobs was inefficient: it was taking longer to do the RPC network traffic than it was to perform the calculation.
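That failure mode is easy to model. In the sketch below, the 5 ms per-call overhead and 10 µs per-task compute figures are invented for illustration, but the shape of the result is general: when each remote call carries a fixed round-trip cost, shipping tiny tasks one at a time lets the wire dominate the math, and batching amortizes it.

```python
import math

def total_time_us(n_tasks: int, per_call_overhead_us: int,
                  compute_us_per_task: int, batch_size: int) -> int:
    # One fixed RPC overhead per call, plus the actual compute time.
    n_calls = math.ceil(n_tasks / batch_size)
    return n_calls * per_call_overhead_us + n_tasks * compute_us_per_task

# One RPC per tiny task: the round trips swamp the math.
naive = total_time_us(10_000, per_call_overhead_us=5_000,
                      compute_us_per_task=10, batch_size=1)
# Shipping 1,000 tasks per call amortizes the round trips.
batched = total_time_us(10_000, per_call_overhead_us=5_000,
                        compute_us_per_task=10, batch_size=1_000)
print(naive, batched)  # 50100000 150000 (about 50 s vs. 0.15 s)
```

General-purpose engines like Spark bake this batching (plus data locality and scheduling) in, which is why writing your own rarely pays off.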

There are commonalities to all of these stories and others I didn’t tell:

  • Data scientists focus on the math instead of the system. The system is there to run math instead of running math efficiently.
  • Data engineers know the tricks that aren’t math. We’re not trying to cancel out infinities.
  • A data scientist asks, “how can I get a computer to do my math problems?” A data engineer asks, “how can I get a computer to do my math problems as fast and efficiently as possible?”
  • The organizations could have saved themselves time, money, and heartache by using a general-purpose engine instead of writing their own.
What’s the difference?

You’ve made it this far and I hope I’ve convinced you: data scientists are not data engineers. But really, what difference does all of this make?

The difference between a data scientist and a data engineer is the difference between an organization succeeding or failing in their big data project.

Data science from an engineering perspective

When I first started to work with data scientists, I was surprised at how little they begged, borrowed, and stole from the engineering side. On the engineering front, we have some well-established best practices that weren’t being used on the data science side. A few of these are:

  • Source control
  • Continuous integration
  • Project management frameworks like Agile or Scrum
  • IDEs
  • Bug tracking
  • Code reviews
  • Commenting code

You saw me offhandedly mention the technical debt I’ve seen in data science teams. Let me elaborate on why I’m so worried about this. When I start pushing on a data science team to use best practices, I get two answers: “we know and we’re going to implement that later” or “we don’t need these heavyweight engineering practices. We’re agile and nimble. These models won’t go into production yet.” The best practices never get implemented and that model goes straight into production. Each one of these issues leads to a compounding of technical debt.

Code quality

Would you put your intern’s code into production? If you’re in management, go ask your VP of engineering if they’ll put a second-year computer science student’s code into production. You might get a vehement no, or they might agree only after the code has been reviewed by other members of the team.

Are you going to put your data scientist’s code into production? Part of the thrust of this article is that data scientists are often novices at programming—at best—and their code is going into production. Take a look back up at the best practices that data science teams aren’t doing. There are no checks and balances to keep amateur code from going into production.

Why did they get good?

I want to end this by addressing the people who still think their data scientists are data engineers, and those data scientists who really are also qualified data engineers. I want to restate what the figure shows: it is possible, just not probable.

If this is true, I’d like you to think about why this happened.

In my experience, this happens when the ratio of data scientists to data engineers is well out of alignment, or even inverted, with zero data engineers in the organization. The ratio should be more like two to five data engineers per data scientist, because more time goes into the data engineering side than into the data science.

When teams lack the right ratio, they’re making poor use of their data scientists’ time. Data scientists tend to get stuck on the programming parts that data engineers are proficient in. I’ve seen too many data scientists spend days on something that would take a data engineer an hour. Because the problem is misperceived, organizations respond by hiring more data scientists instead of hiring the right people: the ones who would make the process more efficient.

Other times, organizations misunderstand what a data engineer is. Having unqualified data engineers, or the wrong type, is just as bad; it feeds the fallacy that you don’t need a data engineer because the ones you’ve worked with weren’t competent. You need to make sure you’re getting qualified help.

I’m often asked by management how they should get their data scientists to be more technically proficient. I respond that the real question is whether the data scientists should become more technically proficient. This is important for several reasons:

  • The point of diminishing returns comes early for a data science team that isn’t very technical to begin with. They can study for months, but may never get much better.
  • It assumes that a data scientist is a data engineer and that isn’t correct. It would be better to target the one or two individuals on the data science team with the innate abilities to get better.
  • Is there an ROI to this improvement? If the data science team gets better, what could it do better or different?
  • It assumes the highest value is to improve the data science team. The better investment may be in improving the data engineering team and facilitating better communication and relations between the data science and data engineering teams.
  • It assumes that the data scientists actually want to improve technically. I’ve found that data scientists consider data engineering a means to an end. By doing the data engineering work, they get to do the fun data science stuff.
What should we do?

Given that a data scientist is not a data engineer, what should we do? First and foremost, we have to understand what data scientists and data engineers do. We have to realize this isn’t about titles, or about limiting people based on them. This is about a fundamental difference in what each person is good at: their core strengths.

Having a data scientist do data engineering tasks is fraught with failure. Conversely, having a data engineer do data science is fraught with failure. If your organization is trying to do data science, you need both people. Each person fulfills a complementary and necessary role.

For larger organizations, you will start to see the need for people who split the difference between the data scientist and data engineer skill sets. I recommend the management team look at creating a machine learning engineer title and hiring for it.

Success with big data

As you’ve seen here, the path to success with big data isn’t just technical—there are critical management parts. Misunderstanding the nature of data scientists and data engineers is just one of those. If you’re having trouble with your big data projects, don’t just look for technical reasons. The underlying issue may be a management or team failure.

As you’re doing a root cause analysis of why a big data project stalled or failed, don’t just look at or blame the technology. Also, don’t just take the data science team’s explanation because they may not have enough experience to know or understand why it failed. Instead, you’ll need to go deeper—and often more painfully—to look at the management or team failings that led to a project failure.

Failures like these form a repeating and continuous pattern. You can move to the newest technology, but that alone doesn’t fix the systemic issues. Only by fixing the root issue can you start to be successful.

Continue reading Why a data scientist is not a data engineer.

Categories: Technology

Four short links: 8 April 2019

O'Reilly Radar - Mon, 2019/04/08 - 05:10

Chinese Livestreaming, Tech and Teens, YouTube Professionalizing, and Inclusive Meetings

  1. Inside the Dystopian Reality of China's Livestreaming Craze -- Livestreaming exacts a huge mental toll on the people who do it. It’s easy money, but also toxic. Overeggs the dystopia (all interaction is a performance, professional interaction no less so), but is still a quick précis of where livestreaming is at in China. As for the toxic money, just ask Justin Kan.
  2. Screens, Teens, and Psychological Well-Being: Evidence From Three Time-Use-Diary Studies -- We found little evidence for substantial negative associations between digital-screen engagement—measured throughout the day or particularly before bedtime—and adolescent well-being.
  3. The Golden Age of YouTube is Over (The Verge) -- By promoting videos that meet certain criteria, YouTube tips the scales in favor of organizations or creators—big ones, mostly—that can meet those standards. My favorite part is where YouTube refers to the people who made it popular as our endemic creators, a phrase that'd make Orwell stabbier than usual.
  4. Inclusive Scientific Meetings -- This document presents some concrete recommendations for how to incorporate inclusion and equity practices into scientific meetings, from the ground up. This document includes three sections: planning the meeting; during the meeting; and assessing the meeting. A great cheatsheet that applies to non-science meetings, too.

Continue reading Four short links: 8 April 2019.

Categories: Technology

Four short links: 5 April 2019

O'Reilly Radar - Fri, 2019/04/05 - 04:05

DIY Bio, Perl, Knowledge Graph Learning, and Amazon Memos

  1. Engineering Proteins in the Cloud -- Amazingly, we're pretty close to being able to create any protein we want from the comfort of our Jupyter Notebooks, thanks to developments in genomics, synthetic biology, and most recently, cloud labs. In this article, I'll develop Python code that will take me from an idea for a protein all the way to expression of the protein in a bacterial cell, all without touching a pipette or talking to a human. The total cost will only be a few hundred dollars! Using Vijay Pande from A16Z's terminology, this is Bio 2.0.
  2. 93% of Paint Splatters are Valid Perl Programs (Colin McMillen) -- tongue-in-cheek, but clever. I, of course, am fluent in those paint splatters. Have written a best-selling book on executable paint splatters. I should feel called-out, I guess, but it's too funny for me to feel much pain.
  3. AmpliGraph -- Python library for representation learning on knowledge graphs. [...] Use AmpliGraph if you need to: (1) Discover new knowledge from an existing knowledge graph. (2) Complete large knowledge graphs with missing statements. (3) Generate stand-alone knowledge graph embeddings. (4) Develop and evaluate a new relational model.
  4. Writing Docs at Amazon -- how to write those famous six-page narrative memos as preparation for meeting with Jeff Bezos, from someone who was there. As much about the meetings as the memos, as it should be.

Continue reading Four short links: 5 April 2019.

Categories: Technology

Installfest on April 6th is Cancelled

PLUG - Thu, 2019/04/04 - 09:54
The location is not available for this Saturday's Installfest so we've regrettably needed to cancel this particular meeting. We hope to see you at the next one though.

Meeting Topics for April 11th - Robots and Relics

PLUG - Thu, 2019/04/04 - 09:10
We've got Robots and Relics for this months meeting.   William Lindley will give us a history of computers with various items of historical significance that he's acquired over time and then Scott and Wendy Rose will introduce us to SUNI the robot.

William Lindley: 75 Years of Computing in 60 Minutes

The roots of modern digital computing go back nearly two hundred years, and through a series of pass-around artifacts from the past century, we will explore how the pioneers of the field and their groundbreaking decisions and technologies have led us -- for better or worse -- to today's Internet-enabled world.

About William:
Mr. Lindley has been in the computer industry since he sold his first program (a printer driver for Heathkit HDOS) in 1980. He has used systems from the earliest 8-bit microprocessors, through the PDP-11 and VAX, up to IBM mainframes, and has managed to write programs that did not crash on most of them. Mr. Lindley has been a GNU/Linux user since 1992 and has been free of proprietary software since 2001. Most recently he has been pleased to be an adjunct professor at Mesa Community College.

Scott & Wendy Rose: SUNI the Robot presents STEM in America open source edition

Team SUNI will discuss the future of robotics through STEM in America and the future jobs it will create. Discussion will center around SUNI's open-source programming using Linux Ubuntu OS and Robot Operating System.

SUNI the robot is cutting-edge robotic technology from across planet internet, creating a truly unique educational experience. And the creative ideas that power SUNI’s adventures to AI are a worldwide collaboration. Team SUNI knows how to creatively inspire and we built SUNI the Robot to prove it.

STEM drives our world, our economy, our lives. It builds brighter futures for the 100% of us. STEM is all around us, every minute, of every hour, of every day. We’ve captured the magic of STEM within a robot as an instrument of inspiration, learning, and wonderment. Winning thousands of hearts and minds along the way. And SUNI’s adventures continue…

Four short links: 4 April 2019

O'Reilly Radar - Thu, 2019/04/04 - 04:15

Language Creators, Undersea Cable, Open Source Trends, Making Math Questions

  1. A Conversation with Language Creators: Guido, James, Anders, and Larry (YouTube) -- A lot of people make the mistake of thinking that languages move at the same speed as hardware or all of the other technologies we live with. But languages are much more like math and much more like the human brain, and they all have evolved slowly. And we're still programming in languages that were invented 50 years ago. All the principles of functional programming were thought of more than 50 years ago.
  2. Undersea Internet Cables and Big Internet Companies (APNIC) -- interesting numbers. Between 2016 and 2020, about 100 new cables have been laid or planned. [...] The unit cost is cheaper for new cables than old cables whose lit capacity is increased. [...] In the last five years, the cables that are partly owned by Google, Facebook, Microsoft, and Amazon have risen eight-fold, and there are more such cables in the pipeline. These content providers also consume over 50% of all international bandwidth, and TeleGeography projects that by 2027, they could consume over 80%.
  3. Making Sense of a Crazy Year in Open Source -- if you haven't kept your eye on the latest weirdness in open source licensing (as companies attempt to squeeze commercial leverage from licenses), this is a great intro. Elastic CEO Shay Banon summed it up, saying: “We now have three tiers: open source and free, free but under a proprietary license, and paid under a proprietary license.”
  4. Mathematics Data Set (GitHub) -- This data set code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This is designed to test the mathematical learning and algebraic reasoning skills of learning models. Not what Dan Meyer would call good problems, mind you!

Continue reading Four short links: 4 April 2019.

Categories: Technology

150+ live online training courses opened for April and May

O'Reilly Radar - Wed, 2019/04/03 - 05:05

Get hands-on training in TensorFlow, cloud computing, blockchain, Python, Java, and many other topics.

Learn new topics and refine your skills with more than 150 new live online training courses we opened up for April and May on the O'Reilly online learning platform.

AI and machine learning

Deep Learning from Scratch, April 19

Beginning Machine Learning with Pytorch, May 1

Intermediate Machine Learning with PyTorch, May 2

Natural Language Processing (NLP) from Scratch, May 6

Deep Learning Fundamentals, May 7

Hands-On Machine Learning with Python: Classification and Regression, May 9

Hands-On Machine Learning with Python: Clustering, Dimension Reduction, and Time Series Analysis, May 10

Sentiment Analysis for Chatbots in Python, May 14

Artificial Intelligence: An Overview of AI and Machine Learning, May 15

TensorFlow Extended: Data Validation and Transform, May 16

Deep Learning for Natural Language Processing (NLP), May 16

Getting Started with Machine Learning, May 20

TensorFlow Extended: Model Build, Analysis, and Serving, May 21

Essential Machine Learning and Exploratory Data Analysis with Python and Jupyter Notebook, May 28-29

Artificial Intelligence: Real-World Applications, May 29


Blockchain

Introducing Blockchain, May 3

Introduction to Distributed Ledger Technology for Enterprise, May 9

Blockchain and Cryptocurrency Essentials, May 16

Certified Blockchain Solutions Architect (CBSA) Certification Crash Course, May 22

An Introduction to Ethereum DApps, May 23


Business

Spotlight on Cloud: The Future of Internet Security with Bruce Schneier, April 4

Product Management in 90 Minutes, April 11

Spotlight on Data: Creating Smart Products Requires Collaboration, with Gretchen Anderson, April 15

Agile for Everybody, April 18

Developing Your Coaching Skills, April 22

Spotlight on Innovation: Building Resilient Systems with Heidi Waterhouse and Chris Guzikowski, April 23

Managing Stress and Building Resiliency, May 2

Having Difficult Conversations, May 6

Unlock Your Potential, May 7

Spotlight on Innovation: A Trader’s Journey to Python and Beyond, with David Bednarczyk, May 8

60 Minutes to Better Product Metrics, May 9

Business Fundamentals, May 10

Building the Courage to Take Risks, May 13

Competing with Business Strategy, May 14

Better Business Writing, May 15

Leadership Communication Skills for Managers, May 16

Performance Goals for Growth, May 21

Introduction to Critical Thinking, May 22

Introduction to Delegation Skills, May 22

Negotiation Fundamentals, May 23

Giving a Powerful Presentation, May 28

Emotional Intelligence in the Workplace, May 31

Getting Unstuck, June 3

Introduction to Google Cloud Platform, June 3-4

Data science and data tools

Apache Hadoop, Spark, and Big Data Foundations, April 22

Fraud Analytics Using Python, April 30

Hands-On Algorithmic Trading With Python, May 1

Data Structures in Java, May 1

Cleaning Data at Scale, May 13

Big Data Modeling, May 13-14

Quantitative Trading with Python, May 15

Fundamentals of Data Architecture, May 20-21

Time Series Forecasting, May 22

Practical Data Cleaning with Python, May 22-23

Introduction to Google Cloud Platform, June 3-4


Programming

Java Full Throttle with Paul Deitel: A Code-Intensive One-Day Course, April 22

Scala Fundamentals: From Core Concepts to Real Code in 5 Hours, May 1

Hands-On Introduction to Apache Hadoop and Spark Programming, May 1-2

C# Programming: A Hands-On Guide, May 2

Java 8 Generics in 3 Hours, May 2

Learning Python 3 by Example, May 2

Getting Started with Java, May 3

Programming with Data: Foundations of Python and Pandas, May 6

SOLID Principles of Object-Oriented and Agile Design, May 7

Introduction to the Go Programming Language, May 7

Beginner’s Guide to Writing AWS Lambda Functions in Python, May 7

Getting Started with Pandas, May 7

Design Patterns Boot Camp, May 8-9

Modern JavaScript, May 13

Python Full Throttle with Paul Deitel, May 13

Reactive Spring Boot, May 13

Reactive Programming with Java Completable Futures, May 13

What's New In Java, May 14

Advanced SQL Series: Window Functions, May 14

Introduction to Python Programming, May 14

Next-Generation Java Testing with JUnit 5, May 15

Intro to Mathematical Optimization, May 15

Programming with Java Lambdas and Streams, May 16

IoT Fundamentals, May 16-17

Working with Dataclasses in Python 3.7, May 21

Test-Driven Development In Python, May 21

Rust Programming: A Crash Course, May 22

Pythonic Object-oriented Programming, May 22

Automating Go Projects, May 23

Python: The Next Level, May 23-24

Getting Started with OpenStack, May 24

Ground Zero Programming with JavaScript, May 28

OCA Java SE 8 Programmer Certification Crash Course, May 28-30

Mastering the Basics of Relational SQL Querying, May 29-30

Scalable Concurrency with the Java Executor Framework, May 30


Security

CCNA Cyber Ops SECFND Crash Course 210-250, April 12

CCNA Cyber Ops SECOPS Crash Course 210-255, April 22

Certified Ethical Hacker (CEH) Crash Course, May 2-3

Security Operation Center (SOC) Best Practices, May 3

Introduction to Encryption, May 22

Getting Started with Cyber Investigations and Digital Forensics, May 23

Cyber Security Fundamentals, May 23-24

CompTIA Security+ SY0-501 Crash Course, May 29-30

Ethical Hacking Bootcamp with Hands-on Labs, May 29-31

Systems engineering and operations

Analyzing Software Architecture, April 16

Automating Architectural Governance Using Fitness Functions, April 22

Next Level Git - Master Your Workflow, April 22

Continuous Delivery with Jenkins and Docker, April 24

Bootiful Testing, April 29

Comparing Service-Based Architectures, April 30

Getting Started with OpenShift, May 1

AWS Certified Developer Associate Crash Course, May 1-2

Linux Under the Hood, May 2

AWS Certified Solutions Architect Associate Crash Course, May 6-7

Building a Deployment Pipeline with Jenkins 2, May 8-9

AWS Account Setup Best Practices, May 10

How Networks Really Work, May 10

Introduction to Docker CI/CD, May 13

Linux Troubleshooting, May 13

Jenkins 2 - Up and Running, May 13

Introduction to Knative, May 14

AWS for Mobile App Developers, May 14

Google Cloud Certified Associate Cloud Engineer Crash Course, May 15-16

Getting Started with Cloud Computing, May 16

Managing Containers on Linux, May 16

AWS Certified SysOps Administrator (Associate) Crash Course, May 16-17

Google Cloud Platform (GCP) for AWS Professionals, May 17

Software Architecture by Example, May 17

Microservices Caching Strategies, May 17

Kubernetes in 4 Hours, May 17

Introduction to Docker Images, May 20

Chaos Engineering: Planning and Running Your First Game Day, May 20

AWS Managed Services, May 20-21

Kafka Fundamentals, May 20-21

Architecture for Continuous Delivery, May 21

9 Steps to Awesome with Kubernetes, May 21

AWS Certified Security - Specialty Crash Course, May 21-22

Istio on Kubernetes: Enter the Service Mesh, May 22

AWS Machine Learning Specialty Certification Crash Course, May 22-23

Getting Started with Amazon Web Services (AWS), May 22-23

Automating Architectural Governance Using Fitness Functions, May 23

Chaos Engineering: Planning, Designing, and Running Automated Chaos Experiments, May 23

Introduction to Docker Containers, May 24

Building and Managing Kubernetes Applications, May 24

Docker for JVM projects, May 28

Cloud Complexity Management, May 28

Ansible for Managing Network Devices, May 28

Understanding AWS Cloud Compute Options, May 28-29

Red Hat Certified System Administrator (RHCSA) Crash Course, May 28-31

Kubernetes Serverless with Knative, May 29

Next Level Git - Master Your Content, May 30

Learning MongoDB - A Hands-on Guide, May 30

AWS Certified Cloud Practitioner Exam Crash Course, May 30-31

Comparing Service-Based Architectures, May 31

Amazon Web Services (AWS) Technical Essentials, June 3

Implementing and Troubleshooting TCP/IP, June 3

Introduction to Google Cloud Platform, June 3-4

Continue reading 150+ live online training courses opened for April and May.

Categories: Technology

Four short links: 3 April 2019

O'Reilly Radar - Wed, 2019/04/03 - 04:05

HTML DRM, Toxic Incentives, Moral Crumple Zones, and Stats + Symbols

  1. The Effects of HTML's DRM -- middlemen DRM vendors can say "no" to your software playing video.
  2. YouTube Executives Ignored Warnings, Letting Toxic Videos Run Rampant (Bloomberg) -- The company spent years chasing one business goal above others: “Engagement,” a measure of the views, time spent and interactions with online videos. Conversations with over 20 people who work at, or recently left, YouTube reveal a corporate leadership unable or unwilling to act on these internal alarms for fear of throttling engagement. How you incentivize your product managers matters.
  3. Moral Crumple Zones: Cautionary Tales in Human-Robot Interaction -- Just as the crumple zone in a car is designed to absorb the force of impact in a crash, the human in a highly complex and automated system may become simply a component—accidentally or intentionally—that bears the brunt of the moral and legal responsibilities when the overall system malfunctions.
  4. Combining Symbols and Statistics So Machines Can Reason About What They See (MIT) -- overview of a paper that combines reasoning (symbols) with perception (statistics). Combining the two is one piece of progressing AI.

Continue reading Four short links: 3 April 2019.

Categories: Technology

0x65: Linux Foundation's Community Bridge

FAIF - Tue, 2019/04/02 - 05:21

Bradley and Karen discuss and critique the new initiative by the Linux Foundation called CommunityBridge. The podcast includes analysis that expands upon their blog post about Linux Foundation's CommunityBridge.

Show Notes: Segment 0 (00:36)
  • Conservancy helped Free Software Foundation and GNOME Foundation begin fiscal sponsorship work. (07:50)
  • Conservancy has always been very coordinated with Software in the Public Interest, which is a FOSS fiscal sponsor that predates Conservancy. (08:26)
  • Conservancy helped NumFocus get started as a fiscal sponsor by providing advice. (08:53)
  • The above are all 501(c)(3) charities, but there are also 501(c)(6) fiscal sponsors, such as Linux Foundation and Eclipse Foundation. (10:00)
  • Bradley mentioned that projects that are forks can end up in different fiscal sponsors, such as Hudson being in Eclipse Foundation, and Jenkins being associated with a Linux Foundation sub-org. (10:30)
  • Bradley mentioned that any project — be it SourceForge, GitHub, or Community Bridge — that attempts to convince FOSS developers to use proprietary software for their projects is immediately suspect (12:00)
  • Open Collective, a for-profit company seeking to do fiscal sponsorship (but attempting to release their code for it) is likely under the worst “competitive” threat from this initiative. (19:50)
Segment 1 (21:23) Segment 2 (36:54)
  • Karen mentioned that CommunityBridge also purports to address diversity and security issues for FOSS projects. (37:00)
  • Bradley mentioned the code hosted on and also the Reimbursenator project that PSU students wrote. (42:00)
Segment 3 (42:44)

Bradley and Karen discuss (or, possibly, don't) what's coming up on the next episode. Fact of the matter is that this announcement wasn't written yet when we recorded this episode, and we weren't sure if 0x65 would be released before or after that announcement. We'll be discussing that topic on 0x66.

Send feedback and comments on the cast to <>. You can keep in touch with Free as in Freedom on our IRC channel, #faif on, and by following Conservancy on and and Twitter.

Free as in Freedom is produced by Dan Lynch of Theme music written and performed by Mike Tarantino with Charlie Paxson on drums.

The content of this audcast, and the accompanying show notes and music are licensed under the Creative Commons Attribution-Share-Alike 4.0 license (CC BY-SA 4.0).

Categories: Free Software

Specialized tools for machine learning development and model governance are becoming essential

O'Reilly Radar - Tue, 2019/04/02 - 04:00

Why companies are turning to specialized machine learning tools like MLflow.

A few years ago, we started publishing articles (see “Related resources” at the end of this post) on the challenges facing data teams as they start taking on more machine learning (ML) projects. Along the way, we described a new job role and title—machine learning engineer—focused on creating data products and making data science work in production, a role that was beginning to emerge in the San Francisco Bay Area two years ago. At that time, there weren’t any popular tools aimed at solving the problems facing teams tasked with putting machine learning into practice.

About 10 months ago, Databricks announced MLflow, a new open source project for managing machine learning development (full disclosure: Ben Lorica is an advisor to Databricks). We thought that given the lack of clear open source alternatives, MLflow had a decent chance of gaining traction, and this has proven to be the case. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow.

So, why is this new open source project resonating with data scientists and machine learning engineers? Recall the following key attributes of a machine learning project:

  • Unlike traditional software where the goal is to meet a functional specification, in ML the goal is to optimize a metric.
  • Quality depends not just on code, but also on data, tuning, regular updates, and retraining.
  • Those involved with ML usually want to experiment with new libraries, algorithms, and data sources—and thus, one must be able to put those new components into production.

MLflow’s success can be attributed to a lightweight “open interface” that allows users to hook up their favorite machine learning libraries, and the availability of three components that users can pick and choose from (i.e., they can use one, two, or all three of the following):

Figure 1. Image by Matei Zaharia; used with permission.

The fact that one can pick and choose which MLflow component(s) to use means the project is able to quickly serve the needs of a diverse set of users. Based on a recent survey we conducted of MLflow users, here are some of the most popular use cases:

  • Tracking and managing large numbers of machine learning experiments: MLflow is useful for an individual data scientist tracking his/her own experiments, but it is also designed to be used by companies with large teams of machine learning developers who are using it to track thousands of models.
  • MLflow is being used to manage multi-step machine learning pipelines.
  • Model packaging: companies are using MLflow to incorporate custom logic and dependencies as part of a model’s package abstraction before deploying it to their production environment (example: a recommendation system might be programmed to not display certain images to minors).
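
At its core, the tracking use case is just recording parameters and metrics against named runs and querying them later. A toy pure-Python sketch of that idea (this illustrates the concept only; it is not MLflow's actual API):

```python
import json
import time
import uuid

class RunTracker:
    """Toy experiment tracker: one record per run (illustration only)."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        record = {
            "run_id": uuid.uuid4().hex,
            "timestamp": time.time(),
            "params": params,    # e.g., hyperparameters
            "metrics": metrics,  # e.g., validation scores
        }
        self.runs.append(record)
        return record["run_id"]

    def best_run(self, metric, maximize=True):
        # The query tracking tools answer constantly: which of my
        # thousands of runs did best on this metric?
        sign = 1 if maximize else -1
        return max(self.runs, key=lambda r: sign * r["metrics"][metric])

tracker = RunTracker()
tracker.log_run({"alpha": 0.1}, {"rmse": 0.92})
tracker.log_run({"alpha": 0.5}, {"rmse": 0.78})
print(json.dumps(tracker.best_run("rmse", maximize=False)["params"]))
```

Multiply this by thousands of runs, dozens of libraries, and a team of developers, and the value of a shared tracking server over ad hoc spreadsheets becomes clear.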

The upcoming 0.9.0 release has many new features, including support for database stores for the MLflow Tracking Server, which will make it easier for large teams to query and track ongoing and past experiments.

We are still in the early days for tools supporting teams developing machine learning models. Besides MLflow, there are startups like and that are building tools to ease similar pain points. As ML development begins to resemble software development over the next few years, we expect to see more investments in tools.

Model governance

Companies need to look seriously at the improved tools for developing machine learning models, many of which are part of more ambitious tool suites. Machine learning can’t be limited to researchers with Ph.D.s; there aren’t enough of them. Machine learning is in the process of democratization; tools that make it possible for software developers to build and train models are essential to this process.

We’ve also said the number of machine learning models that are deployed in production will increase dramatically: many applications will be built from many models, and many organizations will want to automate many different aspects of their business. And those models will age and need to be re-trained periodically. We’ve become accustomed to the need for data governance and provenance, understanding and controlling the many databases that are combined in a modern data-driven application. We’re now realizing the same is true for models, too. Companies will need to track the models they’re building and the models they have in production.

Startups like Datatron are beginning to use the term “model governance” to describe the task of tracking and managing models, and they are beginning to build model governance tools into their product suites. This term describes the processes that enterprises and large companies are starting to use to understand the many ML initiatives and projects teams are working on. Regulators are also signalling their interest in products that rely on AI and machine learning, thus systems for managing ML development are going to be required to comply with future legislation. Here are some of the elements that are going to play a role in building a model governance solution:

  • A database for authorization and security: who has read/write access to certain models
  • A catalog or a database that lists models, including when they were tested, trained, and deployed
  • Metadata and artifacts needed for audits: as an example, the output from the components of MLflow will be very pertinent for audits
  • Systems for deployment, monitoring, and alerting: who approved and pushed the model out to production, who is able to monitor its performance and receive alerts, and who is responsible for it
  • A dashboard that provides custom views for all principals (operations, ML engineers, data scientists, business owners)
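
The catalog and audit items above boil down to keeping a structured record per model. A minimal sketch with Python dataclasses; the fields and values are invented for illustration, not a standard governance schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ModelRecord:
    """One catalog entry per model (illustrative fields only)."""
    name: str
    version: str
    owner: str            # who is responsible for the model
    approved_by: str      # who approved the production push
    trained_on: date
    deployed_on: date
    readers: list = field(default_factory=list)  # read access
    writers: list = field(default_factory=list)  # write access

catalog = [
    ModelRecord(
        name="churn-predictor", version="1.4.2", owner="ml-team",
        approved_by="jane.doe", trained_on=date(2019, 3, 1),
        deployed_on=date(2019, 3, 15), readers=["ops"], writers=["ml-team"],
    ),
]

# One audit query governance tooling should answer instantly:
# which models haven't been retrained in over 90 days?
stale = [m for m in catalog if (date(2019, 4, 2) - m.trained_on).days > 90]
print([m.name for m in stale])
```

A real system would back this with a database, access control, and alerting, but even a record this simple answers the auditor's first questions: who trained it, who approved it, and when.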

Traditional software developers have long had tools for managing their projects. These tools serve functions like version control, library management, deployment automation, and more. Machine learning engineers know and use all those tools, but they’re not enough. We’re beginning to see the tool suites that provide the features machine learning engineers need, including tools for model governance, tracking experiments, and packaging models so that results are repeatable. The next big step in the democratization of machine learning is making it more manageable: not simply hand-crafted artisanal solutions, but solutions that make machine learning manageable and deployable at enterprise scale.

Related resources:

Continue reading Specialized tools for machine learning development and model governance are becoming essential.

Categories: Technology

De-biasing language

O'Reilly Radar - Tue, 2019/04/02 - 04:00

The toughest bias problems are often the ones you only think you’ve solved.

In a recent paper, Hila Gonen and Yoav Goldberg argue that methods for de-biasing language models aren’t effective; they make bias less apparent, but don’t actually remove it. De-biasing might even make bias more dangerous by hiding it, rather than leaving it out in the open. The toughest problems are often the ones you only think you’ve solved.

Language models are based on “word embeddings,” which are essentially lists of word combinations derived from human language. There are some techniques for removing bias from word embeddings. I won’t go into them in detail, but for the sake of argument, imagine taking the texts of every English language newspaper and replacing “he,” “she,” and other gender-specific words with “they” or “them.” (Real techniques are more sophisticated, of course.) What Gonen and Goldberg show is that words still cluster the same way: stereotypically female professions still turn up as closely related (their example is nurse, caregiver, receptionist, and teacher).
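The clustering Gonen and Goldberg observe can be made concrete with cosine similarity, the standard measure of closeness between word vectors. The tiny two-dimensional “embeddings” below are invented for illustration only; real word vectors have hundreds of dimensions and are learned from large corpora:

```python
import math

# Toy 2-d vectors, invented for illustration; not real embeddings.
vectors = {
    "nurse":      (0.9, 0.1),
    "teacher":    (0.8, 0.2),
    "caregiver":  (0.85, 0.15),
    "engineer":   (0.1, 0.9),
    "programmer": (0.2, 0.8),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Words that cluster together stay close to each other even after an
# explicit "gender direction" is projected out: the similarity
# structure lives in the data itself, not in one dimension.
print(cosine(vectors["nurse"], vectors["caregiver"]))   # high (near 1)
print(cosine(vectors["nurse"], vectors["programmer"]))  # much lower
```

Removing one axis from vectors like these changes the coordinates but largely preserves which words are near which, which is exactly the paper’s point.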

I’m not at all surprised at the result. Stereotypes go deeper than pronouns and other obvious indications of gender, and are deeply embedded in the way we use language. Do nurse, teacher, and caregiver cluster together because they’re all about “caring,” and do traditionally masculine professions cluster differently? I suspect the connections are much more complex, but something along those lines is probably happening. It’s not a problem if “caring” professions cluster together—but what about the connections between these words and other words?

Gonen and Goldberg point out that explicit male/female associations aren’t really the issue: “algorithmic discrimination is more likely to happen by associating one implicitly gendered term with other implicitly gendered terms, or by picking up on gender-specific regularities in the corpus by learning to condition on gender-biased words, and generalizing to other gender-biased words (i.e., a resume classifier that will learn to favor male over female candidates based on stereotypical cues in an existing—and biased—resume data set, despite being “oblivious” to gender).” That is, an AI that screens job applications for a “programming” position could be biased against women without knowing anything explicit about gender; it would just know that “programmer” clusters with certain words that happen to appear more often in resumes that come from men.
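That proxy effect is easy to reproduce in miniature. In this sketch (the data is invented, and a bare-bones perceptron stands in for a real resume classifier), the model never sees a gender feature at all, yet learns to weight hobby words that merely correlate with the biased hiring labels:

```python
# Hypothetical training data: resumes as bags of words, labeled by a
# historically biased process. No gender feature appears anywhere, but
# in this invented corpus the hobby words correlate with gender.
resumes = [
    ({"python", "football"}, 1),
    ({"java", "football"}, 1),
    ({"python", "netball"}, 0),
    ({"java", "netball"}, 0),
]

weights = {}

def score(words):
    return sum(weights.get(w, 0.0) for w in words)

# A few perceptron passes: the model learns to condition on the
# correlated proxy words, not on the skills.
for _ in range(10):
    for words, label in resumes:
        pred = 1 if score(words) > 0 else 0
        for w in words:
            weights[w] = weights.get(w, 0.0) + 0.1 * (label - pred)

print(weights["football"] > weights["netball"])  # prints True
```

The classifier is “oblivious” to gender in exactly the sense Gonen and Goldberg describe, and discriminates anyway.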

So let’s ask some other difficult questions. Would de-biasing language around race and ethnicity achieve the same (lack of) result? I would like to see that studied. Robyn Speer’s work, described in “How To Make a Racist AI Without Really Trying,” suggests that de-biasing for race is at least partially successful, though Speer asks: “Can we stop worrying about algorithmic racism? No. Have we made the problem a lot smaller? Definitely.” I hope she’s right; I’m less convinced now than I was a few months ago. I can certainly imagine racial stereotypes persisting even after bias has been removed. What about anti-semitic language? What about other stereotypes? One researcher I know discovered an enterprise security application that was biased against salespeople, who were considered more likely to engage in risky behavior.

We’re uncovering biases that are basic to our use of language. It shouldn’t surprise anyone that these biases are built in to the way we communicate, and that they go fairly deep. We can remove gender as a factor in word embeddings, but that doesn’t help much. Turkish, for example, doesn’t have gendered pronouns, a fact that has revealed gender bias in automated translation, where gender-neutral Turkish sentences are translated as gender-specific English sentences. But no one would claim that gender bias doesn’t exist in Turkish; it’s just encoded differently. Likewise, we can remove race and ethnicity as a factor in word embeddings, but that, at best, leaves us with a smaller problem. Language is only a symptom of bigger issues. These biases are part of what we are, and these word associations, including the associations we’d prefer to disown, are part of what makes language work.

The problem we’re facing in natural language processing (as in any application of machine learning) is that fairness is aspirational and forward looking; data can only be historical, and therefore necessarily reflects the biases and prejudices of the past. Learning how to de-bias our applications is progress, but the only real solution is to become better people. That’s more easily said than done; it’s not clear that being more conscious about how we talk will remove these hidden biases, any more than rewriting “he” and “she” as “they.” And being too conscious of how we talk can easily lead to a constant state of self-censorship that makes conversation impossible, and specifically the kinds of conversations we need to make progress.

If there’s any one thing that will remove those biases, it is being more aware of how we act. Returning to Gonen and Goldberg’s study of professional stereotypes: the way to change those problematic word embeddings permanently isn’t to tweak the data, but to make hiring decisions that aren’t based on stereotypes, and to treat the people we hire fairly and with respect, regardless of gender (or race or ethnicity). If we act differently, our language will inevitably change to reflect our actions.

I am hopeful that machine learning will help us leave behind a biased and prejudiced past, and build a future that’s more fair and equitable. But machine learning won’t make prejudice disappear by forcing us to rely on data when the data itself is biased. And the myth that the apparent abstraction and mathematical rationality of machine learning is unbiased only lends the prestige of math and science to prejudice and stereotype. If machine learning is going to help, we’ll need to understand that progress is incremental, not absolute. Machine learning is a tool, not a magic wand, and it’s capable of being misused. It can hold a mirror to our biases, or it can hide them. Real progress relies on us, and the road forward will be neither simple nor easy.

Continue reading De-biasing language.

Categories: Technology

Four short links: 2 April 2019

O'Reilly Radar - Tue, 2019/04/02 - 03:45

Content Moderation, Speech in 1.6kbps, Science is Hard, and Forensic Typography

  1. Your Speech, Their Rules: Meet the People Who Guard the Internet (Medium) -- Adam: “Six months ago we told you, ‘Don’t pave the city with banana peels.’ You decided, ‘Let’s see what happens if we pave the city with banana peels.’ We are now here to clean up the injuries.”
  2. A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet -- this is witchcraft. Skip straight to the demos and have your mind blown. 8 kb/s used to be the norm for crappy audio, but this is better quality in 20% of the bandwidth.
  3. Statistically Controlling for Confounding Constructs Is Harder than You Think -- Counterintuitively, we find that error rates are highest—in some cases approaching 100%—when sample sizes are large and reliability is moderate. Our findings suggest that a potentially large proportion of incremental validity claims made in the literature are spurious.
  4. Forensic DEC CRT Typography -- recreating the real look of a VT100.

Continue reading Four short links: 2 April 2019.

Categories: Technology

Four short links: 1 April 2019

O'Reilly Radar - Mon, 2019/04/01 - 03:55

Communist RuneScape, API Versioning, Computer Graphics, User Stories

  1. The Communist Revolution inside RuneScape (Emilie Rākete) -- In 2007, a communist RuneScape clan was formed to bring proletarian rule to Server 32 of the world of Gielinor. In a context of scattered clan infighting, the RuneScape communist party was a rampantly victorious social force. Under the wise leadership of SireZaros, the communists waged a revolutionary struggle against reactionary and bourgeois clans that saw more than 5,000 player characters killed in the fighting.
  2. Back-end/Front-end Versioning (Christian Findlay) -- A submission can be rejected [from Google/Apple App Store] for any number of reasons, and it can take up to several days for any one submission to reach the store. On top of this, any user can choose to delay an upgrade, and many users will be on older phones that are not compatible with your current front-end API version. This leaves a situation where front-end versions may be out of sync with each other, or out of sync with the latest back-end version. Here is a quick look at two patterns that might emerge as a strategy to solve the problem.
  3. Introduction to Computer Graphics -- a free, online textbook covering the fundamentals of computer graphics and computer graphics programming.
  4. Engineering Guide to Writing User Stories -- the headings are: Using consistent language; Users do not want your stuff; Removing technical details; Clarifying roles; Making user stories verifiable; Spotting the incompleteness; Ranking user stories.

Continue reading Four short links: 1 April 2019.

Categories: Technology

Four short links: 29 March 2019

O'Reilly Radar - Fri, 2019/03/29 - 01:00

Programming Languages, Asset Graphing, Statistical Tests, and Embeddable WebAssembly

  1. Programmer Migration Patterns -- I made a little flow chart of mainstream programming languages and how programmers seem to move from one to another.
  2. cartography -- a Python tool that consolidates infrastructure assets and the relationships between them in an intuitive graph view powered by a Neo4j database. Video.
  3. Common statistical tests are linear models (or: how to teach stats) -- the linear models underlying common parametric and non-parametric tests. Formulating all the tests in the same language highlights the many similarities between them.
  4. lucet -- a native WebAssembly compiler and runtime. It is designed to safely execute untrusted WebAssembly programs inside your application. Open source, from Fastly. Announcement.

Continue reading Four short links: 29 March 2019.

Categories: Technology

