Feed aggregator

Highlights from the O'Reilly Velocity Conference in San Jose 2019

O'Reilly Radar - Wed, 2019/06/12 - 13:00

Experts explore cloud native infrastructure, SRE, distributed systems, and more.

People from across the cloud native and distributed systems worlds came together in San Jose for the O'Reilly Velocity Conference. Below you'll find links to highlights from the event.

Scaling teams with technology (or is it the other way around?)

Chen Goldberg shares how Kubernetes, Istio, GKE, and Anthos can help build distributed systems and happy teams.

Which edge do you need: Managing multiple edges to deliver the next industrial revolution

Dave Andrews walks through the ideas that are informing Verizon Digital Media Services’ edge compute design, including multiedge scenarios, reliability, and failover.

Infrastructure first: Because solving complex problems needs more than technology

Everett Harper looks at the key practices that are crucial for solving our most critical challenges.

From puzzles to products

Jessica Kerr argues that most programming careers aren’t about writing software; they’re about changing it.

How do we heal?

Drawing inspiration from restorative justice practices and her own journey of healing, Alex Qin offers a hopeful vision for how we can come together and co-create the world we yearn for.

Cultivating production excellence

Liz Fong-Jones says management of complex distributed systems requires changing who's involved in production, how they collaborate, and how success is measured.

Kubernetes for the impatient

Bridget Kromhout looks over the cloud native landscape and talks about what’s new, what’s next, and what you need to get started with Kubernetes right now.

The SRE I aspire to be

Yaniv Aknin dives into the secret sauce for a successful SRE organization: high-quality measurements of reliability.

Channel into the universe of eventually perfect distributed systems

Modern distributed systems are immensely different from distributed systems of just a decade ago. Lena Hall looks at how our approaches and practices progress with time.

Community projects inform enterprise products

Lachlan Evenson and Bridget Kromhout discuss the journey to build Gatekeeper, a community-driven approach for enforcing policy on any Kubernetes cluster.

The cloud native elephant in the room

Bob Quillin outlines how the cloud native community can reduce complexity, be more inclusive to all teams, and create a more open, multicloud future.

Continue reading Highlights from the O'Reilly Velocity Conference in San Jose 2019.

Categories: Technology

From puzzles to products

O'Reilly Radar - Wed, 2019/06/12 - 13:00

Jessica Kerr argues that most programming careers aren’t about writing software; they’re about changing it.

Continue reading From puzzles to products.

Categories: Technology

Topic for the Security meeting on June 20th

PLUG - Wed, 2019/06/12 - 11:05

Aaron Jones: Introduction To Assembly - Part II

Description:
Introduction To Assembly Language is the second half of my reverse engineering course. This is part two of a multipart educational series on lower level software development and reverse engineering. Many developers and security experts are not familiar with assembly language or the lower levels of their hardware. This course is intended to begin familiarizing you with assembly language in order to build the foundation for future courses on reverse engineering.

Biography:
Aaron, the owner of Retro64XYZ, is a software developer who currently creates applications for law enforcement. He is also an AZ POST certified public speaker. He earned a B.Sc. in Computer Information Systems from Park University in 2013 and an M.A. in Intelligence Analysis with a focus in Cyber Security in 2014. During that period of his life he took a double course load and completed his master's with a 3.695 GPA in a year. He has been the recipient of recognition from the El Paso Police Department, State Of Texas, Texas Military Forces, Chandler Police Department, and others.

Aaron is also active in the community as the founder of the Phoenix Linux Users Group Cyber Security Meetup and regularly teaches members of the public a myriad of topics related to Cyber Security. His audience includes students, teachers, law enforcement, military, government officials, and concerned members of the public with a strong desire to learn what is going on in the world of technology.

When Aaron isn’t teaching, working, or spending time with his family, he enjoys relaxing at the pond with a fishing pole while not catching fish, operating a pistol at the shooting range, or reading books. He owns a Sega Saturn and a Sega Dreamcast and his favorite video games are Panzer Dragoon, Road Rash, Phantasy Star Online 2, and Power Stone. He is currently engrossed in building content for his site and looking for more ways to reach the public. You can reach Aaron through his Mastodon or on Keybase. He would love to hear from you, answer your questions, or find out about the projects you are involved with.

 

Topics for Jun 13th's meeting

PLUG - Wed, 2019/06/12 - 10:48
Bill Lindley: Control The World (or whatever you want) With Linux and a Raspberry Pi

Description:
We will see how a $35 Raspberry Pi (or even the $9 version), a breadboard and a bit of circuitry connected to the SPI and I2C buses can control pretty much anything: a thermostat; a "self driving" model car; or a clock-radio that also displays software build progress and your corporate network uptime status.

Biography:
Mr. Lindley has been in the computer industry since he sold his first program (a printer driver for Heathkit HDOS) in 1980. He has used systems from the earliest 8-bit microprocessors, through the PDP-11 and VAX, up to IBM mainframes, and has managed to write programs that did not crash on most of them. Mr. Lindley has been a GNU/Linux user since 1992 and has been free of proprietary software since 2001. Most recently he has been pleased to be an adjunct professor at Mesa Community College.


Rajendran Rathinasabapathy: In the crossroads of Agility, Microservices and Cloud computing

Description:
Having been a project manager for a while, evolving along with the IT landscape of agility, microservices, and cloud computing, I see a pattern in the painful ordeal teams go through, especially when collaboration is needed more than ever to build systems. The topic is very relevant when highly individualistic open source programmers work together. In this presentation, as the subject is vast, I lay out a list of critical blind spots and pitfalls that teams can avoid while designing and developing larger applications.

Biography:
I have worked in IT for the past 20 years, from being a developer to a portfolio manager. I am currently involved in strategizing CI/CD for a sub-domain in info security, with varied tech stacks. I am a proponent of microservices and also a scrum master. In my free time, I learn to make digital and analogue circuits. I teach programming and robotics to kids and I am learning Jiu-Jitsu :-)

Four short links: 12 June 2019

O'Reilly Radar - Wed, 2019/06/12 - 04:20

Serverless Microservice Patterns, Organizing Information, Internet Trends, and Fake Videos

  1. Serverless Microservice Patterns for AWS (Jeremy Daly) -- I’ve read a lot of posts that mention serverless microservices, but they often don’t go into much detail. I feel like that can leave people confused and make it harder for them to implement their own solutions. Since I work with serverless microservices all the time, I figured I’d compile a list of design patterns and how to implement them in AWS. I came up with 19 of them; though, I’m sure there are plenty more.
  2. Fans are Better Than Tech at Organizing Information Online (Wired) -- coverage of Archive Of Our Own (AO3), a fanfic archive which is nominated for a Hugo this year. AO3's trick is that it involves humans by design—around 350 volunteer tag wranglers in 2019, up from 160 people in 2012—who each spend a few hours a week deciding whether new tags should be treated as synonyms or subsets of existing tags, or simply left alone. AO3's Tag Wrangling Chairs estimate that the group is on track to wrangle over two million never-before-used tags in 2019, up from around 1.5 million in 2018.
  3. Mary Meeker's Internet Trends, 2019 Edition -- like April Fool's Day, it's a landmark in the industry, but fewer people look forward to it with glee these days. The big trends driving growth (Moore's Law, mobile sales, people connected to the internet) have slowed down. Internet ad spend is still rising, customer acquisition costs are going up, etc. Two eye-watering facts: Americans are spending 6.3h on digital media/day, up 7% from the year before, and people are increasingly communicating in images: more than 50% of Twitter impressions are of posts with media, which is startling for a medium that was originally SMS.
  4. Testing Facebook's Fake Video Policy (Vice) -- a fake video of Mark Zuckerberg was uploaded to test their policy. They're treating it like the earlier Pelosi video: Instead of deleting the video, the company chose to de-prioritize it, so that it appeared less frequently in users' feeds, and placed the video alongside third-party fact-checker information.

Continue reading Four short links: 12 June 2019.

Categories: Technology

Four short links: 11 June 2019

O'Reilly Radar - Tue, 2019/06/11 - 05:15

Premium Firefox, FPGAs for Graph Processing, Decision Framework, and The Online Experience of South Asian Women

  1. Possible Premium Firefox Coming (ZDNet) -- an interesting approach for Firefox, but I'd pay for something as good as Chrome that didn't have the mixed incentives for developers.
  2. Graph Processing on FPGAs: Taxonomy, Survey, Challenges -- Our survey describes and categorizes existing schemes and explains key ideas. Finally, we discuss research and engineering challenges to outline the future of graph computations on FPGAs.
  3. Decision Disagreement Framework: How We Encourage Disagreements at Matter -- we couldn’t find a framework for handling and supporting disagreements after decisions have been made, especially if you weren’t a part of making that decision. We took inspiration from existing frameworks to create the Decision Disagreement Framework.
  4. Understanding the Online Safety and Privacy Challenges Faced by South Asian Women -- This post, after providing a short background, covers the following topics: Device privacy challenges: This section outlines the privacy challenges faced by South Asian women when using their smartphones; Online safety challenges: Highlights the risks and abuse faced by South Asian women when using online services; Design considerations to promote gender equity: When building products, features that mitigate the risks would help to improve the safety of South Asian women. Ethnographic study that's super useful for systems designers who aren't South Asian women.

Continue reading Four short links: 11 June 2019.

Categories: Technology

AI adoption is being fueled by an improved tool ecosystem

O'Reilly Radar - Tue, 2019/06/11 - 04:00

We now are in the implementation phase for AI technologies.

In this post, I share slides and notes from a keynote that Roger Chen and I gave at the 2019 Artificial Intelligence conference in New York City. In this short summary, I highlight results from a survey (AI Adoption in the Enterprise) and describe recent trends in AI. Over the past decade, AI and machine learning (ML) have become extremely active research areas: the web site arxiv.org had an average daily upload of around 100 machine learning papers in 2018. With all the research that has been conducted over the past few years, it’s fair to say that we now have entered the implementation phase for many AI technologies. Companies are beginning to translate research results and developments into products and services.

An early indicator of commercial activity and interest is the number of patent filings. I was fortunate enough to contribute to a recent research report from the World Intellectual Property Organization (WIPO) that examined worldwide patent filings in areas pertaining to AI and machine learning. One of their key findings is that the number of patent filings is growing fast: in fact, the ratio of patent filings to scientific publications indicates that patent filings are growing at a faster rate than publications.

Figure 1. A 2019 WIPO Study indicates that AI patent filings are growing at a faster rate than scientific publications. Source: Ben Lorica.

Looking more closely into specific areas, the WIPO study found that Computer Vision is mentioned in 49% of all AI-related patents (167,000+). In addition, the number of computer vision patent filings is growing annually by an average of 24%, with more than 21,000 patent applications filed in 2016 alone.

Figure 2. A 2019 WIPO Study details AI patent filings by area. Source: Ben Lorica.
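
To put the computer vision figures in perspective, here is a rough back-of-the-envelope sketch. It uses only the rounded numbers quoted above (167,000 patents, a 49% share, 24% average annual growth). Because "167,000+" is a floor, the implied total is a lower bound, and the doubling time assumes the average growth rate simply continues, which is my assumption rather than a WIPO finding.

```python
import math

# Rounded figures quoted from the WIPO study in the text above.
cv_patents = 167_000      # AI-related patents that mention computer vision (a floor)
cv_share = 0.49           # share of all AI-related patents
annual_growth = 0.24      # average annual growth in computer vision filings

# If 167,000 patents are 49% of the total, the total is roughly:
implied_total = cv_patents / cv_share

# Assumption: growth stays at the average rate, so filings compound yearly.
doubling_years = math.log(2) / math.log(1 + annual_growth)

print(f"Implied AI-related patents: about {implied_total:,.0f}")
print(f"Doubling time at 24% a year: about {doubling_years:.1f} years")
```

In other words, at that pace annual computer vision filings would double in a little over three years.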

It has been an extremely productive year for researchers in natural language. Every few months there seem to be new deep learning models that establish records in many different natural language tasks and benchmarks.

Figure 3. Recent progress in natural language models. Source: Ben Lorica.

Much of this research was done in the open, accompanied by open source code and pre-trained models. While applications of AI and machine learning to text are not new, the accuracy of some of these models has drawn interest from practitioners and companies. Some of the most popular trainings, tutorials, and sessions at our AI conferences are ones that focus on text and natural language applications. It’s important to point out that, depending on your application or setting, you will likely need to tune these language models for your specific domain and application.

We continue to see improvements in tools for deep learning. Our surveys show that TensorFlow and PyTorch remain the most popular libraries. There are new open source tools like Ludwig and Analytics Zoo aimed at non-experts who want to begin using deep learning. We are also seeing tools from startups like Weights & Biases and Determined AI (full disclosure: I am an advisor to Determined AI), and open source tools like Nauta, designed specifically for companies with growing teams of deep learning engineers and data scientists. These tools optimize compute resources, automate various stages of model building, and help users keep track of and manage experiments.

In our survey that drew more than 1,300 respondents, 22% signaled they are beginning to use reinforcement learning (RL), a form of ML that has been associated with recent prominent examples of “self-learning” systems. There are a couple of reasons for this. We are beginning to see more accessible tools for RL—open source, proprietary, and SaaS—and more importantly, companies like Netflix are beginning to share use cases for RL. Focusing on tooling for RL, there have been a variety of new tools that have come online over the last year. For example, Danny Lange and his team at Unity have released a suite of tools that enable researchers and developers to “test new AI algorithms quickly and efficiently across a new generation of robotics, games, and beyond.”

Let’s look at another one of these tools more closely. At our AI conferences, we’ve been offering a tutorial on an open source computing framework called Ray, developed by a team at UC Berkeley’s RISE Lab.

Figure 4. Reinforcement Learning (using RLlib) is one of the popular use cases for Ray. Source: Ben Lorica.

As I noted in a previous post, Ray has grown across multiple fronts: number of users, contributors, and use cases. Ray’s support for both stateless and stateful computations, and its fine-grained control over scheduling, allow users to implement a variety of services and applications on top of it, including RL. RLlib, the RL library on top of Ray, provides a unified API for different types of RL training, and all of its algorithms are distributed. Thus, both RL users and RL researchers are already benefiting from using RLlib.
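
For readers unfamiliar with the distinction between stateless and stateful computations, here is a minimal sketch in plain Python. It deliberately does not use Ray's API; it is only an analogy built on the standard library. The names `square`, `RunningMean`, and `run_demo` are hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def square(x):
    """Stateless task: the result depends only on the input."""
    return x * x


class RunningMean:
    """Stateful worker: keeps a count and a total between calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0
        self._total = 0.0

    def update(self, value):
        # The lock keeps concurrent updates from corrupting the state.
        with self._lock:
            self._count += 1
            self._total += value

    def mean(self):
        with self._lock:
            return self._total / self._count


def run_demo():
    tracker = RunningMean()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Stateless work can be spread across workers in any order.
        squares = list(pool.map(square, range(5)))
        # Stateful work must all go through the object that holds the state.
        for future in [pool.submit(tracker.update, s) for s in squares]:
            future.result()
    return squares, tracker.mean()
```

A framework that supports both styles lets the stateless part scale out freely while the stateful part keeps its data in one place, which is the combination services such as RL training tend to need.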

There’s also exciting news on the hardware front. Last year we began tracking startups building specialized hardware for deep learning and AI for training and inference as well as for use in edge devices and in data centers. We already have specialized hardware for inference (and even training—TPUs on the Google Cloud Platform). Toward the latter part of this year, in the Q3/Q4 time frame, we expect more companies to begin releasing hardware that will greatly accelerate training and inference while being much more energy efficient. Given that we are in a highly empirical era for machine learning and AI, tools that can greatly accelerate training time while lowering costs will lead to many more experiments and potential breakthroughs.

In our survey, we found more than 60% of companies were planning to invest some of their IT budget into AI. But the level of investment depended on how much experience a company already had with AI technologies. As you can see in Figure 5, those with a mature practice plan to invest a sizable portion of their IT budget into AI. There’s a strong likelihood that the gap between AI leaders and laggards will further widen.

Figure 5. Planned investments in AI depend on the maturity level of a company. Source: Ben Lorica.

So, what is holding back adoption of AI? According to our survey, the answer depends on the maturity level of a company.

Figure 6. Key bottlenecks holding back adoption of AI. Source: Ben Lorica.

Those who are just getting started struggle with finding use cases or explaining the importance of AI. Also, we are far from General AI: we are at a stage where these technologies have to be tuned and targeted, and many AI systems work by augmenting domain experts. Thus, these technologies require training at all levels of an organization, not just in technical teams. It’s important that managers understand the capabilities and limitations of current AI technologies, and see how other companies are using AI. Take the case of robotic process automation (RPA), a hot topic among enterprises. It’s really the people closest to tasks (“bottom-up approach”) who can best identify areas where RPA is most suitable.

On the other hand, those with mature AI practices struggle with lack of data and lack of skilled people. Let’s look at the skills gap more closely in Figure 7.

Figure 7. Skills gap in AI. Source: Ben Lorica.

Skills requirements depend on the level of maturity as well. Companies with more mature AI practices have less trouble finding use cases and have a greater need for data scientists. However, the need for data and infrastructure engineers cuts across companies. It’s important to remember that much of AI today still requires large amounts of training data to train large models that require large amounts of compute resources. I recently wrote about the requisite foundational technologies needed to succeed in machine learning and AI.

As the use of AI technologies grows within companies, we will need better tools for machine learning model development, governance, and operations. We are beginning to see tools that can automate many stages of a machine learning pipeline, help manage the ML model development process, and search through the space of possible neural network architectures. Given the level of excitement around ML and AI, we expect tools in these areas to improve and gain widespread adoption.

Figure 8. Automation and model development tools are growing in popularity. Source: Ben Lorica.

With the growing interest in AI among companies, this is a great time to be building tools for ML. When we asked our survey respondents, “Which tools are you planning to incorporate into your ML workflows within the next 12 months?”, we found:

  • 48% wanted tools for model visualization
  • 43% needed tools for automated model search and hyperparameter tuning

Companies are realizing that ML and AI involve much more than optimizing a business or a statistical metric. Over the past year, I’ve tried to summarize some of these considerations under the umbrella of “risk management,” a term and practice area many companies are already familiar with. Researchers and companies are beginning to release tools and frameworks to explain various techniques they are using to develop “responsible AI.” When we asked our survey respondents, “What kinds of risks do you check for during ML model building and deployment?”, we found the following:

  • 45% assessed model interpretability and explainability
  • 41% indicated that they had tests for fairness and bias
  • 35% checked for privacy
  • 34% looked into safety and reliability issues
  • 27% tested for security vulnerabilities

A word about data security. In the age of AI, there are situations where data integrity will be just as critical as data security. That’s because AI systems are highly dependent on data used for training. Building data infrastructure that can keep track of data governance and lineage will be very important, not only for security and quality assurance audits, but also for compliance with existing and future regulations.

Figure 9. Data privacy and security in the age of AI. Source: Ben Lorica.

We are very much in the implementation phase for machine learning and AI. The past decade has produced a flurry of research results, and we are beginning to see a wide selection of accessible tools aimed at companies and developers. But we are still in the early stages of AI adoption, and much work remains in many areas on the tooling front. With that said, many startups, companies, and researchers are hard at work to improve the ecosystems of tools for ML and AI. Over the next 12 months, I expect to see a lot of progress in tools that can ease ML development, governance, and operations.

Continue reading AI adoption is being fueled by an improved tool ecosystem.

Categories: Technology

How to get started with site reliability engineering (SRE)

O'Reilly Radar - Mon, 2019/06/10 - 04:10

Google SRE Stephen Thorne shares best practices for starting an SRE team at your company.

Anyone can do site reliability engineering (SRE). Sure, Google pioneered the practice, but you don’t have to work for a tech giant to use SRE to increase reliability and improve system performance. At Google’s 2019 Cloud Next conference, I sat down with Stephen Thorne, site reliability engineer on Google’s customer reliability engineering team and co-author of The Site Reliability Workbook, to talk about how organizations, both large and small, can use SRE to reduce operational costs, improve reliability, and create productive cross-functional teams.

During the interview, we covered strategies for getting started with SRE, including how to get buy-in from the whole team, from management to operations. We also talked about potential hurdles to implementing SRE; why postmortems should always be blameless; what success looks like for an SRE team; and best practices for reducing toil, measuring reliability, moving to the cloud, and more.

Here are some highlights from our conversation:

Getting buy-in from management

When we're talking about how to get management buy-in, we see SRE as providing value. The value that SRE provides to a business might come in various different forms. It might be that you're currently having problems with reliability, with your operational load, your operational costs. There is something you need to do in order to be able to scale up and be more effective in your environment. SRE allows you to say, "Are we reliable enough? And if we're not, what are we going to do about it?"

Biggest roadblock to doing SRE

One of the things I see being a significant barrier is the psychological safety required in order to be confident working in production, and being responsible for production, and being responsible for engineering and production. At Google, we have this locked down. We've got the concept of blameless postmortems, but it's not just that. You have this confidence that if you're toiling too hard, you can go to your leadership and say, "Help." And your leadership will say, "Absolutely. That's a problem. We'll help you drive that down."

But in another organization, you might go to leadership and say, "Help, we have too much toil right now." They might say, "Okay, so you're going to work harder, aren't you?" I think one of the things we have at Google, which I would love to see in more organizations, is the implementation and the feeling of psychological safety. That if you have problems, it's not your fault that you have problems. You can go to leadership, you can go to your peers, you can go to your development teams, and you can say, "Let's work together to make this a better place for everyone."

Keeping the postmortem blameless

The reason you really want a blameless postmortem is because as soon as you blame a system, or a human, or a thing that happened, you stop looking for all of those other causes for what went wrong.

Signs of a successful SRE team

What you want to see from a successful site reliability engineering team is that they know how reliable their system is. They have a plan for how to improve it over time, or reduce their toil over time; they're delivering on that plan; and those deliverables are actually causing a change.

So, a successful SRE team is able to demonstrate the impact of the work they're doing. If you have an SRE team that was running a reliable service last year and running a reliable service this year, but can't tell you what projects they completed in that time that actually had a measurable impact, it's like, what are we doing here?
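
As a rough illustration of what "knowing how reliable your system is" and showing "a measurable impact" could look like in practice, here is a minimal sketch. Nothing in it comes from the interview: the request counts, the 99.9% target, and the function names are all hypothetical, and measuring availability as good requests over total requests is just one common choice.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully in a period."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_target(good_requests: int, total_requests: int,
                 target: float = 0.999) -> bool:
    """Compare measured availability with a (hypothetical) target."""
    return availability(good_requests, total_requests) >= target


# Hypothetical numbers for two consecutive quarters.
last_quarter = availability(997_200, 1_000_000)
this_quarter = availability(998_700, 1_000_000)

# A positive difference is one way to show that completed projects
# actually changed something, even if the target is not yet met.
improvement = this_quarter - last_quarter
```

With a measurement like this in hand, a team can answer both "are we reliable enough?" and "did our work move the number?" instead of relying on impressions.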

Is it possible to automate yourself out of a job?

If you find SREs who have actually managed to automate themselves out of a job, you have struck gold. Because they now know how to do this for other teams, they know how to scale up their work, and you should grab onto these people with both hands and say, "You are the best SREs we have right now. Help everyone else achieve your success."

This post is a part of a collaboration between O’Reilly and Google. See our statement of editorial independence.

Continue reading How to get started with site reliability engineering (SRE).

Categories: Technology

Maximizing paper clips

O'Reilly Radar - Mon, 2019/06/10 - 04:00

We won’t get the chance to worry about artificial general intelligence if we don’t deal with the problems we have in the present.

In What's the Future, Tim O’Reilly argues that our world is governed by automated systems that are out of our control. Alluding to The Terminator, he says we’re already in a “Skynet moment,” dominated by artificial intelligence that can no longer be governed by its “former masters.” The systems that control our lives optimize for the wrong things: they’re carefully tuned to maximize short-term economic gain rather than long-term prosperity. The "flash crash" of 2010 was an economic event created purely by the software that runs our financial systems going awry. However, the real danger of the Skynet moment isn’t what happens when the software fails, but when it is working properly: when it’s maximizing short-term shareholder value, without considering any other aspects of the world we live in. Even when our systems are working, they’re maximizing the wrong function.

Charlie Stross makes a similar point in “Dude you broke the future,” arguing that modern corporations are “paper clip maximizers.” He’s referring to Nick Bostrom’s thought experiment about what could go wrong with an artificial general intelligence (AGI). If told to maximize the process of making paper clips, it could decide that humans were inessential. It was told to make paper clips, lots of them, and nothing is going to stop it. Like O’Reilly, Stross says the process is already happening: we’re already living in a world of “paper clip maximizers.” Businesses maximize stock prices without regard for cost, whether that cost is human, environmental, or something else. That process of optimization is out of control—and may well make our planet uninhabitable long before we know how to build a paper clip-optimizing AI.

The paper clip maximizer is a provocative tool for thinking about the future of artificial intelligence and machine learning–though not for the reasons Bostrom thinks. As O’Reilly and Stross point out, paper clip maximization is already happening in our economic systems, which have evolved a kind of connectivity that lets them work without oversight. It’s already happening in our corporations, where short-term profit creates a world that is worse for everyone. Automated trading systems largely predate modern AI, though they have no doubt incorporated it. Business systems that optimize profit—well, they're old-fashioned human wetware, collected in conference rooms and communicating via the ad-hoc neural network of economic exchange.

What frustrates me about Bostrom’s paper clip maximizer is that focusing on problems we might face in some far-off future diverts attention from the problems we’re facing now. We don’t have–and may never have–an artificial general intelligence, or even a more limited artificial intelligence that will destroy the world by maximizing paper clips. As Andrew Ng has said, we're being asked to worry about overpopulation on Mars. We have more immediate problems to solve. What we do have are organizations that are already maximizing their own paper clips, and that aren’t intelligent by any standard. That’s a concrete problem we need to deal with now. Talking about future paper clips might be interesting or thrilling, but in reality, it’s a way of avoiding dealing with our present paper clips. As Stross points out, Elon Musk is one of the recent popularizers of paper clip anxiety; yet, he has already built his own maximizers for batteries and space flights. It’s much easier to wax philosophical about a hypothetical problem than to deal with a planet that is gradually overheating. It’s a lot more fun, and a lot less threatening, to think about the dangers of a hypothetical future AI than to think about the economic, political, sociological, and environmental problems that face us now—even if those two sets of problems are really the same.

The argument that Stross and O’Reilly make is central to how we think about AI ethics—and not just AI ethics, but business ethics. I’m not terribly concerned about the things that could go wrong with an artificial general intelligence, at least in part because we won’t get the chance to worry about AGI if we don’t deal with the problems we have in the present. And if we do deal with the problems facing us now, Tim O’Reilly’s Skynet moment and Stross’s present-day paper clip maximizers, we will inevitably develop the tools we need to think about and manage the future’s paper clip maximizers. Getting our present systems back under control and contributing to human welfare is the only way to learn how to keep our future systems, whatever they might be, working for our collective good.

I can think of no better way to prepare for the future’s problems than to solve the present’s.

Continue reading Maximizing paper clips.

Categories: Technology

Four short links: 10 June 2019

O'Reilly Radar - Mon, 2019/06/10 - 03:55

Remote Code Development, PWA Builder, Why Platforms Fail, and Designing Rituals

  1. Visual Studio Code Remote Development May Change Everything (Scott Hanselman) -- Visual Studio Code Remote Development allows you to use a container, remote machine, or the Windows Subsystem for Linux (WSL) as a full-featured development environment. It effectively splits VS Code in half and runs the client part on your machine and the "VS Code Server" basically anywhere else. [...] As I mentioned, you can run within WSL, containers, or over SSH. It's early days, but it's extraordinarily clean. I'm really looking forward to seeing how far and effortless this style of development can go. There's so much less yak shaving! It effectively removes the whole setup part of your coding experience and you get right to it.
  2. PWA Universal Builder -- scaffolding for Progressive Web Apps with your choice of frameworks, get optimizations and presets for free.
  3. A Study of More Than 250 Platforms Reveals Why Most Fail (HBR) -- We grouped the most common mistakes into four categories: (1) mispricing on one side of the market, (2) failure to develop trust with users and partners, (3) prematurely dismissing the competition, and (4) entering too late. As always, the four categories aren't significant—how do you go broke? You run out of money by failing to keep enough of it, or by never getting enough users to have enough money in the first place. The individual tales are where juicy stories and interesting thoughts form.
  4. Friday Wins and a Case Study in Ritual Design (Kellan Elliott-McCrea) -- A standard piece of software development practice that many teams let lapse, or merely let lapse into being sub-optimal, is “Friday wins,” sometimes called sprint demos or sprint reviews. But you can take what can be a flaccid and repetitive meeting and make it a valuable ritual by grounding it in values.

Categories: Technology

Four short links: 7 June 2019

O'Reilly Radar - Fri, 2019/06/07 - 04:30

Energy of Deep Learning, Open Source Game Clones, Better Batteries, and Video Magic

  1. Energy and Policy Considerations for Deep Learning in NLP -- training Transformer NLP model w/ neural architecture search is 626,155 lbs of CO2. Compare to Car, avg incl. fuel, 1 lifetime: 126,000 lbs. (via MIT TR)
  2. Open Source Game Clones -- This site tries to gather open source remakes of great old games in one place.
  3. A Glass Battery That Keeps Getting Better (IEEE Spectrum) -- grunty batteries without the fire would be a great thing, indeed, never mind one that got better. Goodenough and collaborators claimed they’d developed a non-flammable lithium battery (whose electrolyte was based on a glass powder) that had twice the energy density of traditional lithium-ion batteries. They also published a graph that showed an increase in capacity over more than 300 charge-discharge cycles. (This increase, however, pales in comparison to the cell's at least 23,000-cycle lifespan.)
  4. Text-Based Editing -- We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e., no jump cuts). You edit the script and the software edits the video so the person says what's in the script. (via Stanford)

Categories: Technology

Bringing scalable real-time analytics to the enterprise

O'Reilly Radar - Thu, 2019/06/06 - 06:45

The O’Reilly Data Show Podcast: Dhruba Borthakur and Shruti Bhat on enabling interactive analytics and data applications against live data.

In this episode of the Data Show, I spoke with Dhruba Borthakur (co-founder and CTO) and Shruti Bhat (SVP of Product) of Rockset, a startup focused on building solutions for interactive data science and live applications. Borthakur was the founding engineer of HDFS and creator of RocksDB, while Bhat is an experienced product and marketing executive focused on enterprise software and data products. Their new startup is focused on a few trends I’ve recently been thinking about, including the re-emergence of real-time analytics, and the hunger for simpler data architectures and tools. Borthakur exemplifies the need for companies to continually evaluate new technologies: while he was the founding engineer for HDFS, these days he mostly works with object stores like S3.

Categories: Technology

Four short links: 6 June 2019

O'Reilly Radar - Thu, 2019/06/06 - 04:20

Software Engineering for Machine Learning, Generalizations in Learning, Computer Dance, and Firefighting in Product Development

  1. Software Engineering for Machine Learning (Microsoft Research) -- We collected some best practices from Microsoft teams to address [several essential engineering challenges that organizations may face in creating large-scale AI solutions for the marketplace]. In addition, we have identified three aspects of the AI domain that make it fundamentally different from prior software application domains: 1) discovering, managing, and versioning the data needed for machine learning applications is much more complex and difficult than other types of software engineering, 2) model customization and model reuse require very different skills than are typically found in software teams, and 3) AI components are more difficult to handle as distinct modules than traditional software components—models may be “entangled” in complex ways and experience non-monotonic error behavior.
  2. Open Long-Tailed Recognition (Berkeley) -- A practical system shall be able to classify among a few common and many rare categories, to generalize the concept of a single category from only a few known instances, and to acknowledge novelty upon an instance of a never seen category. We define OLTR as learning from long-tail and open-end distributed data and evaluating the classification accuracy over a balanced test set which includes head, tail, and open classes in a continuous spectrum.
  3. Hype Cycle: Machine Learning (Vimeo) -- dance being changed by computers.
  4. Past the Tipping Point: The Persistence of Firefighting in Product Development -- In this paper, we try to answer three questions: (1) why does firefighting exist, (2) why does firefighting persist, and (3) what can managers do about it? The most important result of our studies is that product development systems have a tipping point. In models of infectious diseases, the tipping point represents the threshold of infectivity and susceptibility beyond which a disease becomes an epidemic. Similarly, in product development systems there exists a threshold for problem-solving activity that, when crossed, causes firefighting to spread rapidly from a few isolated projects to the entire development system. Our analysis also shows that the location of the tipping point, and therefore the susceptibility of the system to the firefighting phenomenon, is determined by resource utilization in steady state.

Categories: Technology

219+ live online training courses opened for June and July

O'Reilly Radar - Wed, 2019/06/05 - 04:10

Get hands-on training in Docker, microservices, cloud native, Python, machine learning, and many other topics.

Learn new topics and refine your skills with more than 219 new live online training courses we opened up for June and July on the O'Reilly online learning platform.

AI and machine learning

AI-driven Future State Cloud Operations, June 7

Deep Learning with PyTorch, June 20

Deep Learning from Scratch, July 2

Introduction to Reinforcement Learning, July 8

Fundamentals of Machine Learning and Data Analytics, July 10-11

Essential Machine Learning and Exploratory Data Analysis with Python and Jupyter Notebook, July 11-12

Artificial Intelligence: An Overview of AI and Machine Learning, July 15

Real-Time Streaming Analytics and Algorithms for AI Applications, July 17

Hands-on Machine Learning with Python: Classification and Regression, July 17

Hands-on Machine Learning with Python: Clustering, Dimension Reduction, and Time Series Analysis, July 18

Deep Reinforcement Learning, July 18

Deep Learning for Natural Language Processing, July 25

Getting Started with Machine Learning, July 29

Inside Unsupervised Learning: Anomaly Detection using Dimensionality Reduction, August 6

Deploying Machine Learning Models to Production: A Toolkit for Real-World Success, August 7-8

Hands-on Adversarial Machine Learning, August 13

Inside Unsupervised Learning: Group Segmentation Using Clustering, August 13

Reinforcement Learning: Building Recommender Systems, August 16

Blockchain

Business Applications of Blockchain, July 17

Certified Blockchain Solutions Architect (CBSA) Certification Crash Course, July 25

Business

Ken Blanchard on Leading at a Higher Level: 4 Keys to Creating a High Performing Organization, June 13

Engineering Mentorship, June 24

Spotlight on Learning From Failure: Hiring Engineers with Jeff Potter, June 25

60 Minutes to a Better Prototype, June 25

Being a Successful Team Member, July 1

Spotlight on Data: Improving Uber’s Customer Support with Natural Language Processing and Deep Learning with Piero Molino, July 2

Getting S.M.A.R.T about Goals, July 9

Building the Courage to Take Risks, July 9

Spotlight on Innovation: Making Things Happen with Scott Berkun, July 10

Thinking Like a Manager, July 10

Better Business Writing, July 15

Spotlight on Data: Data Storytelling with Mico Yuk, July 15

Why Smart Leaders Fail, July 16

Product Management for Enterprise Software, July 18

Introduction to Critical Thinking, July 23

Negotiation Fundamentals, July 23

Spotlight on Learning from Failure: Corporate Disinformation and the Changing Face of Attacks with Renee DiResta and Robert Matney, July 23

Having Difficult Conversations, July 25

Giving a Powerful Presentation, July 25

The Power of Lean in Software Projects, July 25

Managing a Toxic Work Environment, July 25

Leadership Communication Skills for Managers, July 29

Emotional Intelligence in the Workplace, July 30

90 Minutes to Better Decision-Making, July 30

Performance Goals for Growth, July 31

Adaptive Project Management, July 31

Spotlight on Cloud: Mitigating Cloud Complexity to Ensure Your Organization Thrives with David Linthicum, August 1

How to Be a Better Mentor, August 5

Fundamentals of Learning: Learn Faster and Better Using Neuroscience, August 6

Introduction to Strategic Thinking Skills, August 6

Foundations of Microsoft Excel, August 6

Succeeding with Project Management, August 8

How to Give Great Presentations, August 13

60 minutes to Better User Stories and Backlog Management, August 13

Building Your LinkedIn Network, August 13

Understanding Business Strategy, August 14

Data science and data tools

Text Analysis for Business Analytics with Python, June 12

Business Data Analytics Using Python, June 25

Debugging Data Science, June 26

Programming with Data: Advanced Python and Pandas, July 9

Understanding Data Science Algorithms in R: Regression, July 12

Time Series Forecasting, July 15

Cleaning Data at Scale, July 15

Scalable Data Science with Apache Hadoop and Spark, July 16

Effective Data Center Design Techniques: Data Center Topologies and Control Planes, July 19

First Steps in Data Analysis, July 22

Inferential Statistics using R, July 24

Foundational Python for Data Science, July 24

Intermediate SQL for Data Analysis, July 29

Introduction to Pandas: Data Munging with Python, July 29-30

Data Analysis Paradigms in the Tidyverse, July 30

Intro to Mathematical Optimization, August 6

Getting Started with PySpark, August 8

Text Analysis for Business Analytics with Python, August 12

Real-time Data Foundations: Kafka, August 13

Introduction to Statistics for Data Analysis with Python, August 14

Understanding Data Science Algorithms in R: Scaling, Normalization and Clustering, August 14

Real-time Data Foundations: Spark, August 15

Visualization and Presentation of Data, August 15

Python Data Science Full Throttle with Paul Deitel: Introductory AI, Big Data and Cloud Case Studies, September 24

Design and product management

Introduction to UI & UX design, June 24

Programming

Discovering Modern Java, June 7

Design Patterns in Java, June 13-14

Java Testing with Mockito and the Hamcrest Matchers, June 19

Scaling Python with Generators, June 25

Pythonic Object-Oriented Programming, June 26

Python Advanced: Generators and Coroutines, June 26

Pythonic design patterns, June 27

Advanced Test-Driven Development (TDD), June 27

Test-Driven Development In Python, June 28

Learning Python 3 by Example, July 1

Getting Started with Spring and Spring Boot, July 2-3

Java 8 Generics in 3 Hours, July 5

Secure JavaScript with Node.js, July 10

Learn the Basics of Scala in 3 Hours, July 15

Quantitative Trading with Python, July 15

Advanced React.js, July 16

Next-generation Java Testing with JUnit 5, July 16

Java Full Throttle with Paul Deitel: A One-Day, Code-Intensive Java, July 16

Modern JavaScript, July 17

Mastering the Basics of Relational SQL Querying, July 17-18

Getting Started with Python 3, July 17-18

Building Applications with Apache Cassandra, July 19

Scala Fundamentals: From Core Concepts to Real Code in 5 Hours, July 19

Clean Code, July 23

Introduction to Python Programming, July 23

TypeScript Fundamentals, July 24

Rust Programming: A Crash Course, July 29

Python Data Science Full Throttle with Paul Deitel: Introductory AI, Big Data and Cloud Case Studies, July 30

Beyond Python Scripts: Logging, Modules, and Dependency Management, July 30

Advanced JavaScript, July 30

Beyond Python Scripts: Exceptions, Error Handling and Command-Line Interfaces, July 31

Introduction to TypeScript Programming, August 5

Getting Started with Python 3, August 5-6

Mastering Pandas, August 7

Advanced TypeScript Programming, August 13

Getting Started with React.js, August 14

SQL Fundamentals for Data, August 14-15

Testing Vue.js Applications, August 15

Getting Started with Python 3, August 15-16

Modern Java Exception Handling, August 22

Python: The Next Level, August 1-2

Security

Kubernetes Security, June 10

Defensive Cybersecurity Fundamentals, June 17

Understanding the Social Forces Affecting Cyberattackers, June 28

Ethical Hacking Bootcamp with Hands-on Labs, July 1-3

Cyber Security Defense, July 2

Getting Started with Cyber Investigations and Digital Forensics, July 8

Start Your Security Certification Career Today, July 11

Certified Ethical Hacker (CEH) Crash Course, July 11-12

AWS Security Fundamentals, July 15

Introduction to Encryption, July 16

CISSP Crash Course, July 17-18

CISSP Certification Practice Questions and Exam Strategies, July 18

Linux, Python, and Bash Scripting for Cybersecurity Professionals, July 19

Cyber Security Fundamentals, July 25-26

AWS Certified Security - Specialty Crash Course, July 25-26

Understanding the Social Forces Affecting Cyberattackers, August 5

CCNA Cyber Ops SECFND 210-250, August 13

CCNA Cyber Ops SECOPS 210-255, August 15

Systems engineering and operations

AWS Access Management, June 6

Google Cloud Platform – Professional Cloud Developer Crash Course, June 6-7

React Hooks in Action, June 14

Running MySQL on Kubernetes, June 19

CompTIA A+ Core 1 (220-1001) Certification Crash Course, June 19-20

Introducing Infrastructure as Code with Terraform, June 20

How Routers Really Work: Network Operating Systems and Packet Switching, June 21

Creating React Applications with GraphQL, June 24

Getting Started with Google Cloud Platform, June 24

AWS Certified Big Data - Specialty Crash Course, June 26-27

Building APIs with Django REST Framework, June 28

Hands-on Arista Networking Foundational Routing Topics: Learning Arista Networking Through Lab Exercises, June 28

Azure Architecture: Best Practices, June 28

Learn Linux in 3 Hours, July 1

Managing Containers on Linux, July 1

Getting Started with Amazon SageMaker on AWS, July 1

Ansible in 4 Hours, July 2

Automating with Ansible, July 2

Kubernetes in 4 Hours, July 3

Getting Started with OpenShift, July 5

Amazon Web Services (AWS) Security Crash Course, July 8

Microservices Architecture and Design, July 8-9

AWS Machine Learning Specialty Certification Crash Course, July 8-9

AWS Certified Solutions Architect Associate Crash Course, July 8-9

Google Cloud Platform Security Fundamentals, July 9

CCNA Routing and Switching 200-125 Crash Course, July 9, 11, 16, 18

Exam AZ-300: Microsoft Azure Architect Technologies Crash Course, July 11-12

IBM Blockchain Platform as a Service, July 11-12

Google Cloud Certified Associate Cloud Engineer Crash Course, July 15-16

Getting Started with Amazon Web Services (AWS), July 15-16

AWS for Mobile App Developers, July 16

9 Steps to Awesome with Kubernetes, July 16

Getting Started with Cloud Computing, July 16

Google Cloud Platform (GCP) for AWS Professionals, July 17

AWS Certified SysOps Administrator (Associate) Crash Course, July 17-18

Software Architecture Foundations: Characteristics and Tradeoffs, July 18

AWS Managed Services, July 18-19

Building Micro-frontends, July 22

Linux Performance Optimization, July 22

Linux Under the Hood, July 22

Practical Linux Command Line for Data Engineers and Analysts, July 22

Introduction to Kubernetes, July 22-23

Introduction to Docker images, July 23

Analyzing Software Architecture, July 23

Domain-driven design and event-driven microservices, July 23-24

Building a Cloud Roadmap, July 24

Software Architecture by Example, July 24

Introduction to Docker CI/CD, July 24

Automating Architectural Governance Using Fitness Functions, July 25

Exam MS-100: Microsoft 365 Identity and Services Crash Course, July 25-26

Linux Foundation System Administrator (LFCS) Crash Course, July 25-26

Architecture for Continuous Delivery, July 29

Introduction to Docker Containers, July 30

Implementing Evolutionary Architectures, July 30-31

Docker for JVM Projects, July 31

Getting Started with Continuous Delivery (CD), August 1

Implementing and Troubleshooting TCP/IP, August 5

Developing Incremental Architecture, August 5-6

Microservice Decomposition Patterns, August 6

From Developer to Software Architect, August 6-7

Systems Design for Site Reliability Engineers, August 7

Building and Managing Kubernetes Applications, August 7

Designing Serverless Architecture with AWS Lambda, August 7-8

Docker: Beyond the Basics (CI & CD), August 7-8

Introduction to Istio, August 8

Microservice Fundamentals, August 13

Getting Started with Google Cloud Platform, August 13

Microservices Caching Strategies, August 14

Practical Docker, August 14

Amazon Web Services (AWS) Technical Essentials, August 14

Kubernetes Security, August 14

AWS Design Fundamentals, August 15-16

Software Architecture by Example, August 16

Structural Design Patterns with Spring, August 20

Resilience and Fast Reroute in Computer Networks: Tools and Techniques to Optimize Network Performance, August 23

Categories: Technology

Four short links: 5 June 2019

O'Reilly Radar - Wed, 2019/06/05 - 04:00

Open Source, 3D Printer Wear, Multicore TCP, and Super-Resolution Images

  1. What's Driving Open Source Software in 2019 -- Results from our ranking of proposal phrases show the centrality of data to the open source community: “data” (the No. 5 term) outpacing “code” (the No. 14 term), the rise in AI/ML topics, and in the nascent cloud native paradigm where monitoring and analytics assume critical importance—highlighting the demand for skills in analytics, data acquisition, etc.
  2. Investigating 3D Printer Nozzle Wear (YouTube) -- great video with cross-sections of worn nozzles and discussion of different materials.
  3. mTCP -- high-performance user-level TCP stack for multicore systems. Scaling the performance of short TCP connections is fundamentally challenging due to inefficiencies in the kernel. mTCP addresses these inefficiencies from the ground up—from packet I/O and TCP connection management all the way to the application interface. Open source (modified BSD license).
  4. Handheld Multi-Frame Super-Resolution -- In this paper, we supplant the use of traditional demosaicing in single-frame and burst photography pipelines with a multi-frame super-resolution algorithm that creates a complete RGB image directly from a burst of CFA raw images. [...] Our algorithm is robust to challenging scene conditions: local motion, occlusion, or scene changes. It runs at 100 milliseconds per 12-megapixel RAW input burst frame on mass-produced mobile phones. Specifically, the algorithm is the basis of the Super-Res Zoom feature, as well as the default merge method in Night Sight mode (whether zooming or not) on Google’s flagship phone.

Categories: Technology

Four short links: 4 June 2019

O'Reilly Radar - Tue, 2019/06/04 - 03:30

Paper vs. Implementation, Recommendations Run Amok, Copyright Law, and Engineering Management

  1. Everything You Know About word2vec Is Wrong -- The original word2vec C implementation does not do what's explained above, and is drastically different.
  2. On YouTube’s Digital Playground, an Open Gate for Pedophiles (NYT) -- YT's recommendation algorithm suggested home movies of families' kids to users who watched other videos of prepubescent, partially clothed children. (via BoingBoing)
  3. Canada's Review of Copyright Law (BoingBoing) -- sane proposals for fair dealing, safe harbour, TPMs, and lengthening copyright term.
  4. How to Size and Assess Teams From an Eng Lead at Stripe, Uber, and Digg -- In this exclusive interview, Larson digs into two critical components of organization design. Specifically, he shares his system for gauging the size and state of engineering teams—in not only a highly efficient and effective way, but also with a deeply empathetic and ethical approach.

Categories: Technology

What’s driving open source software in 2019

O'Reilly Radar - Tue, 2019/06/04 - 03:00

Cloud native, AI/ML, and data tools and topics are areas of emphasis for the O’Reilly Open Source Software Conference.

Virtually every impactful socio-technical transformation of the last 20 years—Web 2.0, DevOps, cloud, big data, artificial intelligence (AI), and machine learning (ML)—is encoded in the record of speaker proposals from the O’Reilly Open Source Software Conference (OSCON). This record doesn’t merely reflect the salience of these and other trends, it anticipates this salience, sometimes by several years.

Our recent analysis of speaker proposals from the 2017-2019 editions of OSCON[1] yielded several intriguing findings:

Continue reading What’s driving open source software in 2019.

Categories: Technology

Reduce toil through better alerting

O'Reilly Radar - Mon, 2019/06/03 - 04:00

How SREs can use a hierarchy for mature alerts.

SRE best practices at Google advocate for building alerts based upon meaningful service-level objectives (SLOs) and service-level indicators (SLIs). In addition to an SRE book chapter, other site reliability engineers at Google have written on the topic of alerting philosophy. However, the nuances of how to structure well-reasoned alerting are varied and contentious. For example, traditional "wisdom" argues that cause-based alerts are bad, while symptom-based or SLO-based alerts are good.

Navigating the dichotomy of symptom-based and cause-based alerting adds undue toil to the process of writing alerts: rather than focusing on writing a meaningful alert that addresses a need for running the system, the dichotomy brings anxiety around deciding whether an alert condition falls on the “correct” side of this dichotomy.

Instead, consider approaching alerting as a hierarchy of the alerts available to you: reactive, symptom-based alerts—typically based on your SLOs—form the foundation of this hierarchy. As systems mature and achieve higher availability targets, other types of alerts can add to your system's overall reliability without adding excessive toil. Using this approach, you can identify value in different types of alerts, while aiming for a comprehensive alerting setup.

As detailed below, by analyzing their existing alerts and organizing them according to a hierarchy, then iterating as appropriate, service owners can improve the reliability of their systems and reduce the toil and overhead associated with traditional cause-based and investigative alerts.

Basic alerting concepts

This proposed alerting maturity hierarchy builds upon a few basic definitions of and assumptions about alerting.

What is an alert?

For the purposes of this discussion, we consider an alert to be a page, ticket, Slack notification, or automated email that's generated from the system. For example, a user escalation would not be considered an alert.

Why do we alert?

An alert calls for human action. The system can no longer handle a situation on its own and requires the creative mind of a human. This criterion also covers situations in which the system could handle the situation on its own but still needs to involve a human (for example, because unsupervised automatic resolution is too dangerous).

This also means that an alert might be benign. The system alerted a human because the system couldn’t determine what to do next. But a human might determine that everything is in order and no action should be taken. If we could automatically make that determination, we wouldn’t need the alert.

What is a good alert?

The evaluation of whether an alert is "good" is inherently subjective: an alert is good if the human recipient finds it good, and every other evaluation criterion is a proxy. That being said, there are some fairly common proxies:

  • Timeliness: Did the alert arrive on time, rather than too long after the event to be useful?
  • Correct delivery: Was the alert sent to the correct human, or was it delivered to an adjacent, but incorrect team?
  • Documentation: Was the alert content relevant to the issue a human needs to evaluate, or did it contain a generic and unhelpful description?
  • Actionability: Did the alert prompt any action (other than simply acknowledging the alert)?

Proposed hierarchy

Now that we know why we alert and have a working definition of a good alert, we propose a pyramid-shaped hierarchy of alerting maturity. This pyramid is evocative of Maslow's hierarchy of needs and the service reliability hierarchy posed in the SRE book. In the proposed hierarchy shown in Figure 1, we reason about alerts in three categories: reactive alerts, proactive alerts, and investigative alerts. You should only implement proactive alerts once you have a solid foundation of reactive alerts, and you can't think about investigative alerts until both your reactive and proactive alerts are in a good state.

Figure 1. Alerting maturity hierarchy. Source: Google.

While thinking about alerts in terms of cause-based versus symptom-based is somewhat orthogonal to this hierarchy, it may be useful to think of symptom-based alerts as falling under the reactive alerts category and cause-based alerts as falling under the proactive and investigative alerts categories.

Types of alerts

You can use the descriptions of alert types and example scenarios below to reason about corresponding sets of alerts for your service.

Reactive alerts

Your business goals are currently in immediate danger. SLO-based alerting is a great example of reactive alerting.

Example scenario: Alert when you're in danger of breaching an SLO. A good reactive alert fires when (to the best of our knowledge) there is an imminent danger of an SLO breach. This allows the human recipient to react to the danger before all is lost.

If you haven't set up reactive alerts for your service, it's likely that most of your uptime-affecting incidents result from issues not captured in SLO alerts. After establishing a baseline of reactive alerts, your service should alert a human before the service exceeds its error budget. If an alert only fires when you have already violated your SLO, it's too late to take action.

Note that “to be in danger of” does not mean the anticipated event will definitely happen. For example, the SLO burndown can recover on its own. It only means that with information available at the point of the alert, there’s a reasonable expectation the SLO will be violated if nobody takes action.

Of course, crafting a comprehensive set of reactive alerts is not straightforward. Not every situation that could endanger your SLO is worth alerting on.
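
To make the "in danger of breaching an SLO" condition concrete, here is a minimal Python sketch of one possible check. It is not taken from the article: the function names, the burn-rate formulation, and the 24-hour default horizon are all illustrative assumptions. It estimates how long the remaining error budget would last if the recently observed error rate continued, and fires when that time drops below a chosen horizon.

```python
def hours_until_budget_exhausted(slo_target, window_hours,
                                 budget_remaining, recent_error_rate):
    """Estimate hours until the error budget runs out at the current burn rate.

    slo_target:        required success ratio, e.g. 0.999 (hypothetical input)
    window_hours:      length of the SLO window, e.g. 720 for 30 days
    budget_remaining:  unspent fraction of the error budget (1.0 = untouched)
    recent_error_rate: error ratio observed over a recent short period
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_error_rate = 1.0 - slo_target
    if recent_error_rate <= 0:
        return float("inf")  # nothing is burning budget right now
    # A burn rate of 1.0 spends exactly the whole budget over one full window.
    burn_rate = recent_error_rate / allowed_error_rate
    return budget_remaining * window_hours / burn_rate


def should_fire_reactive_alert(slo_target, window_hours, budget_remaining,
                               recent_error_rate, horizon_hours=24.0):
    """Fire while there is still time to react, not after the SLO is breached."""
    hours_left = hours_until_budget_exhausted(
        slo_target, window_hours, budget_remaining, recent_error_rate)
    return hours_left <= horizon_hours
```

With a 99.9% SLO over a 30-day window, half the budget left, and a recent error rate of 1%, this estimate comes out at roughly 36 hours. As the text notes, the burndown may still recover on its own; the check only expresses a reasonable expectation given the information available.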

Proactive alerts

Your business goals are probably going to be in danger soon but aren't in immediate danger—you're on trend to hit problems. You aren’t burning error budget in a way that strongly indicates you’re in danger of breaching your SLO. Quota alerts are a good example of proactive alerts.

Example scenario: If you use a storage system that rejects write requests when you hit 100%, configure your alerting to warn you a week before you hit 100%, at which point you should proactively respond to the situation. For example, you could set up a garbage collection job to clear out extra quota usage. That way, your storage system never fills up completely and starts rejecting writes.

In this scenario and similar ones, a reactive alert might complement your proactive alert. For example, if you're in danger of hitting the 100% quota mark, the system would also trigger a reactive alert that your system is in danger of breaching its SLO.
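
The storage quota scenario above can be sketched as a simple forecast. This Python example is an illustration under stated assumptions, not the article's method: it assumes usage grows roughly linearly, takes hypothetical `(day, used_fraction)` samples, and uses a seven-day lead time to match the "a week before you hit 100%" example.

```python
def days_until_full(samples, capacity=1.0):
    """Extrapolate linearly from the first and last usage samples.

    samples: list of (day, used_fraction) tuples ordered by day (hypothetical
    input format). Returns the estimated days until usage reaches capacity.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day <= first_day:
        raise ValueError("need at least two samples at increasing times")
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return float("inf")  # usage is flat or shrinking
    return (capacity - last_used) / growth_per_day


def should_fire_proactive_alert(samples, lead_days=7.0):
    """Warn once the projected fill date is within the lead time."""
    return days_until_full(samples) <= lead_days
```

A real system would likely need a more robust trend estimate (for example, a fit over many samples), since a two-point slope is sensitive to short-lived spikes.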

Investigative alerts

Your system is in some sort of unusual state, which may contribute to a partial failure now and/or a threat to a business goal later. For example, a server binary is restarting occasionally without a known cause, but no traffic is being affected because of redundancy or retries. Investigative alerts are particularly useful when the problem could be a sudden shift in user behavior. More often, investigative alerts indicate a service failure that is out of scope for any SLOs.

Examples:

  • Fluctuations in the number of users loading photos on a photo sharing app
  • Sudden changes in usage patterns, such as a dramatic increase in 404 responses at the expense of all other response codes

Example scenario: Before developing investigative alerts, the system fails and sends an alert stating “the compute jobs are crash-looping.” After writing investigative alerts, the system more clearly states “the rate of 500 HTTP errors in the front end is growing dramatically and the jobs are crash-looping,” providing more footing to investigate the symptoms of the incident.

Recall that alerts necessarily involve human judgement and uncertainty. The alert fires because the system does not know what step to take next.
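
One of the usage-pattern examples above, a jump in 404 responses at the expense of other response codes, could be detected along these lines. This Python sketch is only an illustration: the function names and the 10-percentage-point threshold are assumptions, and it compares the share of one status code in a current sample against a baseline sample.

```python
from collections import Counter


def status_share(status_codes, code):
    """Fraction of responses in the sample that carry the given status code."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    return counts[code] / total if total else 0.0


def is_unusual_shift(baseline_codes, current_codes, code=404, max_shift=0.10):
    """Flag when the share of `code` moved by more than `max_shift`.

    max_shift is a hypothetical threshold; a real service would tune it
    against its own normal variation.
    """
    baseline = status_share(baseline_codes, code)
    current = status_share(current_codes, code)
    return abs(current - baseline) > max_shift
```

Because such a signal says only that something is unusual, not that a business goal is in danger, it fits the investigative level: a human still has to judge whether the shift matters.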

How reactive, proactive, and investigative alerts work together

Think of these three alert types as a hierarchy. You should only move to higher levels of the pyramid if the foundations below are solid and not causing undue toil.

These alerts don’t need to be received by the same group of people. For example, in SRE-managed products, due to the reactive nature of oncall, it's clear that SREs should care about reactive alerts when they directly correspond to business needs. However, you might route proactive and investigative alerts across SRE and product development teams more liberally.

Not all alerts merit the same level of urgency or consistency. For example, to ensure a timely response, you may only want to fire investigative alerts during business hours.
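
The routing ideas above can be sketched as a small function. Everything specific here is a hypothetical choice made for illustration: the team names, the channel names, and the definition of business hours (Monday to Friday, 09:00 to 17:00 local time).

```python
from datetime import datetime
from enum import Enum


class AlertType(Enum):
    REACTIVE = "reactive"
    PROACTIVE = "proactive"
    INVESTIGATIVE = "investigative"


def in_business_hours(now):
    """Assumed business hours: Monday to Friday, 09:00 to 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def route(alert_type, now):
    """Return a (team, channel) pair for an alert raised at time `now`."""
    if alert_type is AlertType.REACTIVE:
        # Business goals are in immediate danger: page on-call at any hour.
        return ("sre-oncall", "page")
    if alert_type is AlertType.PROACTIVE:
        return ("service-owners", "ticket")
    # Investigative alerts are delivered only when someone can respond promptly.
    if in_business_hours(now):
        return ("product-dev", "ticket")
    return ("product-dev", "hold-until-business-hours")
```

For example, `route(AlertType.INVESTIGATIVE, datetime(2019, 6, 8, 10, 0))` falls on a Saturday, so the alert is held rather than delivered.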

Bad setups and their remedies

Each type of alert serves a purpose, but as with any powerful tool, there is a risk of misusing them. Below are some common anti-patterns and their remedies using a maturity hierarchy-based approach.

Only investigative alerts

In this scenario, investigative alerts account for most or all of the primary service alerting. This anti-pattern has led to much of the polarized language characterizing cause-based alerting as bad. As shown in Figure 2, these alerts aren't supported by a foundation of reactive and proactive alerts.

Figure 2. Only investigative alerts. Source: Google.

Symptoms:

  • Your service does not have SLOs/SLIs, or you do not have alerting on SLOs/SLIs.
  • Even if your service has SLOs/SLIs, the majority of incidents are discovered by alerts that are not SLO-based.
  • The majority of the alerts considered “investigative” are given a paging priority, rather than a priority indicating the investigation can wait until the next business hour.

Remedy:

Establish SLOs and SLIs that reflect your business goals. Alert on these SLOs/SLIs. These reactive alerts will form the base of your pyramid.

Total set of alerts incurs too much toil

In this situation, there may be a reasonable mix of alerts, but their combined volume causes operators to regularly disregard or fail to prioritize them.

Symptoms:

  • You receive too many paging alerts: your setup either has too many alerts, or a small pool of alerts that fire too often.
  • You receive too many non-paging alerts (for example, tickets), and frequently disregard them, declare “ticket bankruptcy” (close out a large chunk of tickets in order to start over from a clean slate), or otherwise allow them to fall into disrepair.

Remedy:

As shown in Figure 3, climb down the pyramid of alerting maturity by disabling alerts at higher levels of the hierarchy. Move toward the base as far as necessary to reach a manageable state. By mapping your alerts to the hierarchy, you will implicitly drop the less important alerts first.

Figure 3. Climb down the pyramid of alerting maturity. Source: Google.


Inactionable alerts have a cost: if too many conditions trigger alerts, it becomes increasingly difficult to prioritize the most important alerts in a sea of noise. This results in worse uptime and more toil and burnout, harming both short- and long-term system reliability. You’ll want to climb up the pyramid again as you gain more control over the system, but only after you have reached and sustained a reasonable level of alerting—for example, for a month or longer.
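
The "climb down the pyramid" remedy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weekly alert counts per level and the idea of a single numeric alert budget are hypothetical, and real setups would likely weigh paging and non-paging alerts differently.

```python
from typing import Dict, List, Tuple

# Levels ordered from the base of the pyramid upward.
LEVELS = ["reactive", "proactive", "investigative"]


def climb_down(weekly_volume: Dict[str, int], budget: int) -> Tuple[List[str], List[str]]:
    """Disable levels from the top down until total volume fits the budget.

    The base level (reactive) is never disabled, even if it alone
    exceeds the budget; that case calls for fixing the alerts themselves.
    """
    enabled = list(LEVELS)
    while len(enabled) > 1 and sum(weekly_volume.get(level, 0) for level in enabled) > budget:
        enabled.pop()  # Drop the highest remaining level first.
    disabled = LEVELS[len(enabled):]
    return enabled, disabled
```

With hypothetical volumes of 4 reactive, 10 proactive, and 30 investigative alerts per week and a budget of 15, this keeps the reactive and proactive levels and disables only the investigative one. Once the volume has stayed manageable for a sustained period, the disabled levels can be re-enabled one at a time.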

SLO/SLI alerting is insufficient to discover common outages

Although you've set up SLO-based alerting, you're often notified of outages that threaten business goals by some other alerting—proactive alerts, investigative alerts, or even just user escalations.

Symptom:

More than one or two of your past 10 postmortems resulted from issues not captured in SLO alerts.
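
One way to check this symptom against your own postmortem records is sketched below. The input format (one boolean per postmortem, oldest first, indicating whether an SLO alert caught the issue) and the default threshold of two misses are assumptions for illustration; the text's "one or two" leaves the exact cutoff to your judgement.

```python
from typing import Sequence


def slo_coverage_gap(caught_by_slo: Sequence[bool], window: int = 10, max_misses: int = 2) -> bool:
    """Return True if too many recent postmortems were missed by SLO alerts.

    caught_by_slo lists postmortems oldest first; only the last
    `window` entries are considered.
    """
    recent = list(caught_by_slo)[-window:]
    misses = sum(1 for caught in recent if not caught)
    return misses > max_misses
```

If three of the last ten postmortems were not captured by SLO alerts, `slo_coverage_gap` returns `True` with the default threshold, which suggests revisiting the SLIs rather than adding alerts higher in the hierarchy.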

Remedy:

Resist the urge to craft more investigative and/or proactive alerts—your problem lies in the bottom layer of your alerting hierarchy, not the upper layers.

Instead, figure out why you experienced an outage that's not covered by your SLIs. Either:

  • You need to augment your SLIs to better cover the business aspects of your service, or
  • Your definition of an outage is overly sensitive and needs to be reexamined.

Cause-based alerts aren’t wrong!

...they're just misunderstood. As described in the first anti-pattern above, cause-based alerts are problematic if they're the only type of alerting you use. When used in conjunction with other types of alerts, cause-based notifications can provide a deeper understanding of the system and improve its robustness. But your aggregated alerting must not lead to toil, and when it does, we recommend revisiting your alerting strategy.

All too frequently, we treat alerting as a haunted graveyard: we shy away from removing an alert (because what if it catches something important?!) just as much as we shy away from enabling a new alert (because what if it causes toil?!).

Throughout SRE history, alerting setups have been built based mostly on opinion and conjecture, rather than by actually analyzing which setups best serve the end goal of running reliable services with minimal toil. An alerting hierarchy allows us to approach alerting as a path we can walk, describing what is good or bad in a given situation instead of in absolutes. With this framework, we can iterate and improve our alerting based on data, rather than sound bites.

Share your alerting experiences with us on Twitter at @googlesre.

This post is part of a collaboration between O’Reilly and Google. See our statement of editorial independence.

Continue reading Reduce toil through better alerting.

Categories: Technology

Four short links: 3 June 2019

O'Reilly Radar - Mon, 2019/06/03 - 03:55

Differential Privacy, Future Op-Ed, Spectroscopy, and Research Programming Environment

  1. Differential Privacy in the Census (Science) -- Differential privacy, first described in 2006, isn’t a substitute for swapping and other ways to perturb the data. Rather, it allows someone—in this case, the Census Bureau—to measure the likelihood that enough information will “leak” from a public data set to open the door to reconstruction. “Any time you release a statistic, you’re leaking something,” explains Jerry Reiter, a professor of statistics at Duke University in Durham, North Carolina, who has worked on differential privacy as a consultant with the Census Bureau. “The only way to absolutely ensure confidentiality is to release no data. So the question is, how much risk is OK? Differential privacy allows you to put a boundary” on that risk.
  2. It’s 2059, and the Rich Kids Are Still Winning (NYT) -- Ted Chiang's "Op-Ed From the Future." (via Slashdot)
  3. Classification of Household Materials via Spectroscopy -- we collected a data set of spectral measurements from two commercially available spectrometers during which a robotic platform interacted with 50 distinct objects, and we show that a residual neural network can accurately analyze these measurements. Due to the low variance in consecutive spectral measurements, our model achieved a material classification accuracy of 97.7% when given only one spectral sample per object.
  4. Flowsheets -- A research prototype programming environment for making programs while seeing the data the program outputs. See the demo video.

Continue reading Four short links: 3 June 2019.

Categories: Technology

Four short links: 31 May 2019

O'Reilly Radar - Fri, 2019/05/31 - 04:45

Google Blocking Ad Blocking, Security Checklist, Maturity Model, and Software Engineering

  1. Google to Restrict Modern Ad Blocking Chrome Extensions to Enterprise Users (9 to 5 Google) -- modern ad blockers, like uBlock Origin and Ghostery, use Chrome’s webRequest API to block ads before they’re even downloaded. With the Manifest V3 proposal, Google deprecates the webRequest API’s ability to block a particular request before it’s loaded. As you would expect, power users and extension developers alike criticized Google’s proposal for limiting the user’s ability to browse the web as they see fit. [...] "Google’s primary business is incompatible with unimpeded content blocking. Now that Google Chrome product has achieve high market share, the content blocking concerns as stated in its 10K filing are being tackled." See also Switch to Firefox.
  2. SaaS CTO Security Checklist -- useful security tips, arranged by stage of your startup.
  3. Proposing a Maturity Model for Digital Services (David Eaves) -- an interesting approach: describes the aspects of maturity (political environment, institutional capacity, delivery capability, skills and hiring, user centered design, cross-government platforms) and then has a rubric for the different aspects of each of them.
  4. Notes to Myself on Software Engineering (François Chollet) -- Technology is never neutral. If your work has any impact on the world, then this impact has a moral direction. The seemingly innocuous technical choices we make in software products modulate the terms of access to technology, its usage incentives, who will benefit, and who will suffer. Technical choices are also ethical choices. Thus, always be deliberate and explicit about the values you want your choices to support. Design for ethics. Bake your values into your creations. Never think, I’m just building the capability; that in itself is neutral. It is not, because the way you build it determines how it will get used. The whole list is great, and resonates strongly with my experience.

Continue reading Four short links: 31 May 2019.

Categories: Technology
