Feed aggregator

2021 Data/AI Salary Survey

O'Reilly Radar - Wed, 2021/09/15 - 04:32

In June 2021, we asked the recipients of our Data & AI Newsletter to respond to a survey about compensation. The results gave us insight into what our subscribers are paid, where they’re located, which industries they work in, what their concerns are, and what sorts of career development opportunities they’re pursuing.

While it’s sadly premature to say that the survey took place at the end of the COVID-19 pandemic (though we can all hope), it took place at a time when restrictions were loosening: we were starting to go out in public, have parties, and in some cases even attend in-person conferences. The results then provide a place to start thinking about what effect the pandemic had on employment. There was a lot of uncertainty about stability, particularly at smaller companies: Would the company’s business model continue to be effective? Would your job still be there in a year? At the same time, employees were reluctant to look for new jobs, especially if they would require relocating—at least according to the rumor mill. Were those concerns reflected in new patterns for employment?

Executive Summary
  • The average salary for data and AI professionals who responded to the survey was $146,000.
  • The average change in compensation over the last three years was $9,252. This corresponds to an annual increase of 2.25%. However, 8% of the respondents reported decreased compensation, and 18% reported no change.
  • We don’t see evidence of a “great resignation.” 22% of respondents said they intended to change jobs, roughly what we would have expected. Respondents seemed concerned about job security, probably because of the pandemic’s effect on the economy.
  • Average compensation was highest in California ($176,000), followed by Eastern Seaboard states like New York and Massachusetts.
  • Compensation for women was significantly lower than for men: on average, women earned 84% of men’s salaries. Salaries were lower regardless of education or job title. Women were more likely than men to have advanced degrees, particularly PhDs.
  • Many respondents acquired certifications. Cloud certifications, specifically in AWS and Microsoft Azure, were most strongly associated with salary increases.
  • Most respondents participated in training of some form. Learning new skills and improving old ones were the most common reasons for training, though hireability and job security were also factors. Company-provided training opportunities were most strongly associated with pay increases.
Demographics

The survey was publicized through O’Reilly’s Data & AI Newsletter and was limited to respondents in the United States and the United Kingdom. There were 3,136 valid responses, 2,778 from the US and 284 from the UK. This report focuses on the respondents from the US, with only limited attention paid to those from the UK. A small number of respondents (74) identified as residents of the US or UK, but their IP addresses indicated that they were located elsewhere. We didn’t use the data from these respondents; in practice, discarding this data had no effect on the results.

Of the 2,778 US respondents, 2,225 (81%) identified as men and 383 (14%) as women, based on their preferred pronouns; 113 (4%) chose “other,” and 14 (0.5%) used “they.”

The results are biased by the survey’s recipients (subscribers to O’Reilly’s Data & AI Newsletter). Our audience is particularly strong in the software (20% of respondents), computer hardware (4%), and computer security (2%) industries—over 25% of the total. Our audience is also strong in the states where these industries are concentrated: 42% of the US respondents lived in California (20%), New York (9%), Massachusetts (6%), and Texas (7%), though these states only make up 27% of the US population.

Compensation Basics

The average annual salary for employees who worked in data or AI was $146,000. Most salaries were between $100,000 and $150,000 yearly (34%); the next most common salary tier was from $150,000 to $200,000 (26%). Compensation depended strongly on location, with average salaries highest in California ($176,000).

The average salary change over the past three years was $9,252, which works out to roughly 2.25% per year (assuming a final salary equal to the $146,000 average). A small number of respondents (8%) reported salary decreases, and 18% reported no change. Economic uncertainty caused by the pandemic may be responsible for the declines in compensation. 19% reported increases of $5,000 to $10,000 over that period; 14% reported increases of over $25,000. A study by the IEEE suggests that the average salary for technical employees increased 3.6% per year, higher than our respondents indicated.
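If you want to check the 2.25% figure yourself, here’s a minimal back-of-the-envelope sketch in Python. It uses only the two numbers reported above and the survey’s own assumption that the final salary equals the $146,000 average; everything else is simple arithmetic.

```python
# Rough check of the ~2.25% annual increase, assuming the final salary
# equals the survey's $146,000 average (the survey's stated assumption).

final_salary = 146_000          # average salary reported in the survey
three_year_increase = 9_252     # average change over the past three years

starting_salary = final_salary - three_year_increase

# Simple (non-compounded) annual rate: total percentage change divided by 3.
simple_annual = three_year_increase / starting_salary / 3

# Compound annual growth rate, for comparison.
cagr = (final_salary / starting_salary) ** (1 / 3) - 1

print(f"simple annual increase: {simple_annual:.2%}")   # ~2.26%
print(f"compound annual growth: {cagr:.2%}")            # ~2.21%
```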

39% of respondents reported promotions in the past three years, and 37% reported changing employers during that period. 22% reported that they were considering changing jobs because their salaries hadn’t increased during the past year. Is this a sign of what some have called a “great resignation”? Common wisdom has it that technical employees change jobs every three to four years. LinkedIn and Indeed both recommend staying for at least three years, though they observe that younger employees change jobs more often. LinkedIn elsewhere states that the annual turnover rate for technology employees is 13.2%—which, at steady state, implies an average tenure of roughly 1/0.132, or about seven and a half years. If that’s correct, the 37% who changed jobs over three years seems about right, and the 22% who said they “intend to leave their job due to a lack of compensation increase” doesn’t seem overly high. Keep in mind that intent to change and actual change are not the same—and that there are many reasons to change jobs aside from salary, including flexibility around working hours and working from home.

64% of the respondents took part in training or obtained certifications in the past year, and 31% reported spending over 100 hours in training programs, ranging from formal graduate degrees to reading blog posts. As we’ll see later, cloud certifications (specifically in AWS and Microsoft Azure) were the most popular and appeared to have the largest effect on salaries.

The reasons respondents gave for participating in training were surprisingly consistent. The vast majority reported that they wanted to learn new skills (91%) or improve existing skills (84%). Data and AI professionals are clearly interested in learning—and that learning is self-motivated, not imposed by management. Relatively few (22%) said that training was required by their job, and even fewer participated in training because they were concerned about losing their job (9%).

However, there were other motives at work. 56% of our respondents said that they wanted to increase their “job security,” which is at odds with the low number who were concerned about losing their job. And 73% reported that they engaged in training or obtained certifications to increase their “hireability,” which may suggest more concern about job stability than our respondents would admit. The pandemic was a threat to many businesses, and employees were justifiably concerned that their job could vanish after a bad pandemic-influenced quarter. A desire for increased hireability may also indicate that we’ll see more people looking to change jobs in the near future.

Finally, 61% of the respondents said that they participated in training or earned certifications because they wanted a salary increase or a promotion (“increase in job title/responsibilities”). It isn’t surprising that employees see training as a route to promotion—especially as companies that want to hire in fields like data science, machine learning, and AI contend with a shortage of qualified employees. Given the difficulty of hiring expertise from outside, we expect an increasing number of companies to grow their own ML and AI talent internally using training programs.

Salaries by Gender

To nobody’s surprise, our survey showed that data science and AI professionals are mostly male. The number of respondents tells the story by itself: only 14% identified as women, which is lower than we’d have guessed, though it’s roughly consistent with our conference attendance (back when we had live conferences) and roughly equivalent to other technical fields. A small number (5%) reported their preferred pronoun as “they” or Other, but this sample was too small to draw any significant comparisons about compensation.

Women’s salaries were sharply lower than men’s salaries, averaging $126,000 annually, or 84% of the average salary for men ($150,000). That differential held regardless of education, as Figure 1 shows: the average salary for a woman with a doctorate or master’s degree was 82% of the salary for a man with an equivalent degree. The gap wasn’t quite as wide for people with bachelor’s degrees or for students, but it was still significant: women in those groups earned 86% or 87% of the average salary for men. The gap was widest among people who were self-taught: there, women’s salaries were 72% of men’s. An associate’s degree was the only degree for which women’s salaries were higher than men’s.

Figure 1. Women’s and men’s salaries by degree

Despite the salary differential, a higher percentage of women had advanced degrees than men: 16% of women had a doctorate, as opposed to 13% of men. And 47% of women had a master’s degree, as opposed to 46% of men. (If those percentages seem high, keep in mind that many professionals in data science and AI are escapees from academia.)

Women’s salaries also lagged men’s salaries when we compared women and men with similar job titles (see Figure 2). At the executive level, the average salary for women was $163,000 versus $205,000 for men (a 20% difference). At the director level, the difference was much smaller—$180,000 for women versus $184,000 for men—and women directors actually out-earned women executives. It’s easy to hypothesize about this difference, but we’re at a loss to explain it. For managers, women’s salaries were $143,000 versus $154,000 for men (a 7% difference).

Career advancement is also an issue: 18% of the women who participated in the survey were executives or directors, compared with 23% of the men.

Figure 2. Women’s and men’s salaries by job title

Before moving on from our consideration of the effect of gender on salary, let’s take a brief look at how salaries changed over the past three years. As Figure 3 shows, the percentage of men and women respondents who saw no change was virtually identical (18%). But more women than men saw their salaries decrease (10% versus 7%). Correspondingly, more men saw their salaries increase. Women were also more likely to have a smaller increase: 24% of women had an increase of under $5,000 versus 17% of men. At the high end of the salary spectrum, the difference between men and women was smaller, though still not zero: 19% of men saw their salaries increase by over $20,000, but only 18% of women did. So the most significant differences were in the midrange. One anomaly sticks out: a slightly higher percentage of women than men received salary increases in the $15,000 to $20,000 range (8% versus 6%).

Figure 3. Change in salary for women and men over three years

Salaries by Programming Language

When we looked at the most popular programming languages for data and AI practitioners, we didn’t see any surprises: Python was dominant (61%), followed by SQL (54%), JavaScript (32%), HTML (29%), Bash (29%), Java (24%), and R (20%). C++, C#, and C were further back in the list (12%, 12%, and 11%, respectively).

Discussing the connection between programming languages and salary is tricky because respondents were allowed to check multiple languages, and most did. But when we looked at the languages associated with the highest salaries, we got a significantly different list. The most widely used and popular languages, like Python ($150,000), SQL ($144,000), Java ($155,000), and JavaScript ($146,000), were solidly in the middle of the salary range. The outliers were Rust, which had the highest average salary (over $180,000), Go ($179,000), and Scala ($178,000). Other less common languages associated with high salaries were Erlang, Julia, Swift, and F#. Web languages (HTML, PHP, and CSS) were at the bottom (all around $135,000). See Figure 4 for the full list.

Figure 4. Salary vs. programming language

How do we explain this? It’s difficult to say that data and AI developers who use Rust command a higher salary, since most respondents checked several languages. But we believe that this data shows something significant. The supply of talent for newer languages like Rust and Go is relatively small. While there may not be a huge demand for data scientists who use these languages (yet), there’s clearly some demand—and with experienced Go and Rust programmers in short supply, they command a higher salary. Perhaps it is even simpler: regardless of the language someone will use at work, employers interpret knowledge of Rust and Go as a sign of competence and willingness to learn, which increases candidates’ value. A similar argument can be made for Scala, which is the native language for the widely used Spark platform. Languages like Python and SQL are table stakes: an applicant who can’t use them could easily be penalized, but competence doesn’t confer any special distinction.

One surprise is that 10% of the respondents said that they didn’t use any programming languages. We’re not sure what that means. It’s possible they worked entirely in Excel, which should be considered a programming language but often isn’t. It’s also possible that they were managers or executives who no longer did any programming.

Salaries by Tool and Platform

We also asked respondents what tools they used for statistics and machine learning and what platforms they used for data analytics and data management. We observed some of the same patterns that we saw with programming languages. And the same caution applies: respondents were allowed to select multiple answers to our questions about the tools and platforms that they use. (However, multiple answers weren’t as frequent as for programming languages.) In addition, if you’re familiar with tools and platforms for machine learning and statistics, you know that the boundary between them is fuzzy. Is Spark a tool or a platform? We considered it a platform, though two Spark libraries are in the list of tools. What about Kafka? A platform, clearly, but a platform for building data pipelines that’s qualitatively different from a platform like Ray, Spark, or Hadoop.

Just as with programming languages, we found that the most widely used tools and platforms were associated with midrange salaries; older tools, even if they’re still widely used, were associated with lower salaries; and some of the tools and platforms with the fewest users corresponded to the highest salaries. (See Figure 5 for the full list.)

The most common responses to the question about tools for machine learning or statistics were “I don’t use any tools” (40%) or Excel (31%). Ignoring the question of how one does machine learning or statistics without tools, we’ll only note that those who didn’t use tools had an average salary of $143,000, and Excel users had an average salary of $138,000—both below average. Stata ($120,000) was also at the bottom of the list; it’s an older package with relatively few users and is clearly falling out of favor.

The popular machine learning packages PyTorch (19% of users, $166,000 average salary), TensorFlow (20%, $164,000), and scikit-learn (27%, $157,000) occupied the middle ground. Those salaries were above the average for all respondents, which was pulled down by the large numbers who didn’t use tools or only used Excel. The highest salaries were associated with H2O (3%, $183,000), KNIME (2%, $180,000), Spark NLP (5%, $179,000), and Spark MLlib (8%, $175,000). It’s hard to trust conclusions based on 2% or 3% of the respondents, but it appears that salaries are higher for people who work with tools that have a lot of “buzz” but aren’t yet widely used. Employers pay a premium for specialized expertise.

Figure 5. Average salary by tools for statistics or machine learning

We see almost exactly the same thing when we look at data frameworks (Figure 6). Again, the most common response was from people who didn’t use a framework; that group also received the lowest salaries (30% of respondents, $133,000 average salary).

In 2021, Hadoop often seems like legacy software, but 15% of the respondents were working on the Hadoop platform, with an average salary of $166,000. That was above the average salary for all users and at the low end of the midrange for salaries sorted by platform.

The highest salaries were associated with Clicktale (now ContentSquare), a cloud-based analytics system for researching customer experience: only 0.2% of respondents use it, but they have an average salary of $225,000. Other frameworks associated with high salaries were Tecton (the commercial version of Michelangelo, at $218,000), Ray ($191,000), and Amundsen ($189,000). These frameworks had relatively few users—the most widely used in this group was Amundsen with 0.8% of respondents (and again, we caution against reading too much into results based on so few respondents). All of these platforms are relatively new, frequently discussed in the tech press and social media, and appear to be growing healthily. Kafka, Spark, Google BigQuery, and Dask were in the middle, with a lot of users (15%, 19%, 8%, and 5%) and above-average salaries ($179,000, $172,000, $170,000, and $170,000). Again, the most popular platforms occupied the middle of the range; experience with less frequently used and growing platforms commanded a premium.

Figure 6. Average salary by data framework or platform

Salaries by Industry

The greatest number of respondents worked in the software industry (20% of the total), followed by consulting (11%) and healthcare, banking, and education (each at 8%). Relatively few respondents listed themselves as consultants (2%), though consultancy tends to be cyclic, depending on current thinking on outsourcing, tax law, and other factors. The average income for consultants was $150,000, which is only slightly higher than the average for all respondents ($146,000). That may indicate that we’re currently in some kind of equilibrium between consultants and in-house talent.

While data analysis has become essential to every kind of business and AI is finding many applications outside of computing, salaries were highest in the computer industry itself, as Figure 7 makes clear. For our purposes, the “computer industry” was divided into four segments: computer hardware, cloud services and hosting, security, and software. Average salaries in these industries ranged from $171,000 (for computer hardware) to $164,000 (for software). Salaries for the advertising industry (including social media) were surprisingly low, only $150,000.

Figure 7. Average salary by industry

Education and nonprofit organizations (including trade associations) were at the bottom end of the scale, with compensation just above $100,000 ($106,000 and $103,000, respectively). Salaries for technical workers in government were slightly higher ($124,000).

Salaries by State

When looking at data and AI practitioners geographically, there weren’t any big surprises. The states with the most respondents were California, New York, Texas, and Massachusetts. California accounted for 19% of the total, with over double the number of respondents from New York (8%). To understand how these four states dominate, remember that they make up 42% of our respondents but only 27% of the United States’ population.

Salaries in California were the highest, averaging $176,000. The Eastern Seaboard did well, with an average salary of $157,000 in Massachusetts (second highest). New York, Delaware, New Jersey, Maryland, and Washington, DC, all reported average salaries in the neighborhood of $150,000 (as did North Dakota, with five respondents). The average salary reported for Texas was $148,000, which is slightly above the national average but nevertheless seems on the low side for a state with a significant technology industry.

Salaries in the Pacific Northwest were not as high as we expected. Washington just barely made it into the top 10 in terms of the number of respondents, and average salaries in Washington and Oregon were $138,000 and $133,000, respectively. (See Figure 8 for the full list.)

The highest-paying jobs, with salaries over $300,000, were concentrated in California (5% of the state’s respondents) and Massachusetts (4%). There were a few interesting outliers: North Dakota and Nevada both had very few respondents, but each had one respondent making over $300,000. In Nevada, we’re guessing that’s someone who works for the casino industry—after all, the origins of probability and statistics are tied to gambling. Most states had no respondents with compensation over $300,000.

Figure 8. Average salary by state

The lowest salaries were, for the most part, from states with the fewest respondents. We’re reluctant to say more than that. These states typically had under 10 respondents, which means that averaging salaries is extremely noisy. For example, Alaska only had two respondents and an average salary of $75,000; Mississippi and Louisiana each had only five respondents, and Rhode Island only three. In any of these states, one or two additional respondents at the executive level would have a huge effect on the state’s average. Furthermore, the averages in those states are so low that all (or almost all) respondents must be students, interns, or in entry-level positions. So we don’t think we can make any statement stronger than “the high-paying jobs are where you’d expect them to be.”

Job Change by Salary

Despite the differences between states, we found that the desire to change jobs because of compensation didn’t depend significantly on geography. There were outliers at both extremes, but they were all in states where the number of respondents was small and one or two people looking to change jobs would make a significant difference. It’s not terribly interesting to say that 24% of respondents from California intend to change jobs (only two points above the national average); after all, you’d expect California to dominate. There may be a small signal from states like New York, with 232 respondents, of whom 27% intend to change jobs, or from a state like Virginia, with 137 respondents, of whom only 19% were thinking of changing. But again, these numbers aren’t much different from the total percentage of possible job changers.

If intent to change jobs due to compensation isn’t dependent on location, then what does it depend on? Salary. It’s not at all surprising that respondents with the lowest salaries (under $50,000/year) are highly motivated to change jobs (29%); this group is composed largely of students, interns, and others who are starting their careers. The group that showed the second highest desire to change jobs, however, had the highest salaries: over $400,000/year (27%). It’s an interesting pairing: those with the highest and lowest salaries were most intent on getting a salary increase.

26% of those with annual salaries between $50,000 and $100,000 indicated that they intend to change jobs because of compensation. For the remainder of the respondents (those with salaries between $100,000 and $400,000), the percentage who intend to change jobs was 22% or lower.

Salaries by Certification

Over a third of the respondents (37%) replied that they hadn’t obtained any certifications in the past year. The next biggest group replied “other” (14%), meaning that they had obtained certifications in the past year but not one of the certifications we listed. We allowed them to write in their own responses, and they shared 352 unique answers, ranging from vendor-specific certifications (e.g., DataRobot) to university degrees (e.g., University of Texas) to well-established certifications in any number of fields (e.g., Certified Information Systems Security Professional, a.k.a. CISSP). While there were certainly cases where respondents used different words to describe the same thing, the number of unique write-in responses reflects just how many certifications are available.

Cloud certifications were by far the most popular. The top certification was for AWS (3.9% obtained AWS Certified Solutions Architect-Associate), followed by Microsoft Azure (3.8% had AZ-900: Microsoft Azure Fundamentals), then two more AWS certifications and CompTIA’s Security+ certification (1% each). Keep in mind that 1% only represents 27 respondents, and all the other certifications had even fewer respondents.

As Figure 9 shows, the highest salaries were associated with AWS certifications, the Microsoft AZ-104 (Azure Administrator Associate) certification, and the CISSP security certification. The average salary for people listing these certifications was higher than the average salary for US respondents as a whole. And the average salary for respondents who wrote in a certification was slightly above the average for those who didn’t earn any certifications ($149,000 versus $143,000).

Figure 9. Average salary by certification earned

Certifications were also associated with salary increases (Figure 10). Again AWS and Microsoft Azure dominate, with Microsoft’s AZ-104 leading the way, followed by three AWS certifications. And on the whole, respondents with certifications appear to have received larger salary increases than those who didn’t earn any technical certifications.

Figure 10. Average salary change by certification

Google Cloud is an obvious omission from this story. While Google is the third-most-important cloud provider, only 26 respondents (roughly 1%) claimed any Google certification, all under the “Other” category.

Among our respondents, security certifications were relatively uncommon and didn’t appear to be associated with significantly higher salaries or salary increases. Cisco’s CCNP was associated with higher salary increases; respondents who earned the CompTIA Security+ or CISSP certifications received smaller increases. Does this reflect that management undervalues security training? If this hypothesis is correct, undervaluing security is clearly a significant mistake, given the ongoing importance of security and the possibility of new attacks against AI and other data-driven systems.

Cloud certifications clearly had the greatest effect on salary increases. With very few exceptions, any certification was better than no certification: respondents who wrote in a certification under “Other” averaged a $9,600 salary increase over the last few years, as opposed to $8,900 for respondents who didn’t obtain a certification and $9,300 for all respondents regardless of certification.

Training

Participating in training was associated with salary increases—but the largest increases went to those who spent more than 100 hours in a training program. As Figure 11 shows, those respondents had an average salary increase of $11,000. This was also the largest group of respondents (19%). Respondents who reported only 1–19 hours of training (8%) saw smaller salary increases, averaging $7,100. Interestingly, that is less than the average increase for those who didn’t participate in training at all. We won’t speculate about why, but the data does make one thing clear: if you engage in training, be serious about it.

Figure 11. Average salary change vs. hours of training

We also asked what types of training respondents engaged in: whether it was company provided (for which there were three alternatives), a certification program, a conference, or some other kind of training (detailed in Figure 12). Respondents who took advantage of company-provided opportunities had the highest average salaries ($156,000, $150,000, and $149,000). Those who obtained certifications were next ($148,000). The results are similar if we look at salary increases over the past three years: those who participated in various forms of company-offered training received increases between $10,000 and $11,000. Salary increases for respondents who obtained a certification were in the same range ($11,000).

Figure 12. Average salary change vs. type of training

The Last Word

Data and AI professionals—a rubric under which we include data scientists, data engineers, and specialists in AI and ML—are well-paid, reporting an average salary just under $150,000. However, there were sharp state-by-state differences: salaries were significantly higher in California, though the Northeast (with some exceptions) did well.

There were also significant differences between salaries for men and women. Men’s salaries were higher regardless of job title, regardless of training and regardless of academic degrees—even though women were more likely to have an advanced academic degree (PhD or master’s degree) than were men.

We don’t see evidence of a “great resignation.” Job turnover through the pandemic was roughly what we’d expect (perhaps slightly below normal). Respondents did appear to be concerned about job security, though they didn’t want to admit it explicitly. But with the exception of the least- and most-highly compensated respondents, the intent to change jobs because of salary was surprisingly consistent and nothing to be alarmed at.

Training was important, in part because it was associated with hireability and job security but more because respondents were genuinely interested in learning new skills and improving current ones. Cloud training, particularly in AWS and Microsoft Azure, was the most strongly associated with higher salary increases.

But perhaps we should leave the last word to our respondents. The final question in our survey asked what areas of technology would have the biggest effect on salary and promotions in the coming year. It wasn’t a surprise that most of the respondents said machine learning (63%)—these days, ML is the hottest topic in the data world. It was more of a surprise that “programming languages” was noted by just 34% of respondents. (Only “Other” received fewer responses—see Figure 13 for full details.) Our respondents clearly aren’t impressed by programming languages, even though the data suggests that employers are willing to pay a premium for Rust, Go, and Scala.

There’s another signal worth paying attention to if we look beyond the extremes. Data tools, cloud and containers, and automation were nearly tied (46%, 47%, and 44%, respectively). The cloud and containers category includes tools like Docker and Kubernetes, cloud providers like AWS and Microsoft Azure, and disciplines like MLOps. The tools category includes tools for building and maintaining data pipelines, like Kafka. “Automation” can mean a lot of things but in this context probably means automated training and deployment.

Figure 13. What technologies will have the biggest effect on compensation in the coming year?

We’ve argued for some time that operations—successfully deploying and managing applications in production—is the biggest issue facing ML practitioners in the coming years. If you want to stay on top of what’s happening in data, and if you want to maximize your job security, hireability, and salary, don’t just learn how to build AI models; learn how to deploy applications that live in the cloud.

In the classic movie The Graduate, one character famously says, “There’s a great future in plastics. Think about it.” In 2021, and without being anywhere near as repulsive, we’d say, “There’s a great future in the cloud. Think about it.”

Categories: Technology

Topic for meeting on 9/9

PLUG - Wed, 2021/09/08 - 11:01

This month we'll have "Thank You" from der.hans.
Attend the meeting by visiting https://lufthans.bigbluemeeting.com/b/plu-yuk-7xx on the 9th of September at 7pm MST

Description
Thank you to all the developers, translators, documenters, artists and maintainers who make Free Software.
Thank you to all the project managers, mailing list admins, SREs and toolsmiths who enable the making.
Thanks to the people who use Free Software and do amazing things with it.
This talk will use stories and anecdotes to explore the thanks we can have for the Free Software community.

About der.hans:
der.hans works remotely in the US.
In the FLOSS community he is active with conferences and local groups.
He's chairman of PLUG, chair for SeaGL Finance committee, founder of SeaGL Career Expo, RaiseMe career counselor, BoF organizer for the Southern California Linux Expo (SCaLE) and founder of the Free Software Stammtisch.
He speaks regularly at community-led FLOSS conferences such as SeaGL, Penguicon, SCaLE, LCA, FOSSASIA, Tübix, CLT, Kieler Linux Tage, LFNW, OLF, SELF and GeekBeacon Fest.
Hans is available to speak remotely for local groups.
Currently a Customer Data Engineer at Object Rocket. Public statements are not representative of $dayjob.
Fediverse/Mastodon - https://floss.social/@FLOX_advocate

Radar trends to watch: September 2021

O'Reilly Radar - Wed, 2021/09/01 - 05:18

Let’s start with a moment of silence for O’Reilly Author Toby Segaran, who passed away on August 11, 2021.  Toby was one of the people who got the Data Science movement started. His book, Programming Collective Intelligence, taught many how to start using their data. Throughout his career, he mentored many, and was particularly influential in mentoring young women interested in science and technology. Toby is greatly missed by everyone in the Data Science community.

AI and Data
  • Margaret Mitchell joins HuggingFace to create tools to help build fair algorithms.
  • Embedded Machine Learning for Hard Hat Detection is an interesting real-world application of AI on the edge. Wearing hard hats is essential to work site safety; this project developed a model for detecting whether workers were wearing hard hats that could easily be deployed without network connectivity. It also goes into rebalancing datasets (in this case, public datasets with too few hard hats), a technique that applies to other instances of bias; a minimal rebalancing sketch appears after this list.
  • Liquid Neural Networks are neural networks that can adapt in real time to incoming data.  They are particularly useful for time series data–which, as the author points out, is almost all data.
  • US Government agencies plan to increase their use of facial recognition, in many cases for law enforcement, despite well-known accuracy problems for minorities and women.  Local bans on face recognition cannot prohibit federal use.
  • Data and Politics is an ongoing research project that studies how political organizations are collecting and using data.
  • FaunaDB is a distributed document database designed for serverless architectures. It comes with REST API support, GraphQL, built-in attribute based access control, and a lot of other great features.
  • Facial expression recognition is being added to a future version of Android as part of their accessibility package. Developers can create applications where expressions (smiles, etc.) can be used as commands.
  • OpenAI’s Codex (the technology behind Copilot) takes the next step: translating English into runnable code, rather than making suggestions. Codex is now in private beta.
  • Who is responsible for publicly available datasets, and how do you ensure that they’re used appropriately? Margaret Mitchell suggests organizations for data stewardship. These would curate, maintain, and enforce legal standards for the use of public data.
  • An AI system can predict race accurately based purely on medical images, with no other information about the subject. This creates huge concerns about how bias could enter AI-driven diagnostics; but it also raises the possibility that we might discover better treatments for minorities who are underserved (or badly served) by the medical industry.
  • DeepMind has made progress in building a generalizable AI: AI agents that can solve problems that they have never seen before, and transfer learning from one problem to another. They have developed XLand, an environment that creates games and problems, to enable this research.
  • GPT-J is one of a number of open source alternatives to Github Copilot. It is smaller and faster, and appears to be at least as good.
  • “Master faces” are images generated by adversarial neural networks that are capable of passing facial recognition tests without corresponding to any specific face.
  • Researchers have created a 3D map of a small part of a mouse’s brain. This is the most detailed map of how neurons connect that has ever been made.  The map contains 75,000 neurons and 523 million synapses; the map and the data set have been released to the public.
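The hard hat item above mentions rebalancing a dataset in which one class is underrepresented. Here’s a minimal sketch of one simple approach, random oversampling of the minority class; the label names and the strategy are illustrative assumptions, and real projects often prefer data augmentation or class-weighted losses instead.

```python
# Minimal sketch: randomly duplicate minority-class samples until every
# class has as many samples as the largest one. Labels here are made up
# for illustration (hard_hat vs. no_hard_hat).

import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Return a class-balanced copy of (samples, labels) via random oversampling."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)

    balanced_samples, balanced_labels = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for s in items + extra:
            balanced_samples.append(s)
            balanced_labels.append(y)
    return balanced_samples, balanced_labels

# Toy example: far fewer "hard_hat" images than "no_hard_hat" images.
images = [f"img_{i}.jpg" for i in range(10)]
labels = ["hard_hat"] * 2 + ["no_hard_hat"] * 8

_, new_labels = oversample(images, labels)
print(Counter(new_labels))   # both classes now have 8 samples
```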
Robotics
  • Robotic chameleons (or chameleon robotics): Researchers have developed a robotic “skin” that can change color in real time to match its surroundings.
  • Elon Musk announces that Tesla will release a humanoid robot next year; it will be capable of performing tasks like going to the store. Is this real, or just a distraction from investigations into the safety of Tesla’s autonomous driving software?
  • According to the UN, lethal autonomous robots (robots capable of detecting and attacking a target without human intervention) have been deployed and used by the Libyan government.
  • A new generation of warehouse robots is capable of simple manipulation (picking up and boxing objects); robots capable of more fine-grained manipulation are coming.
Security
  • The end of passwords draws even closer. GitHub is now requiring 2-factor authentication, preferably using WebAuthn or Yubikey. Amazon will be giving free USB authentication keys to some customers (root account owners spending over $100/month).
  • There are many vulnerabilities in charging systems for electric vehicles. This is sad, but not surprising: the automotive industry hasn’t learned from the problems of IoT security.
  • Advances in cryptography may make it more efficient to do computation without decrypting encrypted data.
  • Amazon is offering store credit to people who give them their palm prints, for use in biometric checkout at their brick-and-mortar stores.
  • Amazon, Google, Microsoft, and others join the US Joint Cyber Defense Collaborative to fight the spread of ransomware.
  • Apple will be scanning iPhones for images of child abuse.  Child abuse aside, this decision raises questions about cryptographic backdoors for government agencies and Apple’s long-standing marketing of privacy. If they can monitor for one thing, they can monitor for others, and can presumably be legally forced to do so.
  • Automating incident response: self-healing auto-remediation could be the next step in automating all the things, building more reliable systems, and eliminating the 3AM pager.
Hardware
  • Hearables are very small computers, worn in the ear, for which the only interface is a microphone, a speaker, and a network. They may have applications in education, music, real time translation (like Babelfish), and of course, next-generation hearing aids.
  • Timekeeping is an old and well-recognized problem in distributed computing. Facebook’s Time cards are an open-source (code and hardware) solution for accurate time keeping. The cards are PCIe bus cards (PC standard) and incorporate a satellite receiver and an atomic clock.
  • A new cellular board for IoT from Ray Ozzie’s company Blues Wireless is a very interesting product. It is easy to program (JSON in and out), interfaces easily to Raspberry Pi and other systems, and $49 includes 10 years of cellular connectivity.
Social Media
  • Researchers are using Google Trends data to identify COVID symptoms as a proxy for hospital data, since hospital data isn’t publicly available. The key is distinguishing between flu-like flu symptoms and flu-like COVID symptoms.
  • A topic-based approach to targeted advertising may be Google’s new alternative to tracking cookies, replacing the idea of assigning users to cohorts with similar behavior.
  • Facebook shares a little information about what’s most widely viewed on their network. It only covers the top 20 URLs and, given Facebook’s attempts to shut down researchers studying their behavior, qualifies as transparency theater rather than substance.
  • As an experiment, Twitter is allowing certain users to mark misleading content. They have not specified (and presumably won’t specify) how to become one of these users. The information they gain won’t be used directly for blocking misinformation, but to study how it propagates.
  • Banning as a service: It’s now possible to hire a company to get someone banned from Instagram and other social media. Not surprisingly, these organizations may be connected to organizations that specialize in restoring banned accounts.
  • Facebook may be researching ways to use some combination of AI and homomorphic encryption to place targeted ads on encrypted messages without decrypting them.
  • Inspired by the security community and bug bounties, Twitter offers a bounty to people who discover algorithmic bias.
Work
  • Facebook’s virtual reality workrooms could transform remote meetings by putting all the participants in a single VR conference room–assuming that all the participants are willing to wear goggles.
  • A survey shows that 70% of employees would prefer to work at home, even if it costs them in benefits, including vacation time and salaries.  Eliminating the commute adds up.
Cloud
  • Sky computing–the next step towards true utility computing–is essentially what we now call “multi cloud,” but with an inter-cloud layer that provides interoperability between cloud providers.
  • Thoughts on the future of the data stack as data starts to take advantage of cloud: how do organizations get beyond “lift and shift” and other early approaches to use clouds effectively?
Networks
  • Matrix is another protocol for decentralized messaging (similar in concept to Scuttlebutt) that appears to be getting some enterprise traction.
  • Using federated learning to build decentralized intelligent wireless communications systems that predict traffic patterns to help traffic management may be part of 6G.
  • How do you scale intelligence at the edge of the network? APIs, industrially hardened Linux systems, and Kubernetes adapted to small systems (e.g., K3S).
Miscellaneous
  • The EU is considering a law that would require cryptocurrency transactions to be traceable.  An EU-wide authority to prevent money laundering would have authority over cryptocurrencies.
  • Autocorrect errors in Excel are a problem in genomics: autocorrect modifies gene names, which are frequently “corrected” to dates.
  • Google may have created the first time crystals in a quantum computer. Time crystals are a theoretical construct that has a structure that constantly changes but repeats over time, without requiring additional energy.
Categories: Technology

Rebranding Data

O'Reilly Radar - Tue, 2021/08/24 - 07:16

There’s a flavor of puzzle in which you try to determine the next number or shape in a sequence. We’re living that now, but for naming the data field.  “Predictive analytics.” “Big Data.” “Data science.” “Machine learning.” “AI.” What’s next?

It’s hard to say.  These terms all claim to be different, but they are very much the same.  They are supersets, subsets, and Venn diagrams with a lot of overlap.  Case in point: machine learning used to be considered part of data science; now it’s seen as a distinct (and superior) field.  What gives?

Since the promise of “analyzing data for fun and profit” has proven so successful, it’s odd that the field would feel the need to rebrand every couple of years.  You’d think that it would build on a single name, to drive home its transformative power.  Unless, maybe, it’s not all it claims to be?

Resetting the hype cycle

In a typical bubble—whether in the stock market, or the Dot-Com era—you see a large upswing and then a crash.  The upswing is businesses over-investing time, money, and effort in The New Thing. The crash happens when those same groups realize that The New Thing won’t ultimately help them, and they suddenly stop throwing money at it.

In finance terms, we’d say that the upswing represents a large and growing delta between the fundamental price (what The New Thing is actually worth) and the observed price (what people are spending on it, which is based on what they think it’s worth).  The ensuing crash represents a correction: a sharp, sudden reduction in that delta, as the observed price falls to something closer to the fundamental price.

Given that, we should have seen the initial Big Data hype bubble expand and then burst once businesses determined that this would only help a very small number of companies.  Big Data never crashed, though. Instead, we saw “data science” take off.  What’s weird is that companies were investing in roughly the same thing as before. It’s as though the rebranding was a way of laundering the data name, so that businesses and consumers could more easily forget that the previous version didn’t hold up to its claims.  This is the old “hair of the dog” hangover cure.

And it actually works.  Until it doesn’t.

Data success is not dead; it’s just unevenly distributed

This isn’t to say that data analysis has no value. The ability to explore massive amounts of data can be tremendously useful.  And lucrative.  Just not for everyone.

Too often, companies look to the FAANGs—Facebook, Amazon, Apple, Netflix, Google: the businesses that have clearly made a mint in data analysis—and figure they can copycat their way to the same success.  Reality’s harsh lesson is that it’s not so simple.  “Collect and analyze data” is just one ingredient of a successful data operation. You also need to connect those activities to your business model, and hand-waving over that part is only a temporary solution. At some point, you need to actually determine whether the fancy new thing can improve your business.  If not, it’s time to let it go.

We saw the same thing in the 1990s Dot-Com bust. The companies that genuinely needed developers and other in-house tech staff continued to need them; those that didn’t, well, they were able to save money by shedding jobs that weren’t providing business value.

Maybe data’s constant re-branding is the lesson learned from the 1990s? That if we keep re-branding, we can ride the misplaced optimism, and we’ll never hit that low point?

Why it matters

If the data world is able to sustain itself by simply changing its name every few years, what’s the big deal? Companies are making money, consumers are happy with claims of AI-driven products, and some people have managed to find very lucrative jobs.  Why worry about this now?

This quote from Cem Karsan, founder of Aegea Capital Management, sums it up well.  He’s talking about flows of money on Wall St. but the analogy applies just as well to the AI hype bubble:

If you’re on an airplane, and you’re 30,000 feet off the ground, that 30,000 feet off the ground is the valuation gap.  That’s where valuations are really high. But if those engines are firing, are you worried up in that plane about the valuations?  No!  You’re worried about the speed and trajectory of where you’re going, based on the engines.  […]  But, when all of the sudden, those engines go off, how far off the ground you are is all that matters.


—Cem Karsan, from Corey Hoffstein’s Flirting with Models podcast, S4E1 (2021/05/03), starting 37:30

Right now most of AI’s 30,000-foot altitude is hype. When the hype fades—when changing the name fails to keep the field aloft—that hype dissipates.  At that point you’ll have to sell based on what AI can really do, instead of a rosy, blurry picture of what might be possible.

This is when you might remind me of the old saying: “Make hay while the sun shines.”  I would agree, to a point.  So long as you’re able to cash out on the AI hype, even if that means renaming the field a few more times, go ahead.  But that’s a short-term plan.  Long-term survival in this game means knowing when that sun will set and planning accordingly.  How many more name-changes do we get?  How long before regulation and consumer privacy frustrations start to chip away at the façade?  How much longer will companies be able to paper over their AI-based systems’ mishaps?

Where to next?

If you’re building AI that’s all hype, then these questions may trouble you.  Post-bubble AI (or whatever we call it then) will be judged on meaningful characteristics and harsh realities: “Does this actually work?” and “Do the practitioners of this field create products and analyses that are genuinely useful?”  (For the investors in the crowd, this is akin to judging a company’s stock price on market fundamentals.)  Surviving long-term in this field will require that you find and build on realistic, worthwhile applications of AI.

Does our field need some time to sort that out?  I figure we have at least one more name change before we lose altitude.  We’ll need to use that time wisely, to become smarter about how we use and build around data.  We have to be ready to produce real value after the hype fades.

That’s easier said than done, but it’s far from impossible. We can start by shifting our focus to the basics, like reviewing our data and seeing whether it’s any good.  Accepting the uncomfortable truth that BI’s sums and groupings will help more businesses than AI’s neural networks. Evaluating the true total cost of AI, such that each six-figure data scientist salary is a proper business investment and not a very expensive lottery ticket.

We’ll also have to get better about folding AI into products (and understanding the risks in doing so), which will require building interdisciplinary, cognitively-diverse teams where everyone gets a chance to weigh in. Overall, then, we’ll have to educate ourselves and our customers on what data analysis can really achieve, and then plan our efforts accordingly.

We can do it. We’ll pretty much have to do it.  The question is: will we start before the plane loses altitude?

Categories: Technology

A Way Forward with Communal Computing

O'Reilly Radar - Tue, 2021/08/17 - 05:45

Communal devices in our homes and offices aren’t quite right. In previous articles, we discussed the history of communal computing and the origin of the single user model. Then we reviewed the problems that arise due to identity, privacy, security, experience, and ownership issues. They aren’t solvable by just making a quick fix. They require a huge reorientation in how these devices are framed and designed.

This article focuses on modeling the communal device you want to build and understanding how it fits into the larger context. This includes how it interoperates with the services it connects to, and how it communicates across boundaries with other devices in people’s homes. Ignore these warnings at your own peril: the people who live with the device can always unplug it and recycle it.

Let’s first talk about how we gain an understanding of the environment inside homes and offices.

Mapping the communal space

We have seen a long list of problems that keep communal computing from aligning with people’s needs. This misalignment arises from the assumption that there is a single relationship between a person and a device, rather than between all the people involved and their devices.

Dr. S.A. Applin has referred to this assumption as “design individualism”; it is a common misframing used by technology organizations. She uses this term most recently in the paper “Facebook’s Project Aria indicates problems for responsible innovation when broadly deploying AR and other pervasive technology in the Commons:”

“Unfortunately, this is not an uncommon assumption in technology companies, but is a flaw in conceptual modelling that can cause great problems when products based on this ‘design individualism’ are deployed into the Commons (Applin, 2016b). In short, Facebook acknowledges the plural of ‘people’, but sees them as individuals collectively, not as a collective that is enmeshed, intertwined and exists based on multiple, multiplex, social, technological, and socio-technological relationships as described through [PolySocial Reality].”

PolySocial Reality (PoSR) is a theory described in a series of papers by Applin and Fisher (2010-ongoing) on the following:

“[PoSR] models the outcomes when all entities in networks send both synchronous and asynchronous messages to maintain social relationships. These messages can be human-to-human, human-to-machine, and machine-to-machine. PoSR contains the entirety of all messages at all times between all entities, and we can use this idea to understand how various factors in the outcomes from the way that messages are sent and received, can impact our ability to communicate, collaborate, and most importantly, cooperate with each other.”

In the case of PoSR, we need to consider how agents make decisions about the messages between entities. The designers of these non-human entities will make decisions that impact all entities in a system.

The reality is that the “self” only exists as part of a larger network. It is the connections between us and the rest of the network that is meaningful. We pull all of the pseudo identities for those various connections together to create our “one” self.

The model that I’ve found most helpful to address this problem attempts to describe the complete environment of the communal space. It culminates in a map of the connections between nodes, or relationships between entities. This web of interactions includes all the individuals, the devices they use, and the services that intermediate them. The key is to understand how non-human entities intermediate the humans, and how those messages eventually make it to human actors.



The home is a network, like an ecosystem, of people, devices, and services all interacting to create an experience. It is also connected to services, people, and devices outside the home, such as my mom, my mom’s picture frame, and the Google services that enable it.

To see why this map is helpful, consider an ecosystem (or food web). When we only consider interactions between individual animals, like a wolf eating a sheep, we ignore how the changes in population of each animal impacts other actors in the web: too many wolves mean the sheep population dies off. In turn, this change has an impact on other elements of the ecosystem like how much the grass grows. Likewise, when we only consider a single person interacting with one device, we find that most interactions are simple: some input from the user is followed by a response from the device. We often don’t consider other people interacting with the device, nor do we consider how other personal devices exist within that space. We start to see these interactions when we consider other people in the communal space, the new communal device, and all other personal devices. In a communal map, these all interact.
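To make the communal map concrete, here’s a minimal sketch of such a graph in Python. The entities (a picture frame, a shared photo service) and the structure are illustrative assumptions, not a schema the article prescribes; the point is simply that people, devices, and services are all nodes, and that you can ask which non-human entities intermediate two people.

```python
# A minimal sketch of a "communal map": the home as a graph of people,
# devices, and the services that intermediate them. Entity names are
# illustrative examples only.

from collections import defaultdict

class CommunalMap:
    def __init__(self):
        self.kind = {}                 # entity name -> "person" | "device" | "service"
        self.edges = defaultdict(set)  # entity name -> entities it exchanges messages with

    def add_entity(self, name, kind):
        self.kind[name] = kind

    def connect(self, a, b):
        """Record that two entities exchange messages (in either direction)."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def intermediaries(self, person_a, person_b):
        """Non-human entities directly connected to both people."""
        shared = self.edges[person_a] & self.edges[person_b]
        return {e for e in shared if self.kind[e] != "person"}

home = CommunalMap()
for name, kind in [("me", "person"), ("partner", "person"), ("mom", "person"),
                   ("picture frame", "device"), ("smart speaker", "device"),
                   ("photo service", "service")]:
    home.add_entity(name, kind)

home.connect("me", "smart speaker")
home.connect("partner", "smart speaker")
home.connect("me", "photo service")
home.connect("mom", "photo service")
home.connect("picture frame", "photo service")
home.connect("mom", "picture frame")

# Which non-human entities intermediate me and my mom?
print(home.intermediaries("me", "mom"))   # {'photo service'}
```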

These ecosystems already exist within a home or office. They are made up of items ranging from refrigerator magnets for displaying physical pictures to a connected TV, and they include personal smartphones. The ecosystem extends to the services that the devices connect to outside the home, and to the other people whom they intermediate. We get an incomplete picture if we don’t consider the entire graph. Adding a new device isn’t about filling a specific gap in the ecosystem. The ecosystem may have many problems or challenges, but the ecosystem isn’t actively seeking to solve them. The new device needs to adapt and find its own niche. This includes making the ecosystem more beneficial to the device, something that evolutionary biologists call ‘niche expansion’. Technologists would think about this as building a need for their services.

Thinking about how a device creates a space within an already complex ecosystem is key to understanding what kinds of experiences the team building the device should create. It will help us do things like building for everyone and evolving with the space. It will also help us to avoid the things we should not do, like assuming that every device has to do everything.

Do’s and don’ts of building communal devices

With so much to consider when building communal devices, where do you start? Here are a few do’s and don’ts:

Do user research in the users’ own environment

Studying and understanding expectations and social norms is the key discovery task for building communal devices. Expectations and norms dictate the rules of the environment into which your device needs to fit, including people’s pseudo-identities, their expectations around privacy, and how willing they are to deal with the friction of added security. Just doing a survey isn’t enough.  Find people who are willing to let you see how they use these devices in their homes, and ask lots of questions about how they feel about the devices.

“If you are going to deal with social, people, communal, community, and general sociability, I would suggest hiring applied anthropologists and/or other social scientists on product teams. These experts will save you time and money, by providing you with more context and understanding of what you are making and its impact on others. This translates into more accurate and useful results.”

– Dr. S.A. Applin

Observing where the devices are placed and how the location’s use changes over time will give you fascinating insights about the context in which the device is used. A living room may be a children’s play area in the morning, a home office in the middle of the day, and a guest bedroom at night. People in these contexts have different sets of norms and privacy expectations.

As part of the user research, you should be building an ecosystem graph of all people present and the devices that they use. What people not present are intermediated by technology? Are there stories where this intermediation went wrong? Are there frictions that are created between people that your device should address? Are there frictions that the device should get out of the way of?

Do build for everyone who might have access

Don’t focus on the identity of the person who buys and sets up the device. You need to consider the identity (or lack) of everyone who could have access. Consider whether they feel that information collected about them violates their desire to control the information (as in Contextual Integrity). This could mean you need to put up walls to prevent users from doing something sensitive without authorization. Using the Zero Trust framework’s “trust engine” concept, you should ask for the appropriate level of authentication before proceeding.
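
As a rough illustration of the “trust engine” idea, the sketch below (plain Python, with invented signals, thresholds, and action names) shows how a device might map the sensitivity of a request and its uncertainty about who is asking to an authentication step. It is a sketch of the pattern, not any particular vendor’s implementation.

```python
from dataclasses import dataclass

@dataclass
class Request:
    action: str                  # e.g., "show_private_photo", "read_messages"
    sensitivity: float           # 0.0 (public) .. 1.0 (highly personal)
    identity_confidence: float   # how sure we are about who is asking, 0.0 .. 1.0
    others_present: bool         # sensed or inferred: is anyone else in the room?

def required_authentication(req: Request) -> str:
    """Map risk to an authentication step, Zero Trust style.
    The risk model and thresholds below are illustrative only."""
    risk = req.sensitivity * (1.0 - req.identity_confidence)
    if req.others_present:
        risk += 0.2                     # communal context raises the stakes
    if risk < 0.2:
        return "none"                   # proceed seamlessly
    if risk < 0.5:
        return "voice_match"            # lightweight re-identification
    return "pin_or_companion_app"       # deliberate friction before proceeding

print(required_authentication(
    Request("show_private_photo", sensitivity=0.9,
            identity_confidence=0.4, others_present=True)))  # pin_or_companion_app
```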

Most of today’s user experience design is focused on making frictionless or seamless experiences. This goal doesn’t make sense when considering a risk tradeoff. In some cases, adding friction increases the chance that a user won’t move forward with a risky action, which could be a good thing. If the potential risk of showing a private picture is high, you should make it harder to show that picture.

Realize you may not always understand the right context. Having good and safe default states for those cases is important. It is your job to adjust or simplify the model so that people can understand and interpret why the device does something.

Do consider pseudo-identities for individuals and groups

Avoid singular identities and focus on group pseudo-identities. If users don’t consider these devices their own, why not have the setup experience mirror those expectations? Build device setup, usage, and management around everyone who should have a say in the device’s operation.

Pseudo-identities become very interesting when you start to learn what certain behaviors mean for subgroups. Is this music being played for an individual with particular tastes? Or does the choice reflect a compromise between multiple people in the room? Should it avoid explicit language since there are children present?

Group norms and relationships need to be made more understandable, and it will take technological advances to make them visible. These advances include using machine learning to help the device understand what kind of content it is showing, and who (or what) is depicted in that content. Text, image, and video analysis is needed to answer the question: what type of content is this, and who appears in it? It also means using contextual prediction to consider who may be in the room, their relationship to the people in the content, and how they may feel about the content. When in doubt, restrict what you do.
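
A minimal sketch of that kind of gate might look like the following; the tags, audience labels, and rules are hypothetical, and a real system would learn them from context rather than hardcode them.

```python
def should_show(content_tags: set, audience: set) -> bool:
    """Illustrative gate combining content classification with who is believed
    to be present. Tags and audience labels are made up for this sketch."""
    if not audience:
        # We don't know who is in the room: when in doubt, restrict.
        return not ({"personal", "explicit"} & content_tags)
    if "children" in audience and "explicit" in content_tags:
        return False
    if "guests" in audience and "personal" in content_tags:
        return False
    return True

print(should_show({"photo", "personal"}, {"owner", "guests"}))  # False
print(should_show({"photo"}, {"owner"}))                        # True
```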

Do evolve with the space

As time goes on, life events will change the environment in which the device operates. Try to detect those changes and adapt accordingly. New pseudo-identities could be present, or the identity representing the group may shift. It is like moving into a new home. You may set things up in one way only to find months later there is a better configuration. Be aware of these changes and adapt.

If behavior that would be considered anomalous becomes the norm, something may have changed about the use of that space. Changes in use are usually led by a change in life–for example, someone moving in or out could trigger a change in how a device is used. Unplugging the device and moving it to a different part of the room, or to a different shelf, signals a new need for contextual understanding. If you detect a change in the environment but don’t know why the change was made, ask.

Do use behavioral data carefully, or don’t use it at all

All communal devices end up collecting data. For example, Spotify uses what you are listening to when building recommendation systems. When dealing with behavioral information, the group’s identity is important, not the individual’s. If you don’t know who is in front of the device, you should consider whether you can use that behavioral data at all. Rather than using an individual identity, you may want to default to the group pseudo-identity’s recommendations. What does the whole house usually like to listen to?



When the whole family is watching, how do we find common ground based on everyone’s preferences, rather than the owner’s? Spotify’s Premium Family plan includes a Family Mix, a shared playlist with recommendations based on everyone’s listening behavior, whereas Netflix requires users to choose between individual profiles.

Spotify has family and couple accounts that allow multiple people to have an account under one bill. Each person gets their own login and recommendations. Spotify gives all sub-accounts on the subscription access to a shared playlist (like a Family Mix) that makes recommendations based on the group’s preferences.

Spotify, and services like it, should go a step further and reduce the weight of a song in their recommendation algorithms when it is being played on a shared device in a communal place–a kitchen, for example. It’s impossible to know everyone who is in a communal space. There’s a strong chance that a song played in a kitchen isn’t preferred by anyone who lives there. Giving that particular song a lot of weight will start to change recommendations on the group members’ personal devices.
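
A sketch of that kind of down-weighting might look like the following, assuming a hypothetical taste profile keyed by track and a made-up table of context weights.

```python
from collections import Counter

# Hypothetical context weights: plays on shared devices in communal rooms should
# move a personal taste profile much less than plays on a personal device.
CONTEXT_WEIGHT = {
    ("personal_phone", "anywhere"): 1.0,
    ("smart_speaker", "kitchen"): 0.2,
    ("smart_speaker", "living_room"): 0.3,
}

def update_taste_profile(profile: Counter, plays: list) -> Counter:
    """Each play is (track, device_type, room). Down-weight communal contexts."""
    for track, device_type, room in plays:
        weight = CONTEXT_WEIGHT.get(
            (device_type, room),
            CONTEXT_WEIGHT.get((device_type, "anywhere"), 0.5))
        profile[track] += weight
    return profile

profile = update_taste_profile(Counter(), [
    ("song_a", "personal_phone", "anywhere"),
    ("song_b", "smart_speaker", "kitchen"),   # perhaps a guest's request
    ("song_b", "smart_speaker", "kitchen"),
])
print(profile)  # song_a carries more weight than two kitchen plays of song_b
```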

If you can’t use behavioral data appropriately, don’t bring it into a user’s profile on your services. You should probably not collect it at all until you can handle the many people who could be using the device. Edge processing can allow a device to build context that respects the many people and their pseudo-identities that are at play in a communal environment. Sometimes it is just safer to not track.

Don’t assume that automation will work in all contexts

Prediction technology helps communal devices by finding behavior patterns. These patterns allow the device to calculate what content should be displayed and how much trust to extend. If a student always listens to music after school while doing homework, the device can assume that contextual integrity holds if the student is the only person there. These assumptions get problematic when part of the context is no longer understood, like when the student has other classmates over. That’s when violations of norms or of privacy expectations are likely to occur. If other people are around, if different content is being requested, or if it is a different time of day, the device may not know enough to predict the correct information to display.

Amazon’s Alexa has started wading into these waters with its Hunches feature. If you say “good night” to Alexa, it can decide to turn off the lights. What happens if someone is quietly reading in the living room when the lights go out? We’ve all accidentally turned the lights out on a friend or partner, but such mistakes quickly become more serious when they’re made by an algorithm.

When the prediction algorithm’s confidence is low, it should disengage and try to learn the new behavior. In the worst case, just ask the user what is appropriate and gauge the trust-versus-risk tradeoff accordingly. The more unexpected the context, the less the system should presume. It should progressively restrict features until only the core remains: for home assistants, that may just mean displaying the current time.
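
One way to think about progressive restriction is as a simple mapping from prediction confidence to feature tiers. The thresholds and tiers below are illustrative only.

```python
def allowed_features(confidence: float) -> list:
    """The lower the confidence in the predicted context, the less the device
    should presume. Thresholds and feature tiers are made up for this sketch."""
    if confidence >= 0.8:
        return ["clock", "shared_calendar", "music", "personal_notifications",
                "automations"]          # e.g., Hunch-style actions
    if confidence >= 0.5:
        return ["clock", "shared_calendar", "music"]
    if confidence >= 0.3:
        return ["clock", "shared_calendar"]
    return ["clock"]                    # core fallback: just show the time

for c in (0.9, 0.6, 0.1):
    print(c, allowed_features(c))
```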

Don’t include all service functionality on the device

Every product team considers what to add next to make a device “fully functional,” reflecting everything the service could possibly do. For a communal device, you can’t just think about what you could put there; you also have to consider what you will never put there. An example could be allowing access to Gmail messages from a Google Home Hub. If it doesn’t make sense for most people to have access to some feature, it shouldn’t be there in the first place. It just creates clutter and makes the device harder to use. It is entirely appropriate to let people change personal preferences and deal with highly personal information on their own, private devices. There is a time and place for the appropriate content.

Amazon has considered whether Echo users should be allowed to complete a purchase, or be limited to just adding items to a shopping list. They have had to add four-digit codes and voice profiles. The resulting interface is complex enough to warrant a top-level help article on why people can’t make purchases.

If you have already built too much, think about how to sunset certain features so that the value and differentiator of your device is clearer. Full access to personal data doesn’t work in the communal experience. It is a chance for some unknown privacy violation to occur.

Don’t assume your devices will be the only ones

Never assume that your company’s devices will be the only ones in the space. Even for large companies like Amazon, there is no future in which the refrigerator, oven, and TV will all be Amazon devices (even if they are trying really hard). The communal space is built up over a long time, and devices like refrigerators have lifetimes that can span decades.

Think about how your device might work alongside other devices, including personal devices. To do this, you need to integrate with network services (e.g. Google Calendar) or local device services (e.g. an Amazon Ring video feed). This is the case for services within a communal space as well. People have different preferences for the services they use to communicate and entertain themselves. For example, Snapchat reaches roughly 90% of 13-24 year olds in the US, and that group accounts for about 70% of its usage. This means that people over 24 are using very different services to interact with their family and peers.

Apple’s iOS now requires apps to ask for permission before collecting information from other devices on a local network; it verifies that an app is allowed to access other devices on the network. Local network access isn’t a foregone conclusion either: routers and WiFi access points are increasingly managed by network providers.

Communal device manufacturers must build for interoperability between devices whether they like it or not, taking into account industry standards for communicating state, messaging, and more. A device that isn’t networked with the other devices in the home is much more likely to be replaced when its single, standalone use is no longer relevant.

Don’t change the terms without an ‘out’ for owners

Bricking a device because someone doesn’t want to pay for a subscription or doesn’t like the new data use policy is bad. Not only does it create distrust in users, but it also violates the idea that they purchased something for their home.

When you need to change terms, allow owners to make a decision about whether they want new functionality or to stop getting updates. Not having an active subscription is no excuse for a device to fail, since devices should be able to work when a home’s WiFi is down or when AWS has a problem that stops a home’s light bulbs from working. Baseline functionality should always be available, even if leading edge features (for example, features using machine learning) require a subscription. “Smart” or not, there should be no such thing as a light bulb that can’t be turned on.

When a company can no longer support a device–either because they’re sunsetting it or, in the worst case, because they are going out of business–they should consider how to allow people to keep using their devices. In some cases, a motivated community can take on the support; this happened with the Jibo community when the device creator shut down.

Don’t require personal mobile apps to use the device

One bad limitation that I’ve seen is requiring an app to be installed on the purchaser’s phone, and requiring the purchaser to be logged in to use the device. Identity and security aren’t always necessary, and being too strict about identity tethers the device to a particular person’s phone.

The Philips Hue smart light bulbs are a way to turn any light fixture into a component of a smart lighting system. However, you need one of Philips’s branded apps to control the bulbs. If you integrate your lighting system with your Amazon or Google account, you still need to know what the bulb or “zone” of your house is called. As a host, you end up either taking the action for someone else (say, by yelling at your Echo for them) or putting a piece of paper in the room with all of the instructions. We are back in the age of overly complicated instructions for turning on a TV and AV system.

In addition to making sure you can integrate with other touch and voice interfaces, you need to consider physical ways to allow anyone to interact. IoT power devices like the VeSync Smart Plug by Etekcity (I have a bunch around the house) have a physical button for manual switching, in addition to integrating with your smart home or using their branded apps. If you can’t operate the device manually when you are standing in front of it, is it really built for everyone in the home?

How do you know you got this right?

Once you have implemented all of the recommendations, how do you know you are on the right track?

A simple way to figure out whether you are building a communal-friendly device is to look for people adding their profiles to the device. This means linking their accounts to other services like Spotify (if you allow that kind of linking). However, not everyone will want to or be able to add their accounts, especially people who are passing through (guests) or who cannot legally consent (children).

Using behavior to detect whether someone else is using the device can be difficult. While people don’t change their taste in music or other interests quickly, they do slowly drift through the space of possible options. We seek things that are similar to what we like, but just different enough to be novel. In fact, most of our music tastes are set in our teenage years. Therefore, if a communal device is asked to play songs in a language or genre that never appears on the owner’s personal device, it’s more likely that someone new is listening than that the owner has suddenly learned a new language. Compare what users are doing on your device to their behavior on other platforms (for example, compare a Google Home Hub in the kitchen to a personal iPhone) to determine whether new users are accessing the platform.
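
One simple way to make that comparison is to measure how far apart the listening mixes on the two devices are. The sketch below uses total-variation distance over genre distributions; the threshold and the data are invented, and a production system would use much richer signals.

```python
from collections import Counter

def genre_distribution(plays: list) -> dict:
    """Turn a list of genre labels into a probability distribution."""
    if not plays:
        return {}
    counts = Counter(plays)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}

def likely_new_listener(communal_plays, personal_plays, threshold=0.5) -> bool:
    """Total-variation distance between genre mixes on the two devices.
    A large gap suggests someone other than the owner is choosing the music."""
    p, q = genre_distribution(communal_plays), genre_distribution(personal_plays)
    distance = 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))
    return distance > threshold

personal = ["indie", "indie", "folk", "indie", "folk"]
communal = ["kids", "kids", "pop", "kids"]
print(likely_new_listener(communal, personal))  # True: almost no overlap
```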

Behavioral patterns can also be used to predict demographic information. For example, you may be able to predict that someone is a parent based on their usage patterns. If this confidence is high, and you only see their interests showing up in the behavioral data, that means that other people who are around the device are not using it.

Don’t forget that you can ask the users themselves about who is likely to use the device. This is information that you can collect during initial setup. This can help ensure you are not making incorrect assumptions about the placement and use of the device.

Finally, consider talking with customers about how they use the device, the issues that come up, and how it fits into their lives. Qualitative user research doesn’t end after the initial design phase. You need to be aware of how the device has changed the environment it fits into. Without social scientists you can’t know this.

Is everything a communal experience?

Up until this point we have been talking about devices that are part of the infrastructure of a home, like a smart screen or light switch. Once we realize that technology serves as an intermediary between people, everything is communal.

Inside a home, roommates generally have to share expenses like utilities with each other. Companies like Klarna and Braid make finances communal. How you pay together is an important aspect of harmony within a home.

You are also part of communities in your neighborhood. Amazon Sidewalk extends your devices into the neighborhood you live in, and this mesh network maps and extends further with each communal space it reaches. Where does your home’s communal space end? If you misplaced your keys a block away, a Tile could help you find them. It could also identify people in your neighborhood without considering your neighbors’ privacy expectations.

Communities aren’t just based on proximity. We can extend the household to connect with other households far away. Amazon’s Drop In feature has created its own calling network between households. Loop, a new startup, is focused on building a device for connecting families in their own social network.

Google/Alphabet’s Sidewalk Labs has taken on projects that aim to make the connected world part of the cityscape. An early project called LinkNYC (owned through a shell corporation) put up digital signage that included free calling and USB hubs. This changed how homeless people used the built environment: when walking down the street you could see people’s smartphones dangling from a LinkNYC kiosk while they panhandled nearby. A later district-wide project, Sidewalk Toronto, was abandoned after sustained public opposition. Every object within the urban environment becomes something that not only collects data but could be interactive.


The town square and the public park have been built to be welcoming to people and to set expectations about what they do there, unlike online social media. New Public is taking cues from this type of physical shared space to reimagine the online public square.

Taking cues from the real world, groups like New Public are asking what would happen if we built social media the same way we build public spaces. What if social media followed the norms that we have in social spaces like the public parks or squares?

A key aspect of communal computing is the natural limitation of physical and temporal use. Only so many people can fit inside a kitchen or a meeting room. Only so many people can use a device at once, even if it is a subway ticket machine that serves millions of people per month. Only so many people can fit onto a sidewalk. We need to consider the way that space and time play a part in these experiences.

Adapt or be unplugged

Rethinking how people use devices together inside our homes, offices, and other spaces is key to the future of ubiquitous computing. We have a long way to go in understanding how context changes the expectations and norms of the people in those spaces. Without updating how we design and build these devices, the device you build will just be one more addition to the landfill.

To understand how devices are used in these spaces, we need to expand our thinking beyond the single owner and design for communal use from the start. If we don’t, the devices will never fit properly into our shared and intimate spaces. The mismatch between expectations and what is delivered will grow greater and lead to more dire problems.

This is a call for change in how we think about the devices integrated into our lives. We shouldn’t assume that because humans are adaptive, we can adapt to whatever technologies we build. We should design technologies to fit into our lives, making sure the devices understand the context in which they’re working.

The future of computing that is contextual is communal.

Thanks

Thanks to Adam Thomas, Mark McCoy, Hugo Bowne-Anderson, and Danny Nou for their thoughts and edits on the early draft of this. Also, Dr. S.A. Applin for all of the great work on PoSR. Finally, from O’Reilly, Mike Loukides for being a great editor and Susan Thompson for the art.

Categories: Technology

Defending against ransomware is all about the basics

O'Reilly Radar - Tue, 2021/08/10 - 05:18

The concept behind ransomware is simple. An attacker plants malware on your system that encrypts all the files, making your system useless, then offers to sell you the key you need to decrypt the files. Payment is usually in bitcoin (BTC), and the decryption key is deleted if you don’t pay within a certain period. Payments have typically been relatively small—though that’s obviously no longer true, with Colonial Pipeline’s multimillion-dollar payout.

Recently, ransomware attacks have been coupled with extortion: the malware sends valuable data (for example, a database of credit card numbers) back to the attacker, who then threatens to publish the data online if you don’t comply with the request.  

A survey on O’Reilly’s website1 showed that 6% of the respondents worked for organizations that were victims of ransomware attacks. How do you avoid joining them? We’ll have more to say about that, but the tl;dr is simple: pay attention to security basics. Strong passwords, two-factor authentication, defense in depth, staying on top of software updates, good backups, and the ability to restore from backups go a long way. Not only do they protect you from becoming a ransomware victim, but those basics can also help protect you from data theft, cryptojacking, and most other forms of cybercrime. The sad truth is that few organizations practice good security hygiene—and those that don’t end up paying the price.

But what about ransomware? Why is it such an issue, and how is it evolving? Historically, ransomware has been a relatively easy way to make money: set up operations in a country that’s not likely to investigate cybercrime, attack targets that are more likely to pay a ransom, keep the ransom small so it’s easier to pay than to restore from backup, and accept payment via some medium that’s perceived as anonymous. Like most things on the internet, ransomware’s advantage is scale: The WannaCry attack infected around 230,000 systems. If even a small percentage paid the US$300 ransom, that’s a lot of money.

Early on, attacks focused on small and midsize businesses, which often have limited IT staff and no professional security specialists. But more recently, hospitals, governments, and other organizations with valuable data have been attacked. A modern hospital can’t operate without patient data, so restoring systems is literally a matter of life and death. Most recently, we’ve seen attacks against large enterprises, like Colonial Pipeline. And this move toward bigger targets, with more valuable data, has been accompanied by larger ransoms.

Attackers have also gotten more sophisticated and specialized. They’ve set up help desks and customer service agents (much like any other company) to help customers make their payments and decrypt their data. Some criminal organizations offer “ransomware as a service,” running attacks for customers. Others develop the software or create the attacks that find victims. Initiating an attack doesn’t require any technical knowledge; it can all be contracted out, and the customer gets a nice dashboard to show the attack’s progress.

While it’s easy to believe (and probably correct) that government actors have gotten into the game, it’s important to keep in mind that attribution of an attack is very difficult—not least because of the number of actors involved. An “as a service” operator really doesn’t care who its clients are, and its clients may be (willingly) unaware of exactly what they’re buying. Plausible deniability is also a service.

How an attack begins

Ransomware attacks frequently start with phishing. An email to a victim entices them to open an attachment or to visit a website that installs malware. So the first thing you can do to prevent ransomware attacks is to make sure everyone is aware of phishing, very skeptical of any attachments they receive, and appropriately cautious about the websites they visit. Unfortunately, teaching people how to avoid being victimized by a phish is a battle you’re not likely to win. Phishes are getting increasingly sophisticated and now do a good job of impersonating people the victim knows. Spear phishing requires extensive research, and ransomware criminals have typically tried to compromise systems in bulk. But recently, we’ve been seeing attacks against more valuable victims. Larger, more valuable targets, with correspondingly bigger payouts, will merit the investment in research.

It’s also possible for an attack to start when a victim visits a legitimate but compromised website. In some cases, an attack can start without any action by the victim. Some ransomware (for example, WannaCry) can spread directly from computer to computer. One recent attack started through a supply chain compromise: attackers planted the ransomware in an enterprise security product, which was then distributed unwittingly to the product’s customers. Almost any vulnerability can be exploited to plant a ransomware payload on a victim’s device. Keeping browsers up-to-date helps to defend against compromised websites.

Most ransomware attacks begin on Windows systems or on mobile phones. This isn’t to imply that macOS, Linux, and other operating systems are less vulnerable; it’s just that other attack vectors are more common. We can guess at some reasons for this. Mobile phones move between different domains, as the owner goes from a coffee shop to home to the office, and are exposed to different networks with different risk factors. Although they are often used in risky territory, they’re rarely subject to the same device management that’s applied to “company” systems—but they’re often accorded the same level of trust. Therefore, it’s relatively easy for a phone to be compromised outside the office and then bring the attacker onto the corporate network when its owner returns to work.

It’s possible that Windows systems are common attack vectors just because there are so many of them, particularly in business environments. Many also believe that Windows users install updates less often than macOS and Linux users. Microsoft does a good job of patching vulnerabilities before they can be exploited, but that doesn’t do any good if updates aren’t installed. For example, Microsoft discovered and patched the vulnerability that WannaCry exploited well before the attacks began, but many individuals, and many companies, never installed the updates.

Preparations and precautions

The best defense against ransomware is to be prepared, starting with basic security hygiene. Frankly, this is true of any attack: get the basics right and you’ll have much less to worry about. If you’ve defended yourself against ransomware, you’ve done a lot to defend yourself against data theft, cryptojacking, and many other forms of cybercrime.

Security hygiene is simple in concept but hard in practice. It starts with passwords: Users must have nontrivial passwords. And they should never give their password to someone else, whether or not “someone else” is on staff (or claims to be).

Two-factor authentication (2FA), which requires something in addition to a password (for example, biometric authentication or a text message sent to a cell phone) is a must. Don’t just recommend 2FA; require it. Too many organizations buy and install the software but never require their staff to use it. (76% of the respondents to our survey said that their company used 2FA; 14% said they weren’t sure.)

Users should be aware of phishing and be extremely skeptical of email attachments that they weren’t expecting and websites that they didn’t plan to visit. It’s always a good practice to type URLs in yourself, rather than clicking on links in email—even those in messages that appear to be from friends or associates.

Backups are absolutely essential. But what’s even more important is the ability to restore from a backup. The easiest solution to ransomware is to reformat the disks and restore from backup. Unfortunately, few companies have good backups or the ability to restore from a backup—one security expert guesses that it’s as low as 10%. Here are a few key points:

  • You actually have to do the backups. (Many companies don’t.) Don’t rely solely on cloud storage; back up on physical drives that are disconnected when a backup isn’t in progress. (70% of our survey respondents said that their company performed backups regularly.)
  • You have to test the backups to ensure that you can restore the system. If you have a backup but can’t restore, you’re only pretending that you have a backup. (Only 48% of the respondents said that their company regularly practiced restoring from backups; 36% said they didn’t know.)
  • The backup device needs to be offline, connected only when a backup is in progress. Otherwise, it’s possible for the ransomware attack to encrypt your backup.

Don’t overlook testing your backups. Your business continuity planning should include ransomware scenarios: how do you continue doing business while systems are being restored? Chaos engineering, an approach developed at Netflix, is a good idea. Make a practice of breaking your storage capability, then restoring it from backup. Do this monthly—if possible, schedule it with the product and project management teams. Testing the ability to restore your production systems isn’t just about proving that everything works; it’s about training staff to react calmly in a crisis and resolve the outage efficiently. When something goes bad, you don’t want to be on Stack Overflow asking how to do a restore. You want that knowledge imprinted in everyone’s brains.

Keep operating systems and browsers up-to-date. Too many have become victims because of a vulnerability that was patched in a software update that they didn’t install. (79% of our survey respondents said that their company had processes for updating critical software, including browsers.)

An important principle in any kind of security is “least privilege.” No person or system should be authorized to do anything it doesn’t need to do. For example, no one outside of HR should have access to the employee database. “Of course,” you say—but that includes the CEO. No one outside of sales should have access to the customer database. And so on. Least privilege works for software too. Services need access to other services—but services must authenticate to each other and should only be able to make requests appropriate to their role. Any unexpected request should be rejected and treated as a signal that the software has been compromised. And least privilege works for hardware, whether virtual or physical: finance systems and servers shouldn’t be able to access HR systems, for example. Ideally, they should be on separate networks. You should have a “defense in depth” security strategy that focuses not only on keeping “bad guys” out of your network but also on limiting where they can go once they’re inside. You want to stop an attack that originates on HR systems from finding its way to the finance systems or some other part of the company. Particularly when you’re dealing with ransomware, making it difficult for an attack to propagate from one system to another is all-important.

Attribute-based access control (ABAC) can be seen as an extension of least privilege. ABAC is based on defining policies about exactly who and what should be allowed to access every service: What are the criteria on which trust should be based? And how do these criteria change over time? If a device suddenly moves between networks, does that represent a risk? If a system suddenly makes a request that it has never made before, has it been compromised? At what point should access to services be denied? ABAC, done right, is difficult and requires a lot of human involvement: looking at logs, deciding what kinds of access are appropriate, and keeping policies up-to-date as the situation changes. Working from home is an example of a major change that security people will need to take into account. You might have “trusted” an employee’s laptop, but should you trust it when it’s on the same network as their children? Some of this can be automated, but the bottom line is that you can’t automate security.
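
To make the idea concrete, here is a toy ABAC decision function. The attributes, rules, and outcomes are invented for illustration; real policies live in a policy engine, are far richer, and change as logs are reviewed.

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    subject: dict      # e.g., {"role": "analyst", "department": "hr"}
    resource: dict     # e.g., {"system": "employee_db", "classification": "restricted"}
    environment: dict  # e.g., {"network": "corp_vpn", "hour": 14}

def decide(req: AccessRequest) -> str:
    """Toy ABAC rules; a real policy engine evaluates many more attributes
    and its policies evolve as circumstances change."""
    if req.resource["system"] == "employee_db" and req.subject.get("department") != "hr":
        return "deny"              # least privilege: HR data stays in HR
    if req.environment.get("network") not in {"corp_vpn", "office"}:
        return "deny"              # unfamiliar network: don't extend trust
    if not 8 <= req.environment.get("hour", -1) <= 20:
        return "step_up_auth"      # unusual time: require re-authentication
    return "allow"

# Even the CEO is denied access to the employee database.
print(decide(AccessRequest(
    subject={"role": "ceo", "department": "executive"},
    resource={"system": "employee_db", "classification": "restricted"},
    environment={"network": "corp_vpn", "hour": 14})))   # deny
```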

Finally: detecting a ransomware attack isn’t difficult. If you think about it, this makes a lot of sense: encrypting all your files requires a lot of CPU and filesystem activity, and that’s a red flag. The way files change is also a giveaway. Most unencrypted files have low entropy: they have a high degree of order. (On the simplest level, you can glance at a text file and tell that it’s text. That’s because it has a certain kind of order. Other kinds of files are also ordered, though the order isn’t as apparent to a human.) Encrypted files have high entropy (i.e., they’re very disordered)—they have to be; otherwise, they’d be easy to decrypt. Computing a file’s entropy is simple and for these purposes doesn’t require looking at the entire file. Many security products for desktop and laptop systems are capable of detecting and stopping a ransomware attack. We don’t do product recommendations, but we do recommend that you research the products that are available. (PC Magazine’s 2021 review of ransomware detection products is a good place to start.)
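
For a sense of how simple the entropy check is, here is a short sketch that computes Shannon entropy in bits per byte over a sample of data. The sampling size and alerting threshold are illustrative; commercial products combine this signal with filesystem activity and other heuristics.

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of a byte string. Encrypted or compressed data scores
    close to 8.0; ordinary text usually lands well below that."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def looks_encrypted(path: str, sample_bytes: int = 1 << 20, threshold: float = 7.5) -> bool:
    """Sample only the start of the file; the 7.5 threshold is illustrative."""
    with open(path, "rb") as f:
        return entropy_bits_per_byte(f.read(sample_bytes)) > threshold

print(entropy_bits_per_byte(b"hello, this is ordinary low-entropy text " * 100))  # well below 8
print(entropy_bits_per_byte(os.urandom(1 << 16)))  # close to 8.0, like encrypted data
```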

In the data center or the cloud

Detecting ransomware once it has escaped into a data center, whether in the cloud or on-premises, isn’t a fundamentally different task, but commercial products aren’t there yet. Again, prevention is the best defense, and the best defense is strong on the fundamentals. Ransomware makes its way from a desktop to a data center via compromised credentials and operating systems that are unpatched and unprotected. We can’t say this too often: make sure secrets are protected, make sure identity and access management are configured correctly, make sure you have a backup strategy (and that the backups work), and make sure operating systems are patched—zero-trust is your friend.

Amazon Web Services, Microsoft Azure, and Google Cloud all have services named “Identity and Access Management” (IAM); the fact that they all converged on the same name tells you something about how important it is. These are the services that configure users, roles, and privileges, and they’re the key to protecting your cloud assets. IAM doesn’t have a reputation for being easy. Nevertheless, it’s something you have to get right; misconfigured IAM is at the root of many cloud vulnerabilities. One report claims that well over 50% of the organizations using Google Cloud were running workloads with administrator privileges. While that report singles out Google, we believe that the same is true at other cloud providers. All of these workloads are at risk; administrator privileges should only be used for essential management tasks. Google Cloud, AWS, Azure, and the other providers give you the tools you need to secure your workloads, but they can’t force you to use them correctly.
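
On AWS, one small piece of that audit, finding which IAM users, groups, and roles have the managed AdministratorAccess policy attached, might look like the sketch below using boto3. It assumes credentials allowed to call iam:ListEntitiesForPolicy, and it is only a starting point: it doesn’t cover inline policies or permissions granted through other policies, and Azure and Google Cloud have equivalent checks.

```python
import boto3

# Audit sketch: list the IAM users, groups, and roles with the AWS-managed
# AdministratorAccess policy attached. Anything in this list should be
# justified; routine workloads should not be here.
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"

iam = boto3.client("iam")
paginator = iam.get_paginator("list_entities_for_policy")

for page in paginator.paginate(PolicyArn=ADMIN_POLICY_ARN):
    for user in page["PolicyUsers"]:
        print("user with admin access:", user["UserName"])
    for group in page["PolicyGroups"]:
        print("group with admin access:", group["GroupName"])
    for role in page["PolicyRoles"]:
        print("role with admin access:", role["RoleName"])
```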

It’s worth asking your cloud vendor some hard questions. Specifically, what kind of support can your vendor give you if you are a victim of a security breach? What can your vendor do if you lose control of your applications because IAM has been misconfigured? What can your vendor do to restore your data if you succumb to ransomware? Don’t assume that everything in the cloud is “backed up” just because it’s in the cloud. AWS and Azure offer backup services; Google Cloud offers backup services for SQL databases but doesn’t appear to offer anything comprehensive. Whatever your solution, don’t just assume it works. Make sure that your backups can’t be accessed via the normal paths for accessing your services—that’s the cloud version of “leave your physical backup drives disconnected when not in use.” You don’t want an attacker to find your cloud backups and encrypt them too. And finally, test your backups and practice restoring your data.

Any frameworks your IT group has in place for observability will be a big help: Abnormal file activity is always suspicious. Databases that suddenly change in unexpected ways are suspicious. So are services (whether “micro” or “macroscopic”) that suddenly start to fail. If you have built observability into your systems, you’re at least partway there.

How confident are you that you can defend against a ransomware attack? In our survey, 60% of the respondents said that they were confident; another 28% said “maybe,” and 12% said “no.” We’d give our respondents good, but not great, marks on readiness (2FA, software updates, and backups). And we’d caution that confidence is good but overconfidence can be fatal. Make sure that your defenses are in place and that those defenses work.

If you become a victim

What do you do? Many organizations just pay. (Ransomwhe.re tracks total payments to ransomware sites, currently estimated at $92,120,383.83.) The FBI says that you shouldn’t pay, but if you don’t have the ability to restore your systems from backups, you might not have an alternative. Although the FBI was able to recover the ransom paid by Colonial Pipeline, I don’t think there’s any case in which they’ve been able to recover decryption keys.

Whether paying the ransom is a good option depends on how much you trust the cybercriminals responsible for the attack. The common wisdom is that ransomware attackers are trustworthy, that they’ll give you the key you need to decrypt your data and even help you use it correctly. If the word gets out that they can’t be trusted to restore your systems, they’ll find fewer victims willing to pay up. However, at least one security vendor says that 40% of ransomware victims who pay never get their files restored. That’s a very big “however,” and a very big risk—especially as ransomware demands skyrocket. Criminals are, after all, criminals. It’s all the more reason to have good backups.

There’s another reason not to pay that may be more important. Ransomware is a big business, and like any business, it will continue to exist as long as it’s profitable. Paying your attackers might be an easy solution short-term, but you’re just setting up the next victim. We need to protect each other, and the best way to do that is to make ransomware less profitable.

Another problem that victims face is extortion. If the attackers steal your data in addition to encrypting it, they can demand money not to publish your confidential data online—which may leave you with substantial penalties for exposing private data under laws such as GDPR and CCPA. This secondary attack is becoming increasingly common.

Whether or not they pay, ransomware victims frequently face revictimization because they never fix the vulnerability that allowed the ransomware in the first place. So they pay the ransom, and a few months later, they’re attacked again, using the same vulnerability. The attack may come from the same people or it may come from someone else. Like any other business, an attacker wants to maximize its profits, and that might mean selling the information they used to compromise your systems to other ransomware outfits. If you become a victim, take that as a very serious warning. Don’t think that the story is over when you’ve restored your systems.

Here’s the bottom line, whether or not you pay. If you become a victim of ransomware, figure out how the ransomware got in and plug those holes. We began this article by talking about basic security practices. Keep your software up-to-date. Use two-factor authentication. Implement defense in depth wherever possible. Design zero-trust into your applications. And above all, get serious about backups and practice restoring from backup regularly. You don’t want to become a victim again.

Thanks to John Viega, Dean Bushmiller, Ronald Eddings, and Matthew Kirk for their help. Any errors or misunderstandings are, of course, mine.

Footnote
  1. The survey ran July 21, 2021, through July 23, 2021, and received more than 700 responses.
Categories: Technology

Radar trends to watch: August 2021

O'Reilly Radar - Mon, 2021/08/02 - 07:27

Security continues to be in the news: most notably the Kaseya ransomware attack, which was the first case of a supply chain ransomware attack that we’re aware of. That’s new and very dangerous territory. However, the biggest problem in security remains simple: take care of the basics. Good practices for authentication, backups, and software updates are the best defense against ransomware and many other attacks.

Facebook has said that it is now focusing on building the virtual reality Metaverse, which will be the successor to the web. To succeed, VR will have to get beyond ultra geeky goggles. But Google Glass showed the way, and that path is being followed by Apple and Facebook in their product development.

AI and Data
  • There’s a new technique for protecting natural language systems from attack by misinformation and malware bots: using honeypots to capture attackers’ key phrases proactively, and incorporate defenses into the training process.
  • DeepMind’s AlphaFold has made major breakthroughs in protein folding. DeepMind has released the source code for AlphaFold 2 on GitHub and will also release the structure of every known protein. The database currently includes over 350,000 protein structures, but it is expected to grow to over 100,000,000. This is of immense importance to research in biology and medicine.
  • Google searches can now tell you why a given result was included. It’s a minor change, but we’ve long argued that in AI, “why” may give you more information than “what.”
  • Researchers have been able to synthesize speech using the brainwaves of a patient who has been paralyzed and unable to talk. The process combines brain wave detection with models that predict the next word.
  • The National Institute of Standards and Technology (NIST) has tested systems for identifying airline passengers at flight boarding. NIST reports 99.87% accuracy, without significant differences in performance between demographic groups.
  • Researchers have attempted to add imagination to AI by combining different attributes of known objects. Humans are good at this: we can imagine a green dog, for example.
  • Phase precession is a recently discovered phenomenon by which neurons encode information in the timing of their firing.  It may relate to humans’ ability to learn on the basis of a small number of examples.
  • Yoshua Bengio, Geoff Hinton, and Yann LeCun give an assessment of the state of Deep Learning, its future, and its ability to solve problems.
  • AI is learning to predict human behavior from videos (e.g., movies). This research attempts to answer the question “What will someone do next?” in situations where there are large uncertainties. One trick is reverting to high-level concepts (e.g., “greet”) when the system can’t predict more specific behaviors (e.g., “shake hands”).
Programming
  • JAX is a new Python library for high-performance mathematics. It includes a just-in-time compiler, support for GPUs and TPUs, automatic differentiation, and automatic vectorization and parallelization.
  • Matrix is an open standard for a decentralized “conversation store” that is used as the background for many other kinds of applications. Germany has announced that it will use Matrix as the standard for digital messaging in its national electronic health records system.
  • Brython is Python 3.9.5 running in the browser, with access to the DOM.  It’s not a replacement for JavaScript, but there are a lot of clever things you can do with it.
  • Using a terminal well has always been a superpower. Warp is a new terminal emulator built in Rust with features that you’d never expect: command sharing, long-term cloud-based history, a true text editor, and a lot more.
  • Is it WebAssembly’s time? Probably not yet, but it’s coming. Krustlets allow you to run WebAssembly workloads under Kubernetes. There is also an alternative to a filesystem written in wasm; JupyterLite is an attempt to build a complete distribution of Jupyter, including JupyterLab, that runs entirely in the browser.
Robotics
  • Google launches Intrinsic, a moonshot project to develop industrial robots.
  • 21st Century Problems: should autonomous delivery robots be allowed in bike lanes? The Austin (Texas) City Council already has to consider this issue.
Materials
  • Veins in materials? Researchers have greatly reduced the time it takes to build vascular systems into materials, which could have an important impact on our ability to build self-healing structures.
  • Researchers have designed fabrics that can cool the body by up to 5 degrees Celsius by absorbing heat and re-emitting it in the near-infrared range.
Hardware
  • A bendable processor from ARM could be the future of wearable computing. It’s far from a state-of-the-art CPU, and probably will never be one, but with further development could be useful in edge applications that require flexibility.
  • Google experiments with error correction for quantum computers.  Developing error correction is a necessary step towards making quantum computers “real.”
Security
  • Attackers have learned to scan repos like GitHub to find private keys and other credentials that have inadvertently been left in code that has been checked in. Checkov, a code analysis tool for detecting vulnerabilities in cloud infrastructure, can now find these credentials in code.
  • Amnesty International has released an open source tool for checking whether a phone has been compromised by Pegasus, the spyware sold by the NSO group to many governments, and used (among other things) to track journalists. Matthew Green’s perspective on “security nihilism” discusses the NSO’s activity; it is a must-read.
  • The REvil ransomware gang (among other things, responsible for the Kaseya attack, which infected over 1,000 businesses) has disappeared; all of its web sites went down at the same time. Nobody knows why; possibilities include pressure from law enforcement, reorganization, and even retirement.
  • DID is a newly proposed form of decentralized digital identity. It is currently being tested in the travel passes with COVID data that are being developed by the International Air Transport Association.
  • A massive ransomware attack by the REvil cybercrime group exploited supply chain vulnerabilities. The payload was implanted in a security product by Kaseya that is used to automate software installation and updates. The attack apparently only affects on-premises infrastructure. Victims are worldwide; the number of victims is in the “low thousands.”
  • Kubernetes is being used by the FancyBear cybercrime group, and other groups associated with the Russian government, to orchestrate a worldwide wave of brute-force attacks aimed at data theft and credential stealing.
Operations
  • Observability is the next step beyond monitoring.  That applies to data and machine learning, too, and is part of incorporating ML into production processes.
  • A new load balancing algorithm does a much better job of managing load at datacenters, and reduces power consumption by allowing servers to be shut down when not in use.
  • MicroK8S is a version of Kubernetes designed for small clusters that claims to be fault tolerant and self-healing, requiring little administration.
  • Calico is a Kubernetes plugin that simplifies network configuration. 
Web and Mobile
  • Scuttlebutt is a protocol for the decentralized web that’s “a way out of the social media rat race.”  It’s (by definition) “sometimes on,” not a constant presence.
  • Storywrangler is a tool for analyzing Twitter at scale.  It picks out the most popular word combinations in a large number of languages.
  • Google is adding support for “COVID vaccination passports” to Android devices.
  • Tim Berners-Lee’s Solid protocol appears to be getting real, with a small ecosystem of pod providers (online data stores) and apps.
  • Why are Apple and Google interested in autonomous vehicles? What’s the business model? They are after the last few minutes of attention. If you aren’t driving, you’ll be in an app.
Virtual Reality
  • Mark Zuckerberg has been talking up the Metaverse as the next stage in the Internet’s evolution: a replacement for the Web as an AR/VR world. But who will want to live in Facebook’s world?
  • Facebook is committing to the OpenXR standard for its Virtual Reality products. In August 2022, all new applications will be required to use OpenXR; its proprietary APIs will be deprecated.
Miscellaneous
  • The Open Voice Network is an industry association organized by the Linux Foundation that is dedicated to ethics in voice-driven applications. Their goal is to close the “trust gap” in voice applications.
Categories: Technology

Communal Computing’s Many Problems

O'Reilly Radar - Tue, 2021/07/20 - 04:37

In the first article of this series, we discussed communal computing devices and the problems they create–or, more precisely, the problems that arise because we don’t really understand what “communal” means. Communal devices are intended to be used by groups of people in homes and offices. Examples include popular home assistants and smart displays like the Amazon Echo, Google Home, Apple HomePod, and many others.  If we don’t create these devices with communities of people in mind, we will continue to build the wrong ones.

Ever since the concept of a “user” was invented (which was probably later than you think), we’ve assumed that devices are “owned” by a single user. Someone buys the device and sets up the account; it’s their device, their account.  When we’re building shared devices with a user model, that model quickly runs into limitations. What happens when you want your home assistant to play music for a dinner party, but your preferences have been skewed by your children’s listening habits? We, as users, have certain expectations for what a device should do. But we, as technologists, have typically ignored our own expectations when designing and building those devices.

This expectation isn’t a new one either. The telephone in the kitchen was for everyone’s use. After the release of the iPad in 2010, Craig Hockenberry discussed the great value of communal computing but also the concerns:

“When you pass it around, you’re giving everyone who touches it the opportunity to mess with your private life, whether intentionally or not. That makes me uneasy.”

Communal computing requires a new mindset that takes into account users’ expectations. If the devices aren’t designed with those expectations in mind, they’re destined for the landfill. Users will eventually experience “weirdness” and “annoyance” that grows to distrust of the device itself. As technologists, we often call these weirdnesses “edge cases.” That’s precisely where we’re wrong: they’re not edge cases, but they’re at the core of how people want to use these devices.

In the first article, we listed five core questions we should ask about communal devices:

  1. Identity: Do we know all of the people who are using the device?
  2. Privacy: Are we exposing (or hiding) the right content for all of the people with access?
  3. Security: Are we allowing all of the people using the device to do or see what they should and are we protecting the content from people that shouldn’t?
  4. Experience: What is the contextually appropriate display or next action?
  5. Ownership: Who owns all of the data and services attached to the device that multiple people are using?

In this article, we’ll take a deeper look at these questions, to see how the problems manifest and how to understand them.

Identity

All of the problems we’ve listed start with the idea that there is one registered and known person who should use the device. That model doesn’t fit reality: the identity of a communal device isn’t a single person, but everyone who can interact with it. This could be anyone able to tap the screen, make a voice command, use a remote, or simply be sensed by it. To understand this communal model and the problems it poses, start with the person who buys and sets up the device. It is associated with that individual’s account, like a personal Amazon account with its order history and shopping list. Then it gets difficult. Who doesn’t, can’t, or shouldn’t have full access to an Amazon account? Do you want everyone who comes into your house to be able to add something to your shopping list?

If you think about the spectrum of people who could be in your house, they range from people whom you trust, to people whom you don’t really trust but who should be there, to those whom you shouldn’t trust at all.


There is a spectrum of trust for people who have access to communal devices

In addition to individuals, we need to consider the groups that each person could be part of. These group memberships are called “pseudo-identities”; they are facets of a person’s full identity. They are usually defined by how a person associates themselves with a group of other people. My life at work, at home, with a group of high school friends, and as a sports fan shows different parts of my identity. When I’m with other people who share the same pseudo-identity, we can share information. When there are people from one group in front of a device, I may avoid showing content that is associated with another group (or another personal pseudo-identity). This can sound abstract, but it isn’t; if you’re with friends in a sports bar, you probably want notifications about the teams you follow. You probably don’t want news about work, unless it’s an emergency.

There are important reasons why we show a particular facet of our identity in a particular context. When designing an experience, you need to consider the identity context and where the experience will take place. Most recently this has come up with work from home. Many people talk about ‘bringing your whole self to work,’ but don’t realize that “your whole self” isn’t always appropriate. Remote work changes when and where I should interact with work. For a smart screen in my kitchen, it is appropriate to have content that is related to my home and family. Is it appropriate to have all of my work notifications and meetings there? Could it be a problem for children to have the ability to join my work calls? What does my IT group require as far as security of work devices versus personal home devices?

With these devices we may need to switch to a different pseudo-identity to get something done. I may need to be reminded of a work meeting. When I get a notification from a close friend, I need to decide whether it is appropriate to respond based on the other people around me.

The pandemic has broken down the barriers between home and work. The natural context switch from being at work and worrying about work things and then going home to worry about home things is no longer the case. People need to make a conscious effort to “turn off work” and to change the context. Just because it is the middle of the workday doesn’t always mean I want to be bothered by work. I may want to change contexts to take a break. Such context shifts add nuance to the way the current pseudo-identity should be considered, and to the overarching context you need to detect.

Next, we need to consider identities as groups that I belong to. I'm part of my family, and my family might want to talk with other families. I live in a house on my street, alongside my neighbors. I'm part of an organization that I identify as my work. These are all pseudo-identities we should consider, based on where the device is placed and in relation to other, equally important identities.

The crux of the problem with communal devices is the multiple identities that are or may be using the device. This requires greater understanding of who, where, and why people are using the device. We need to consider the types of groups that are part of the home and office.

Privacy

As we consider the identities of all the people with access to the device, and the identity of the place the device is part of, we also need to consider the privacy expectations people have in the context in which the device is used.

Privacy is hard to understand. The framework I've found most helpful is Contextual Integrity, which was introduced by Helen Nissenbaum in the book Privacy in Context. Contextual Integrity describes four key aspects of privacy:

  1. Privacy is provided by appropriate flows of information.
  2. Appropriate information flows are those that conform to contextual information norms.
  3. Contextual informational norms refer to five independent parameters: data subject, sender, recipient, information type, and transmission principle.
  4. Conceptions of privacy are based on ethical concerns that evolve over time.

What is most important about Contextual Integrity is that privacy is not about hiding information away from the public but giving people a way to control the flow of their own information. The context in which information is shared determines what is appropriate.

Whether a flow feels appropriate or not depends on key characteristics of the information (from Wikipedia):

  1. The data subject: Who or what is this about?
  2. The sender of the data: Who is sending it?
  3. The recipient of the data: Who will eventually see or get the data?
  4. The information type: What type of information is this (e.g. a photo, text)?
  5. The transmission principle: In what set of norms is this being shared (e.g. school, medical, personal communication)?
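
To see how these parameters might be used in software, here is a minimal sketch of a device representing an information flow by its five parameters and checking it against a household's norms. This is my own illustration, not something from Nissenbaum's work; the class, field names, and norms are all invented.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class InformationFlow:
        """One flow of information, described by the five parameters of Contextual Integrity."""
        data_subject: str            # who or what the information is about
        sender: str                  # who is sending it
        recipient: str               # who will eventually see or get it
        information_type: str        # e.g. "photo", "calendar", "message"
        transmission_principle: str  # the norms under which it is shared, e.g. "family", "work"

    # Invented norms: the flows this household has decided are appropriate.
    APPROPRIATE_FLOWS = {
        InformationFlow("me", "photo_app", "family", "photo", "family"),
        InformationFlow("me", "calendar", "me", "calendar", "work"),
    }

    def is_appropriate(flow: InformationFlow) -> bool:
        """A flow is appropriate only if it conforms to an existing contextual norm."""
        return flow in APPROPRIATE_FLOWS

    # Changing a single parameter (recipient: family -> dinner guests) can turn an
    # acceptable flow into a privacy violation.
    print(is_appropriate(InformationFlow("me", "photo_app", "family", "photo", "family")))
    print(is_appropriate(InformationFlow("me", "photo_app", "dinner_guests", "photo", "family")))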

We rarely acknowledge how a subtle change in one of these parameters could be a violation of privacy. It may be completely acceptable for my friend to have a weird photo of me, but once it gets posted on a company intranet site it violates how I want information (a photo) to flow. The recipient of the data has changed to something I no longer find acceptable. But I might not care whether a complete stranger (like a burglar) sees the photo, as long as it never gets back to someone I know.

For communal use cases, the sender or receiver of information is often a group. There may be multiple people in the room during a video call, not just the person you are calling. People can walk in and out. I might be happy with some people in my home seeing a particular photo, but find it embarrassing if it is shown to guests at a dinner party.

We must also consider what happens when other people’s content is shown to those who shouldn’t see it. This content could be photos or notifications from people outside the communal space that could be seen by anyone in front of the device. Smartphones can hide message contents when you aren’t near your phone for this exact reason.

The services themselves can expand the "receivers" of information in ways that create uncomfortable situations. In Privacy in Context, Nissenbaum talks about the privacy implications of Google Street View, which places photos of people's houses on Google Maps. When a house was only visible to people who walked down the street, that was one thing; when anyone in the world can access a picture of it, the parameters change in a way that causes concern. More recently, IBM used Flickr photos that were shared under a Creative Commons license to train facial recognition algorithms. While this didn't require any change to the terms of service, it was a surprise to people and may have violated the Creative Commons license. In the end, IBM took the dataset down.

Privacy considerations for communal devices should focus on who is gaining access to information and whether that access is appropriate given people's expectations. Without a framework like Contextual Integrity, we will be stuck talking about generalized rules for data sharing, and there will always be edge cases that violate someone's privacy.

A note about children

Children make identity and privacy especially tricky. About 40% of all households have a child, and children shouldn't be an afterthought: if you aren't compliant with local laws, you can get in a lot of trouble. In 2019, YouTube had to settle with the FTC for a $170 million fine for selling ads targeting children. It gets complicated because the "age of consent" depends on the region: COPPA in the US applies to people under 13, CCPA in California applies to people under 16, and GDPR sets the threshold at 16 by default but lets each member state set its own. The moment you acknowledge that children are using your platforms, you need to accommodate them.

For communal devices, there are many use cases involving children. Once they realize they can play whatever music they want (including tracks of fart sounds) on a shared device, they will do it. Children focus on exploration over the task at hand and will end up discovering far more about the device than their parents might. Adjusting your practices after building a device is a recipe for failure: you will find that the paradigms you chose for other users don't align with children's expectations, and modifying your software after the fact to accommodate children is difficult or impossible. It's important to account for children from the beginning.

Security

To get to a home assistant, you usually need to pass through a home's outer door. There is a physical limitation by way of a lock, and there may be alarm systems. Finally, there are social norms: you don't just walk into someone else's house without knocking or being invited.

Once you are past all of these locks, alarms, and norms, anyone can access the communal device. Few things within a home are restricted (perhaps a safe with important documents). When a communal device requires authentication, the requirement is usually subverted in some way for convenience: a password might be taped to it, or a password may never have been set.

The concept of Zero Trust Networks speaks to this problem. It comes down to a key question: is the risk associated with an action greater than the trust we have that the person performing the action is who they say they are?


Source: https://learning.oreilly.com/library/view/zero-trust-networks/9781491962183/

Passwords, passcodes, and mobile device authentication become nuisances; these supposed secrets are frequently shared among everyone who has access to the device. Passwords might be written down for people who can't remember them, making them visible to less trusted people visiting your household. Have we not learned anything since the movie WarGames?

When we consider the risk associated with an action, we need to understand its privacy implications. Would the action expose someone's information without their knowledge? Would it allow a person to pretend to be someone else? Could another party easily tell that the device was being used by an imposter?

There is a tradeoff between trust and risk. The device needs to weigh how confident it is about who the person is, and whether that person wants the information shown, against the potential risk or harm if an inappropriate person is in front of the device.


Having someone in your home accidentally share embarrassing photos could have social implications.

A few examples of this tradeoff:

  • Feature: Showing a photo when the device detects someone in the room. Risk and trust calculation: photo content sensitivity; who is in the room. Possible issue: showing an inappropriate photo to a complete stranger.
  • Feature: Starting a video call. Risk and trust calculation: whose account is being used for the call; who is actually starting it. Possible issue: when the other side picks up, it may not be who they thought it would be.
  • Feature: Playing a personal song playlist. Risk and trust calculation: impact on personal recommendations. Possible issue: incorrect future recommendations.
  • Feature: Automatically ordering something based on a voice command. Risk and trust calculation: convenience of ordering; approval of the shopping account's owner. Possible issue: shipping an item that shouldn't have been ordered.
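
One way to read these examples is as a single decision rule: act only when confidence about who is present outweighs the harm of being wrong. The sketch below is purely illustrative; the scores, threshold, and function name are invented, and a real system would have to learn or elicit these values rather than hard-code them.

    def should_proceed(trust_in_identity: float, potential_harm: float, margin: float = 0.1) -> bool:
        """Act only when confidence about who is present outweighs the harm of being wrong.

        trust_in_identity: 0.0 (no idea who is present) to 1.0 (certain)
        potential_harm:    0.0 (harmless) to 1.0 (severe, e.g. sharing private photos or placing orders)
        margin:            extra caution required before acting
        """
        return trust_in_identity - potential_harm >= margin

    # Showing the weather to whoever walks by: low harm, so little trust is needed.
    print(should_proceed(trust_in_identity=0.3, potential_harm=0.05))   # True
    # Showing a sensitive photo when the face isn't recognized: high harm, so don't.
    print(should_proceed(trust_in_identity=0.3, potential_harm=0.8))    # False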

This gets even trickier when people no longer in the home can access the devices remotely. There have been cases of harassment, intimidation, and domestic abuse by people whose access should have been revoked: for example, an ex-partner turning off the heating system. When should someone be able to access communal devices remotely? When should their access be controllable from the devices themselves? How should people be reminded to update their access control lists? How does basic security maintenance happen inside a communal space?

See how much work this takes in a recent account of pro bono security work for a harassed mother and her son. Or how a YouTuber was blackmailed, surveilled, and harassed by her smart home. Apple even has a manual for this type of situation.

At home, where there's no corporate IT group to create policies and automation to keep things secure, it's next to impossible to manage all of these security issues; even some corporations have trouble with it. We need to figure out how users will maintain and configure a communal device over time. Devices in the home and office must accommodate many different needs that change over their lifetimes.

For example, what happens when someone leaves the home and is no longer part of it? We will need to remove their access, and may even find it necessary to block them from certain services. This is highlighted by cases of harassment by ex-partners who still control the communal devices. Ongoing maintenance of a device could also be triggered by a change in the community's needs. A home device may at first be used just to play music or check the weather; when a new baby comes home, video calling with close relatives may become a higher priority.

End users are usually very bad at changing a configuration after it is set, and they may not even know that they can configure something in the first place. This is why people have made a business out of setting up home stereo and video systems: people just don't understand the technologies they are putting in their houses. Does that mean we need some type of handyperson who does home device setup and management? And when more complicated routines are needed, how does someone make changes without writing code, if they are allowed to make changes at all?

Communal devices need new paradigms of security that go beyond the standard login. The world inside a home is protected by a barrier like a locked door; the capabilities of communal devices should respect that. This means both removing friction in some cases and increasing it in others.

A note about biometrics

"Turn your face" to enroll in Google Face Match and personalize your devices. (Source: Google Face Match video, https://youtu.be/ODy_xJHW6CI?t=26)

Biometric authentication for voice and face recognition can help us get a better understanding of who is using a device. Examples of biometric authentication include FaceID for the iPhone and voice profiles for Amazon Alexa. There is a push for regulation of facial recognition technologies, but opt-in for authentication purposes tends to be carved out.

However, biometrics aren't without problems. In addition to issues with skin tone, gender bias, and local accents, biometric authentication assumes that everyone is willing to have a biometric profile on the device, and that they are legally allowed to have one (for example, children may not be able to consent to a biometric profile). It also assumes the technology is secure. Google Face Match makes it very clear that it is only a technology for personalization, not authentication. I can only guess they have legalese to avoid liability when an unauthorized person spoofs someone's face, say by taking a photo off the wall and showing it to the device.

What do we mean by "personalization"? When you walk into a room and Face Match identifies you, the Google Home Hub dings, shows your face icon, then shows your calendar (if it is connected) and a feed of personalized cards. Apple's FaceID uses many levels of presentation attack detection (also known as "anti-spoofing"): it verifies that your eyes are open and that you are looking at the screen, and it uses a depth sensor to make sure it isn't "seeing" a photo. The phone can then show hidden notification content or open to the home screen. This weighing of trust and risk benefits from understanding who could be in front of the device. We can't forget that the machine learning doing the biometrics is not a deterministic calculation; there is always some degree of uncertainty.

Social and information norms define what we consider acceptable, who we trust, and how much we trust them. As trust goes up, we can take more risks in the way we handle information. However, it's difficult to connect trust with risk without understanding people's expectations. I have access to my partner's iPhone and know the passcode, but it would violate a norm if I walked over and unlocked it without being asked, and doing so would reduce the trust between us.

As we can see, biometrics offers some benefits, but it won't be a panacea for the unique uses of communal devices. Biometrics will give those willing to opt in to the collection of their biometric profile personalized access with low friction, but it will never be usable by everyone with physical access.

Experiences

People use a communal device for short experiences (checking the weather), ambient experiences (listening to music or glancing at a photo), and joint experiences (multiple people watching a movie). The device needs to be aware of norms within the space and between the multiple people in the space. Social norms are rules by which people decide how to act in a particular context or space. In the home, there are norms about what people should and should not do. If you are a guest, you try to see if people take their shoes off at the door; you don’t rearrange things on a bookshelf; and so on.

Most software is built to work for as many people as possible; this is called generalization. Norms stand in the way of generalization. Today's technology isn't good enough to adapt to every possible situation. One strategy is to simplify the software's functionality and let humans enforce norms. For example, when multiple people talk to an Echo at the same time, Alexa will either not understand or act on the last command. Multi-turn conversations between multiple people are still in their infancy. This is fine when there are understood norms, for example between my partner and me, but it doesn't work so well when you and a child are both trying to shout commands.


Shared experiences can be challenging, as when a parent and child yell at an Amazon Echo to play what they each want.

Norms are interesting because they tend to be learned and negotiated over time, yet they are invisible. Experiences built for communal use need to be aware of these invisible norms through cues that can be detected from people's actions and words. This gets especially tricky because a conversation between two people could include information subject to different expectations (in the Contextual Integrity sense) about how that information is used. With enough data, models can be created to "read between the lines" in both helpful and dangerous ways.

Video games already cater to multiple people’s experiences. With the Nintendo Switch or any other gaming system, several people can play together in a joint experience. However, the rules governing these experiences are never applied to, say, Netflix. The assumption is always that one person holds the remote. How might these experiences be improved if software could accept input from multiple sources (remote controls, voice, etc.) to build a selection of movies that is appropriate for everyone watching?

Communal experience problems highlight inequalities in households. With women doing more household coordination than ever, there is a need to rebalance these tasks. When coordination involves the entire family, it is usually relegated to personal devices, generally the wife's mobile phone (though there is a digital divide outside the US). Without moving these experiences into a place where everyone can participate, we will perpetuate these inequalities.

So far, technology has been great at intermediating people for coordination through systems like text messaging, social networks, and collaborative documents. But we don't build interaction paradigms that allow multiple people to engage at the same time in their communal spaces. To do this, we need to address the fact that the norms dictating appropriate behavior are invisible and pervasive in the spaces where these technologies are deployed.

Ownership

Many of these devices are not really owned by the people who buy them. As part of the current trend towards subscription-based business models, the device won’t function if you don’t subscribe to a service. Those services have license agreements that specify what you can and cannot do (which you can read if you have a few hours to spare and can understand them).

For example, this has been an issue for fans of Amazon's Blink camera. The home automation industry is fragmented: there are many vendors, each with its own application to control its particular devices. But most people don't want to use different apps to control their lighting, their television, their security cameras, and their locks. Therefore, people have started to build controllers that span the different ecosystems. Doing so has caused Blink users to get their accounts suspended.

What’s even worse is that these license agreements can change whenever the company wants. Licenses are frequently modified with nothing more than a notification, after which something that was previously acceptable is now forbidden. In 2020, Wink suddenly applied a monthly service charge; if you didn’t pay, the device would stop working. Also in 2020, Sonos caused a stir by saying they were going to “recycle” (disable) old devices. They eventually changed their policy.

The issue isn't just what you can do with your devices; it's also what happens to the data they create. Amazon's Ring partnership with one in ten US police departments troubles many privacy groups because it creates a vast surveillance program. What if you don't want to be a part of the police state? Make sure you check the right box and read your terms of service. If you're designing a device, you need to require users to opt in to data sharing (especially as regions adopt GDPR- and CCPA-like regulations).

Techniques like federated learning, which avoid latency issues and mass data collection, are on the horizon, but it remains to be seen whether they will be satisfactory to companies that collect data. Is there a benefit to both organizations and their customers in limiting or obfuscating the transmission of data away from the device?

Ownership is particularly tricky for communal devices. The expectations of consumers who put something in their home collide directly with the way rent-to-use services are pitched. Until we acknowledge that hardware put in a home is different from a cloud service, we will never get it right.

Lots of problems, now what?

Now that we have dived into the various problems that communal devices raise, what do we do about them? In the next article we discuss a way to map the communal space. This helps build a better understanding of how a communal device fits into the context of the space and the services that already exist there.

We will also provide a list of dos and don’ts for leaders, developers, and designers to consider when building a communal device.

Categories: Technology

Thinking About Glue

O'Reilly Radar - Tue, 2021/07/13 - 06:28

In Glue: the Dark Matter of Software, Marcel Weiher asks why there’s so much code. Why is Microsoft Office 400 million lines of code? Why are we always running into the truth of Alan Kay’s statement that “Software seems ‘large’ and ‘complicated’ for what it does”?

Weiher makes an interesting claim: the reason we have so much code is glue code, the code that connects everything together. It's "invisible and massive"; it's "deemed not important"; and, perhaps most important, it's "quadratic": the amount of glue code is proportional to the square of the number of things you need to glue. That feels right, and in the past few years we've become increasingly aware of the skyrocketing number of dependencies in any software project significantly more complex than "Hello, World!" We can all add our own examples: the classic article Hidden Technical Debt in Machine Learning Systems shows a block diagram of a system in which machine learning is a tiny block in the middle, surrounded by all sorts of infrastructure: data pipelines, resource management, configuration, etc. Object-Relational Mapping (ORM) frameworks are a kind of glue between application software and databases. Web frameworks facilitate gluing together components of various types, along with gluing that front end to some kind of back end. The list goes on.

Weiher makes another important point: the simplest abstraction for glue is the Unix pipe (|), although he points out that pipes are not the only solution. Anyone who has used Unix or a variant (and certainly anyone who has read–or in my case, written–chunks of Unix Power Tools) realizes how powerful the pipe is. A standard way to connect tools that are designed to do one thing well: that’s important.

But there’s another side to this problem, and one that we often sweep under the rug. A pipe has two ends: something that’s sending data, and something that’s receiving it. The sender needs to send data in a format that the receiver understands, or (more likely) the receiver needs to be able to parse and interpret the sender’s data in a way that it understands. You can pipe all the log data you want into an awk script (or perl, or python), but that script is still going to have to parse that data to make it interpretable. That’s really what those millions of lines of glue code do: either format data so the receiver can understand it or parse incoming data into a usable form. (This task falls more often on the receiver than the sender, largely because the sender often doesn’t—and shouldn’t—know anything about the receiver.)
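As a concrete (and invented) example of the kind of glue this paragraph describes, here is a small receiver that parses line-oriented log data piped in on standard input into a structure the rest of a program can use; the "TIMESTAMP LEVEL MESSAGE" log format is hypothetical.

    import json
    import sys

    def parse_log_line(line: str) -> dict:
        """Parse one line of a hypothetical "TIMESTAMP LEVEL MESSAGE" log format into a dict.

        This is the unglamorous work glue code does: adapting the sender's format
        into whatever structure the receiver actually needs.
        """
        timestamp, level, message = line.rstrip("\n").split(" ", 2)
        return {"timestamp": timestamp, "level": level, "message": message}

    if __name__ == "__main__":
        # Usage: cat app.log | python parse_logs.py
        for line in sys.stdin:
            if line.strip():
                print(json.dumps(parse_log_line(line)))

Multiply this by every pair of tools that need to talk to each other, and the quadratic growth Weiher describes starts to feel very real.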

From this standpoint, the real problem with glue isn’t moving data, though the Unix pipe is a great abstraction; it’s data integration. In a discussion about blockchains and medical records, Jim Stogdill once said “the real problem has nothing to do with blockchains. The real problem is data integration.” You can put all the data you want on a blockchain, or in a data warehouse, or in a subsurface data ocean the size of one of Jupiter’s moons, and you won’t solve the problem that application A generates data in a form that application B can’t use. If you know anything about medical records (and I know very little), you know that’s the heart of the problem. One major vendor has products that aren’t even compatible with each other, let alone competitors’ systems. Not only are data formats incompatible, the meanings of fields in the data are often different in subtle ways. Chasing down those differences can easily run to hundreds of thousands, if not millions, of lines of code.

Pipes are great for moving data from one place to another. But there’s no equivalent standard for data integration. XML might play a role, but it only solves the easy part of the problem: standardizing parsing has some value, but the ease of parsing XML was always oversold, and the real problems stem more from schemas than data formats. (And please don’t play the “XML is human-readable and -writable” game.) JSON strikes me as XML for “pickling” JavaScript objects, replacing angle brackets with curly braces: a good idea that has gotten a lot of cross-language support, but like XML neglects the tough part of the problem.

Is data integration a problem that can be solved? In networking, we have standards for what data means and how to send it. All those TCP/IP packet headers that have been in use for almost 40 years (the first deployment of IPv4 was in 1982) have kept data flowing between systems built by different vendors. The fields in the header have been defined precisely, and new protocols have been built successfully at every layer of the network stack.

But this kind of standardization doesn't solve the N squared problem. In a network stack, TCP talks to TCP; HTTPS talks to HTTPS. (Arguably, it keeps the N squared problem from becoming an N cubed problem.) The network stack designs the N squared problem out of existence, at least as far as the network itself is concerned, but that doesn't help at the application layer. When we're talking about applications, a medical app needs to understand medical records, financial records, regulatory constraints, insurance records, reporting systems, and probably dozens more. Nor does standardization really solve the problem of new services. IPv4 desperately needs to be replaced (and IPv6 has been around since 1995), but IPv6 has been "5 years in the future" for two decades now. Hack upon hack has kept IPv4 workable, but will layer upon layer of hacks work if we're extending medical or financial applications?

Glue code expands as the square of the number of things that are glued. The need to glue different systems together is at the core of the problems facing software development; as systems become more all-encompassing, the need to integrate with different systems increases. The glue–which includes code written for data integration–becomes its own kind of technical debt, adding to the maintenance burden. It’s rarely (if ever) refactored or just plain removed because you always need to “maintain compatibility” with some old system.  (Remember IE6?)

Is there a solution? In the future, we'll probably need to integrate more services. The glue code will be more complex, since it will probably need to live in some "zero trust" framework (another issue, but an important one). Still, knowing that you're writing glue code, keeping track of where it is, and being proactive about removing it when it's no longer needed will keep the problem manageable. Designing interfaces carefully and observing standards will minimize the need for glue. In the final analysis, is glue code really a problem? Programming is ultimately about gluing things together, whether they're microservices or programming libraries. Glue isn't some kind of computational waste; it's what holds our systems together. Glue development is software development.

Categories: Technology

July 8th Virtual Meeting Topics

PLUG - Wed, 2021/07/07 - 14:45

We'll have 2 presentations this month: "MySQL 8.0 Indexes, Histograms, and Other Ways to Speed Up Your Queries" and "My Presentation Creation Stack with AsciiDoc".
Attend the meeting by visiting https://lufthans.bigbluemeeting.com/b/plu-yuk-7xx on the 8th of July at 7pm MST.

Dave Stokes: MySQL 8.0 Indexes, Histograms, and Other Ways to Speed Up Your Queries

Description:
Improving the performance of database queries is often seen as a Harry Potter-ish dark art. In reality it is simple engineering and revolves around providing the query optimizer the best information about your data! And how do you do that? Well, you can start with properly planned indexes and histograms. We will also venture into some other areas that will help you speed up your queries.

About Dave:
Dave Stokes is a MySQL Community Manager for Oracle and the author of 'MySQL & JSON - A Practical Programming Guide'.


der.hans: My Presentation Creation Stack with AsciiDoc

Description:
Creating information rich presentations in AsciiDoc is easy.
AsciiDoc was created as a markup for making DocBook books.
Since it is plain text, AsciiDoc lends itself to writing, editing and using revision control.

The presentation introduces AsciiDoc and covers both advantages and disadvantages of using it for presentations.
I will also cover some of the presentation tools I've used previously.
Then I'll illuminate why I now prefer AsciiDoc and offer some suggestions for using it in presentations.

About der.hans:
der.hans is a technologist, Free Software advocate, parent and spouse.

Hans is chairman of the Phoenix Linux User Group (PLUG), chair for SeaGL Finance committee, founder of SeaGL Career Expo, BoF organizer for the Southern California Linux Expo (SCaLE) and founder of the Free Software Stammtisch. He presents regularly at large community-led conferences (SeaGL, SCaLE, LCA, FOSSASIA, Tübix, CLT, LFNW, OLF, SELF, GeekBeacon Fest) and many local groups.

Currently a Customer Data Engineer at Object Rocket. Public statements are not representative of $dayjob.

Fediverse/Mastodon - https://floss.social/@FLOX_advocate

Radar trends to watch: July 2021

O'Reilly Radar - Tue, 2021/07/06 - 10:12

Certainly the biggest news of the past month has been a continuation of the trend towards regulating the biggest players in the tech industry.  The US House of Representatives is considering 5 antitrust bills that would lead to major changes in the way the largest technology companies do business; and the Biden administration has appointed a new Chair of the Federal Trade Commission who will be inclined to use these regulations aggressively. Whether these bills pass in their current form, how they are challenged in court, and what changes they will lead to is an open question.  (Late note: Antitrust cases against Facebook by the FTC and state governments based on current law were just thrown out of court.)

Aside from that, we see AI spreading into almost every area of computing; this list could easily have a single AI heading that subsumes programming, medicine, security, and everything else.

AI and Data
  • A new algorithm allows autonomous vehicles to locate themselves using computer vision (i.e., without GPS) regardless of the season; it works even when the terrain is snow-covered.
  • An AI-based wildfire detection system has been deployed in Sonoma County. It looks for smoke plumes, and can monitor many more cameras than a human.
  • Researchers are investigating how racism and other forms of abuse enter AI models like GPT-3, and what can be done to prevent their appearance in the output. It’s essential for AI to “understand” racist content, but equally essential for it not to generate that content.
  • Google has successfully used Reinforcement Learning to design the layout for the next generation TPU chip. The layout process took 6 hours, and replaced weeks of human effort. This is an important breakthrough in the design of custom integrated circuits.
  • Facebook has developed technology to identify the source from which deepfake images originate. “Fingerprints” (distortions in the image) make it possible to identify the model that generated the images, and possibly to track down the creators.
  • Adaptive mood control is a technique that autonomous vehicles can use to detect passengers’ emotions and drive accordingly, making it easier for humans to trust the machine. We hope this doesn’t lead AVs to drive faster when the passenger is angry or frustrated.
  • IBM has developed Uncertainty Quantification 360, a set of open source tools for quantifying the uncertainty in AI systems. Understanding uncertainty is a big step towards building trustworthy AI and getting beyond the idea that the computer is always right. Trust requires understanding uncertainty.
  • Waymo’s autonomous trucks will begin carrying real cargo between Houston and Fort Worth, in a partnership with a major trucking company.
  • GPT-2 can predict brain activity and comprehension in fMRI studies of patients listening to stories, possibly indicating that in some way its processes correlate to brain function.
  • GPT-J is a language model with performance similar to GPT-3.  The code and weights are open source.
  • It appears possible to predict preferences directly by comparing brain activity to activity of others (essentially, brain-based collaborative filtering). A tool for advertising or for self-knowledge?
  • Feature stores are tools that automate building pipelines to deliver data to ML applications in production. Tecton, which originated with Uber's Michelangelo, is one of the early commercial products available.
  • How does machine learning work with language? Everything You Ever Said doesn't answer the question, but lets you play with an NLP engine by pasting in a text, then adding or subtracting concepts to see how the text is transformed. (Based on GloVe, a pre-GPT model.)
  • The HateCheck dataset tests the ability of AI applications to detect hate speech correctly. Hate speech is a hard problem; being too strict causes systems to reject content that shouldn’t be classified as hate speech, while being too lax allows hate speech through.
Ethics
  • Twitter has built a data ethics group aimed at putting ethics into practice, in addition to research.  Among others, the group includes Rumman Chowdhury and Kristian Lum.
  • A study of the effect of noise on fairness in lending shows that insufficient (hence noisier) data is as big a problem as biased data. Poor people have less credit history, which means that their credit scores are often inaccurate. Correcting problems arising from noise is much more difficult than dealing with problems of bias.
  • Andrew Ng's newsletter, The Batch, reports on a survey of executives showing that most companies are not practicing "responsible AI" and don't even understand the issues. There is no consensus about the importance (or even the meaning) of "ethics" for AI.
  • Using AI to screen resumes is a problem in itself, but AI doing the interview? That’s taking problematic to a new level. It can be argued that AI, when done properly, is less subject to bias than a human interviewer, but we suspect that AI interviewers present more problems than solutions.
Web
  • WebGPU is a proposal for a standard API that makes GPUs directly accessible to web pages for rendering and computation.
  • An end to providing cookie consent for every site you visit?  The proposed ADPC (advanced data protection control) standard will allow users to specify privacy preferences once.
  • Using social media community guidelines as a political weapon: the Atajurt Kazakh Human Rights channel, which publishes testimonies from people imprisoned in China’s internment camps, has been taken down repeatedly as a result of coordinated campaigns.
Security
  • Microsoft is working on eliminating passwords! Other companies should take the hint. Microsoft is stressing biometrics (which have their own problems) and multi-factor authentication.
  • Supply chain security is very problematic.  Microsoft admits to an error in which they mistakenly signed a device driver that was actually a rootkit, causing security software to ignore it. The malware somehow slipped through Microsoft’s signing process.
  • Markpainting is a technology for defeating attempts to create a fake image by adding elements to the picture that aren’t visible, but that will become visible when the image is modified (for example, to eliminate a watermark).
  • Amazon Sidewalk lets Amazon devices connect to other open WiFi nets to extend their range and tap others’ internet connections. Sidewalk is a cool take on decentralized networking. It is also a Very Bad Idea.
  • Authentication using gestures, hand shapes, and geometric deep learning? I’m not convinced, but this could be a viable alternative to passwords and crude biometrics. It would have to work for people of all skin colors, and that has consistently been a problem for vision-based products.
  • According to Google, Rowhammer attacks are gaining momentum–and will certainly gain even more momentum as feature sizes in memory chips get smaller. Rowhammer attacks repeatedly access a single row in a memory chip, hoping to corrupt adjacent bits.
  • While details are sketchy, the FBI was able to recover the Bitcoin that Colonial Pipeline paid to DarkSide to restore systems after their ransomware attack. The FBI has been careful to say that it can't promise recovering payments in other cases. Whether this recovery reflects poor opsec on the part of the criminals, or that Bitcoin is more easily de-anonymized than most people think, it's clear that secrecy and privacy are relative.
Design and User Experience
  • Communal Computing is about designing devices that are inherently shared: home assistants, home automation, and more. The “single account/user” model doesn’t work.
  • A microphone that only “hears” frequencies above the human hearing range can be used to detect human activities (for example, in a smart home device) without recording speech.
  • Digital Twins in aerospace at scale: One problem with the adoption of digital twins is that the twin is very specific to a single device. This research shows that it’s possible to model real-world objects in ways that can be reused across collections of objects and different applications.
Medicine
  • The Open Insulin Foundation is dedicated to creating the tools necessary to produce insulin at scale. This is the next step in a long-term project by Anthony DiFranco and others to challenge the pharmaceutical companies' monopoly on insulin production and create products at a small fraction of the price.
  • Where’s the work on antivirals and other treatments for COVID-19? The answer is simple: Vaccines are very profitable. Antivirals aren’t. This is a huge, institutional problem in the pharmaceutical industry.
  • The National Covid Cohort Collaborative (N3C) is a nationwide database of anonymized medical records of COVID patients. What’s significant isn’t COVID, but that N3C is a single database, built to comply with privacy laws, that’s auditable, and that’s open for any group to make research proposals.
  • Can medical trials be sped up by re-using control data (data from patients who were in the control group) from previous trials? Particularly for rare and life-threatening diseases, getting trial volunteers is difficult because nobody wants to be assigned to the control group.
  • A remote monitoring patch for COVID patients uses AI to understand changes in the patient’s vital signs, allowing medical staff to intervene immediately if a patient’s condition worsens. Unlike most such devices, it was trained primarily on Black and Hispanic patients.
  • Machine learning in medicine is undergoing a credibility crisis: poor data sets with limited diversity lead to poor results.
Programming
  • Microsoft, OpenAI, and GitHub have announced a new service called Copilot that uses AI to make suggestions to programmers as they are writing code (currently in “technical preview”).  It is truly a cybernetic pair programmer.
  • Windows 11 will run Android apps. If nothing else, this is a surprise. Android apps will be provided via the Amazon store, not Google Play.
  • Microsoft's PowerFx is a low-code programming language based on Excel formulas (which now include lambdas). Input and output are through what looks like a web page. What does it mean to strip Excel of its 2D grid? Is this a step forward or backward for low-code computing?
  • Open Source Insights is a Google project for investigating the dependency chain of any open source project. Its ability currently is limited to a few major packaging systems (including npm, Cargo, and maven), but it will be expanded.
  • Quantum computing’s first application will be in researching quantum mechanics: understanding the chemistry of batteries, drugs, and materials. In these applications, noise is an asset, not a problem.
Categories: Technology

Hand Labeling Considered Harmful

O'Reilly Radar - Wed, 2021/06/23 - 05:34

We are traveling through the era of Software 2.0, in which the key components of modern software are increasingly determined by the parameters of machine learning models, rather than hard-coded in the language of for loops and if-else statements. There are serious challenges with such software and models, including the data they’re trained on, how they’re developed, how they’re deployed, and their impact on stakeholders. These challenges commonly result in both algorithmic bias and lack of model interpretability and explainability.

There’s another critical issue, which is in some ways upstream to the challenges of bias and explainability: while we seem to be living in the future with the creation of machine learning and deep learning models, we are still living in the Dark Ages with respect to the curation and labeling of our training data: the vast majority of labeling is still done by hand.

There are significant issues with hand labeling data:

  • It introduces bias, and hand labels are neither interpretable nor explainable.
  • There are prohibitive costs to hand labeling datasets (both financial costs and the time of subject matter experts).
  • There is no such thing as gold labels: even the most well-known hand labeled datasets have label error rates of at least 5% (ImageNet has a label error rate of 5.8%!).

We are living through an era in which we get to decide how human and machine intelligence interact to build intelligent software to tackle many of the world’s toughest challenges. Labeling data is a fundamental part of human-mediated machine intelligence, and hand labeling is not only the most naive approach but also one of the most expensive (in many senses) and most dangerous ways of bringing humans in the loop. Moreover, it’s just not necessary as many alternatives are seeing increasing adoption. These include:

  • Semi-supervised learning
  • Weak supervision
  • Transfer learning
  • Active learning
  • Synthetic data generation

These techniques are part of a broader movement known as Machine Teaching, a core tenet of which is getting both humans and machines each doing what they do best. We need to use expertise efficiently: the financial cost and time taken for experts to hand-label every data point can break projects, such as diagnostic imaging involving life-threatening conditions and security and defense-related satellite imagery analysis. Hand labeling in the age of these other technologies is akin to scribes hand-copying books post-Gutenberg.

There is also a burgeoning landscape of companies building products around these technologies, such as Watchful (weak supervision and active learning; disclaimer: one of the authors is CEO of Watchful), Snorkel (weak supervision), Prodigy (active learning), Parallel Domain (synthetic data), and AI Reverie (synthetic data).

Hand Labels and Algorithmic Bias

As Deb Raji, a Fellow at the Mozilla Foundation, has pointed out, algorithmic bias “can start anywhere in the system—pre-processing, post-processing, with task design, with modeling choices, etc.,” and the labeling of data is a crucial point at which bias can creep in.


Figure 1: Bias can start anywhere in the system. Image adapted from A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle by Harini Suresh and John Guttag.

High-profile cases of bias in training data resulting in harmful models include an Amazon recruiting tool that "penalized resumes that included the word 'women's,' as in 'women's chess club captain.'" Don't take our word for it: play the educational game Survival of the Best Fit, in which you're a CEO who uses a machine learning model to scale hiring decisions, and see how the model replicates the bias inherent in the training data. This point is key: as humans, we possess all types of biases, some harmful, others not so. When we feed hand labeled data to a machine learning model, it will detect those patterns and replicate them at scale. This is why David Donoho astutely observed that perhaps we should call ML models recycled intelligence rather than artificial intelligence. Of course, given the amount of bias in hand labeled data, it may be more apt to refer to it as recycled stupidity (hat tip to artificial stupidity).

The only way to interrogate the reasons for bias arising from hand labels is to ask the labelers themselves for their rationales, which is impractical, if not impossible, in the majority of cases: there are rarely records of who did the labeling; labeling is often outsourced via at-scale global APIs such as Amazon's Mechanical Turk; and when labels are created in-house, previous labelers are often no longer part of the organization.

Uninterpretable, Unexplainable

This leads to another key point: the lack of both interpretability and explainability in models built on hand labeled data. These are related concepts; broadly speaking, interpretability is about correlation, whereas explainability is about causation. The former involves thinking about which features are correlated with the output variable, while the latter is concerned with why certain features lead to particular labels and predictions. We want models that give us results we can explain and some notion of how or why they work. For example, in the ProPublica exposé of the COMPAS recidivism risk model, which made more false predictions that Black people would re-offend than it did for white people, it is essential to understand why the model makes the predictions it does. Lack of explainability and transparency were key ingredients of all the deployed-at-scale algorithms identified by Cathy O'Neil in Weapons of Math Destruction.

It may be counterintuitive that getting machines more in-the-loop for labeling can result in more explainable models but consider several examples:

  • There is a growing area of weak supervision, in which SMEs specify heuristics that the system then uses to make inferences about unlabeled data; the system calculates some potential labels, and the SME then evaluates those labels to determine where more heuristics might need to be added or tweaked. For example, when building a model of whether surgery was necessary based on medical transcripts, the SME may provide the following heuristic: if the transcription contains the term "anaesthesia" (or a regular expression similar to it), then surgery likely occurred; a minimal code sketch follows this list (check out Russell Jurney's "Hand labeling is the past" article for more on this).
  • In diagnostic imaging, we need to start cracking open the neural nets (such as CNNs and transformers)! SMEs could once again use heuristics to specify that tumors smaller than a certain size and/or of a particular shape are benign or malignant and, through such heuristics, we could drill down into different layers of the neural network to see what representations are learned where.
  • When your knowledge (via labels) is encoded in heuristics and functions, as above, this also has profound implications for models in production. When data drift inevitably occurs, you can return to the heuristics encoded in functions and edit them, instead of continually incurring the costs of hand labeling.
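
Here is a minimal sketch of what such heuristics can look like as code. The labeling functions, keywords, and the majority-vote combination below are illustrative only; real weak supervision systems (Snorkel, Watchful, and others) model the accuracies and conflicts of the heuristics rather than simply counting votes.

    import re

    SURGERY, NO_SURGERY, ABSTAIN = 1, 0, -1

    # Each labeling function encodes one SME heuristic and abstains when it has no opinion.
    def lf_anaesthesia(transcript: str) -> int:
        return SURGERY if re.search(r"\bana?esthesi", transcript, re.IGNORECASE) else ABSTAIN

    def lf_incision(transcript: str) -> int:
        return SURGERY if "incision" in transcript.lower() else ABSTAIN

    def lf_routine_checkup(transcript: str) -> int:
        return NO_SURGERY if "routine checkup" in transcript.lower() else ABSTAIN

    LABELING_FUNCTIONS = [lf_anaesthesia, lf_incision, lf_routine_checkup]

    def weak_label(transcript: str) -> int:
        """Combine the heuristics by simple majority vote (real systems model their accuracies)."""
        votes = [v for v in (lf(transcript) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
        if not votes:
            return ABSTAIN
        return SURGERY if votes.count(SURGERY) >= votes.count(NO_SURGERY) else NO_SURGERY

    print(weak_label("General anaesthesia was administered before the incision."))  # 1 (surgery)
    print(weak_label("Patient came in for a routine checkup."))                     # 0 (no surgery)
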
On Auditing

Amidst the increasing concern about model transparency, we are seeing calls for algorithmic auditing. Audits will play a key role in determining how algorithms are regulated and which ones are safe for deployment. One of the barriers to auditing is that high-performing models, such as deep learning models, are notoriously difficult to explain and reason about. There are several ways to probe this at the model level (such as SHAP and LIME), but that only tells part of the story. As we have seen, a major cause of algorithmic bias is that the data used to train it is biased or insufficient in some way.

There currently aren't many ways to probe for bias or insufficiency at the data level. For example, the only way to explain hand labels in training data is to talk to the people who labeled them. Active learning, on the other hand, allows for the principled creation of smaller datasets that have been intelligently sampled to maximize utility for a model, which in turn reduces the overall auditable surface area. An example of active learning would be the following (sketched in code below): instead of hand labeling every data point, the SME labels a representative subset of the data, which the system uses to make inferences about the unlabeled data. The system then asks the SME to label some of the unlabeled data, cross-checks its own inferences, and refines them based on the SME's labels. This is an iterative process that terminates once the system reaches a target accuracy. Less data means less headache with respect to auditability.
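
A compressed sketch of that loop follows. Everything here is schematic: the function names, the confidence-based sampling, and the stopping rule are placeholders for whatever model and SME workflow a real system would use.

    def active_learning_loop(unlabeled, label_fn, train_fn, confidence_fn,
                             batch_size=10, target_accuracy=0.95, max_rounds=20):
        """Ask the SME (label_fn) to label only the examples the model is least sure about.

        unlabeled:     list of unlabeled examples
        label_fn:      SME labeling function: example -> label
        train_fn:      (examples, labels) -> (model, validation_accuracy)
        confidence_fn: (model, example) -> model's confidence in its own prediction
        """
        # Seed with a small representative sample.
        labeled, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labels = [label_fn(x) for x in labeled]

        model, accuracy = train_fn(labeled, labels)
        for _ in range(max_rounds):
            if accuracy >= target_accuracy or not unlabeled:
                break
            # Query the SME on the examples the model is least confident about.
            unlabeled.sort(key=lambda x: confidence_fn(model, x))
            batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
            labeled.extend(batch)
            labels.extend(label_fn(x) for x in batch)
            model, accuracy = train_fn(labeled, labels)
        return model, labeled, labels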

Weak supervision more directly encodes expertise (and hence bias) as heuristics and functions, making it easier to evaluate where labeling went awry. For more opaque methods, such as synthetic data generation, it might be a bit difficult to interpret why a particular label was applied, which may actually complicate an audit. The methods we choose at this stage of the pipeline are important if we want to make sure the system as a whole is explainable.

The Prohibitive Costs of Hand Labeling

There are significant and differing forms of costs associated with hand labeling. Giant industries have been erected to deal with the demand for data-labeling services. Look no further than Amazon Mechanical Turk and all other cloud providers today. It is telling that data labeling is becoming increasingly outsourced globally, as detailed by Mary Gray in Ghost Work, and there are increasingly serious concerns about the labor conditions under which hand labelers work around the globe.

The sheer amount of capital involved was evidenced by Scale AI raising $100 million in 2019, bringing its valuation to over $1 billion at a time when its business model revolved solely around using contractors to hand label data (it is telling that they're now doing more than hand labeling alone).

Money isn't the only cost, and quite often it isn't where the bottleneck or rate-limiting step occurs. Rather, the bandwidth and time of experts is the scarcest resource. Being scarce, it is often expensive, but much of the time it isn't even available (on top of this, the time it takes data scientists to correct labeling errors is very expensive). Take financial services, for example, and the question of whether or not you should invest in a company based on information about the company scraped from various sources. In such a firm, there will only be a small handful of people who can make that call, so labeling each data point would be incredibly expensive, and that's if the SME even has the time.

This is not vertical-specific. The same challenge occurs in labeling legal texts for classification: is this clause talking about indemnification or not? And in medical diagnosis: is this tumor benign or malignant? As dependence on expertise increases, so does the likelihood that limited access to SMEs becomes a bottleneck.

The third cost is a cost to accuracy, reality, and ground truth: hand labels are often simply wrong. The authors of a recent study from MIT identified "label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets." They estimated an average error rate of 3.4% across the datasets and showed that, in some instances, ML model performance increases significantly once labels are corrected. Also consider that in many cases ground truth isn't easy to find, if it exists at all. Weak supervision makes room for these cases (which are the majority) by assigning probabilistic labels without relying on ground truth annotations. It's time to think statistically and probabilistically about our labels. There is good work happening here, such as Aka et al.'s (Google) recent paper Measuring Model Biases in the Absence of Ground Truth.

The costs identified above are not one-off. When you train a model, you have to assume you’re going to train it again if it lives in production. Depending on the use case, that could be frequent. If you’re labeling by hand, it’s not just a large upfront cost to build a model. It is a set of ongoing costs each and every time.


Figure 2: There are no "gold labels": even the most well-known hand labeled datasets have label error rates of at least 5% (ImageNet has a label error rate of 5.8%!).

The Efficacy of Automation Techniques

In terms of performance, even if getting machines to label much of your data produces slightly noisier labels, your models are often better off with 10 times as many of them. To dive a bit deeper: there are gains to be made by increasing training set size even if it means reducing overall label accuracy, but if you're training classical ML models, only up to a point (past that point, the model starts to see a dip in predictive accuracy). "Scaling to Very Very Large Corpora for Natural Language Disambiguation" (Banko & Brill, 2001) demonstrates this in a traditional ML setting by exploring the relationship between hand labeled data, automatically labeled data, and subsequent model performance. A more recent paper, "Deep Learning Scaling Is Predictable, Empirically" (2017), explores the quantity/quality relationship relative to modern state-of-the-art model architectures, illustrating that SOTA architectures are data hungry, and that accuracy improves as a power law as training sets grow:

We empirically validate that DL model accuracy improves as a power-law as we grow training sets for state-of-the-art (SOTA) model architectures in four machine learning domains: machine translation, language modeling, image processing, and speech recognition. These power-law learning curves exist across all tested domains, model architectures, optimizers, and loss functions.

The key question isn't "should I hand label my training data or should I label it programmatically?" It should instead be "which parts of my data should I hand label and which parts should I label programmatically?" According to these papers, by introducing expensive hand labels sparingly into largely programmatically generated datasets, you can strike an effort/accuracy tradeoff on SOTA architectures that wouldn't be possible with hand labeling alone.
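
As a toy illustration of that tradeoff (not a reproduction of either paper's experiments), the following sketch trains the same classifier on a small set of clean labels and on a much larger set of labels that have been artificially corrupted; the dataset, noise rate, and model are all arbitrary choices.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=20_000, n_features=20, n_informative=10, random_state=0)
    X_train, y_train, X_test, y_test = X[:15_000], y[:15_000], X[15_000:], y[15_000:]

    def accuracy_with(n_labels: int, noise: float) -> float:
        """Train on n_labels examples whose labels are flipped with probability `noise`."""
        Xs, ys = X_train[:n_labels], y_train[:n_labels].copy()
        flips = rng.random(n_labels) < noise
        ys[flips] = 1 - ys[flips]
        model = LogisticRegression(max_iter=1000).fit(Xs, ys)
        return model.score(X_test, y_test)

    print(accuracy_with(n_labels=500, noise=0.0))     # a few clean labels
    print(accuracy_with(n_labels=10_000, noise=0.1))  # many noisier labels

On synthetic data like this, the larger, noisier training set frequently matches or beats the small, clean one, which is the intuition behind mixing a few expensive hand labels into a much larger programmatically labeled set.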

The stacked costs of hand labeling wouldn’t be so challenging were they necessary, but the fact of the matter is that there are so many other interesting ways to get human knowledge into models. There’s still an open question around where and how we want humans in the loop and what’s the right design for these systems. Areas such as weak supervision, self-supervised learning, synthetic data generation, and active learning, for example, along with the products that implement them, provide promising avenues for avoiding the pitfalls of hand labeling. Humans belong in the loop at the labeling stage, but so do machines. In short, it’s time to move beyond hand labels.

Many thanks to Daeil Kim for feedback on a draft of this essay.

Categories: Technology

Two economies. Two sets of rules.

O'Reilly Radar - Tue, 2021/06/22 - 06:07

At one point early this year, Elon Musk briefly became the richest person in the world. After a 750% increase in Tesla’s stock market value added over $180 billion to his fortune, he briefly had a net worth of over $200 billion. It’s now back down to “only” $155 billion.

Understanding how our economy produced a result like this—what is good about it and what is dangerous—is crucial to any effort to address the wild inequality that threatens to tear our society apart.

The betting economy versus the operating economy

In response to the news of Musk's surging fortune, Bernie Sanders tweeted his outrage, contrasting Musk's gains with the $7.25 federal minimum wage.

Bernie was right that a $7.25 minimum wage is an outrage to human decency. If the minimum wage had kept up with increases in productivity since 1979, it would be over $24 by now, putting a two-worker family into the middle class. But Bernie was wrong to imply that Musk’s wealth increase was at the expense of Tesla’s workers. The median Tesla worker makes considerably more than the median American worker.

Elon Musk’s wealth doesn’t come from him hoarding Tesla’s extractive profits, like a robber baron of old. For most of its existence, Tesla had no profits at all. It became profitable only last year. But even in 2020, Tesla’s profits of $721 million on $31.5 billion in revenue were small—only slightly more than 2% of sales, a bit less than those of the average grocery chain, the least profitable major industry segment in America.

No, Musk won the lottery, or more precisely, the stock market beauty contest. In theory, the price of a stock reflects a company’s value as an ongoing source of profit and cash flow. In practice, it is subject to wild booms and busts that are unrelated to the underlying economics of the businesses that shares of stock are meant to represent.

Why is Musk so rich? The answer tells us something profound about our economy: he is wealthy because people are betting on him. But unlike a bet in a lottery or at a racetrack, in the vast betting economy of the stock market, people can cash out their winnings before the race has ended.

This is one of the biggest unacknowledged drivers of inequality in America, the reason why one segment of our society prospered so much during the pandemic while the other languished.

What are the odds?

If the stock market is like a horse race where people can cash out their bets while the race is still being run, what does it mean for the race to finish? For an entrepreneur or an early-stage investor, an IPO is a kind of finish, the point where they can sell previously illiquid shares on to others. An acquisition or a shutdown, either of which puts an end to a company’s independent existence, is another kind of ending. But it is also useful to think of the end of the race as the point in time at which the stream of company profits will have repaid the investment.

Since ownership of public companies is spread across tens of thousands of people and institutions, it’s easier to understand this point by imagining a small private company with one owner, say, a home construction business or a storage facility or a car wash. If it cost $1 million to buy the business, and it delivered $100,000 of profit a year, the investment would be repaid in 10 years. If it delivered $50,000 in profit, it would take 20. And of course, those future earnings would need to be discounted at some rate, since a dollar received 20 years from now is not worth as much as a dollar received today. This same approach works, in theory, for large public companies. Each share is a claim on a fractional share of the company’s future profits and the present value that people put on that profit stream.

This is, of course, a radical oversimplification. There are many more sophisticated ways to value companies, their assets, and their prospects for future streams of profits. But what I’ve described above is one of the oldest, the easiest to understand, and the most clarifying. It is called the price/earnings ratio, or simply the P/E ratio. It’s the ratio between the price of a single share of stock and the company’s earnings per share (its profits divided by the number of shares outstanding). What the P/E ratio gives, in effect, is a measure of how many years of current profits it would take to pay back the investment.
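
For readers who want to see the arithmetic, here is a minimal sketch of that payback intuition. The $1 million price, $100,000 annual profit, and 5% discount rate are illustrative assumptions carried over from the example above; the function names are hypothetical.

    # Sketch of the payback intuition behind the P/E ratio.
    # The $1M price, $100K profit, and 5% discount rate are illustrative assumptions.
    def pe_ratio(price_per_share, earnings_per_share):
        return price_per_share / earnings_per_share

    def discounted_payback_years(price, annual_profit, discount_rate=0.05, max_years=100_000):
        """Years of discounted profits needed to repay the purchase price."""
        total = 0.0
        for year in range(1, max_years + 1):
            total += annual_profit / (1 + discount_rate) ** year
            if total >= price:
                return year
        return None  # the discounted stream never adds up to the price

    print(pe_ratio(1_000_000, 100_000))                  # 10.0: ten years of current profits
    print(discounted_payback_years(1_000_000, 100_000))  # 15: discounting stretches the wait

The undiscounted payback in years is just the P/E ratio itself, which is what turns the Tesla figure of 1,396 discussed below into a wait of almost 1,400 years.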

The rate of growth also plays a role in a company’s valuation. For example, imagine a business with $100 million in revenue and a 10% profit margin, earning $10 million a year. How much it is worth to own that asset depends on how fast it is growing and what stage of its lifecycle it was in when you bought it. If you were lucky enough to own that business when it had only $1 million in revenue and, say, $50,000 in profits, you would now be earning 200x as much as you were when you made your original investment. If a company grows to hundreds of billions in revenue and tens of billions in profits, as Apple, Microsoft, Facebook, and Google have done, even a small investment early on that is held for the long haul can make its lucky owner into a billionaire. Tesla might be one of these companies, but if so, the opportunity to buy its future cheaply is long past because it is already so highly valued. The P/E ratio helps you understand the magnitude of the bet you are making at today’s prices.

The average P/E ratio of the S&P 500 has varied over time as “the market” (the aggregate opinion of all investors) goes from bullish about the future to bearish, either about specific stocks or about the market as a whole. Over the past 70 years, the ratio has ranged from a low of 7.22 in 1950 to almost 45 today. (A note of warning: it was only 17 on the eve of the Great Depression.)

What today’s P/E ratio of 44.8 means is that, on average, the 500 companies that make up the S&P 500 are valued at about 45 years’ worth of present earnings. Most companies in the index are worth less, and some far more. In today’s overheated market, it is often the case that the more certain the outcome, the less valuable a company is considered to be. For example, despite their enormous profits and huge cash hoards, Apple, Google, and Facebook have ratios much lower than you might expect: about 30 for Apple, 34 for Google, and 28 for Facebook. Tesla at the moment of Elon Musk’s peak wealth? 1,396.

Let that sink in. You’d have had to wait almost 1,400 years to get your money back if you’d bought Tesla stock this past January and simply relied on taking home a share of its profits. Tesla’s more recent quarterly earnings are a bit higher, and its stock price quite a bit lower, so now you’d only have to wait about 600 years.

Of course, it’s certainly possible that Tesla will so dominate the auto industry and related energy opportunities that its revenues could grow from its current $28 billion to hundreds of billions with a proportional increase in profits. But as Rob Arnott, Lillian Wu, and Bradford Cornell point out in their analysis “Big Market Delusion: Electric Vehicles,” electric vehicle companies are already valued at roughly the same amount as the entire rest of the auto industry despite their small revenues and profits and despite the likelihood of more, rather than less, competition in future. Barring some revolution in the fundamental economics of the business, current investors are likely paying now for the equivalent of hundreds of years of future profits.

So why do investors do this? Simply put: because they believe that they will be able to sell their shares to someone else at an even higher price. In times where betting predominates in financial markets, what a company is actually worth by any intrinsic measure seems to have no more meaning than the actual value of tulips during the 17th century Dutch “tulip mania.” As the history of such moments teaches, eventually the bubble does pop.

This betting economy, within reason, is a good thing. Speculative investment in the future gives us new products and services, new drugs, new foods, more efficiency and productivity, and a rising standard of living. Tesla has kickstarted a new gold rush in renewable energy, and given the climate crisis, that is vitally important. A betting fever can be a useful collective fiction, like money itself (the value ascribed to pieces of paper issued by governments) or the wild enthusiasm that led to the buildout of railroads, steel mills, or the internet. As economist Carlota Perez has noted, bubbles are a natural part of the cycle by which revolutionary new technologies are adopted.

Sometimes, though, the betting system goes off the rails. Tesla’s payback may take centuries, but it is the forerunner of a necessary industrial transformation. But what about the payback on companies such as WeWork? How about Clubhouse? Silicon Valley is awash in companies that have persuaded investors to value them at billions despite no profits, no working business model, and no pathway to profitability. Their destiny, like WeWork’s or Katerra’s, is to go bankrupt.

John Maynard Keynes, the economist whose idea that it was essential to invest in the demand side of the economy and not just the supply side helped bring the world out of the Great Depression, wrote in his General Theory of Employment, Interest and Money, “Speculators may do no harm as bubbles on a steady stream of enterprise. But the position is serious when enterprise becomes the bubble on a whirlpool of speculation. When the capital development of a country becomes a by-product of the activities of a casino, the job is likely to be ill-done.”

In recent decades, we have seen the entire economy lurch from one whirlpool of speculation to another. And as at the gambling table, each lurch represents a tremendous transfer of wealth from the losers to the winners. The dot-com bust. The subprime mortgage meltdown. Today’s Silicon Valley “unicorn” bubble. The failures to deliver on their promises by WeWork, Katerra, and their like are just the start of yet another bubble popping.

Why this matters

Those at the gaming table can, for the most part, afford to lose. They are disproportionately wealthy. Nearly 52% of stock market value is held by the top 1% of Americans, with another 35% of total market value held by the next 9%. The bottom 50% hold only 0.7% of stock market wealth.

Bubbles, though, are only an extreme example of a set of dynamics that shape our economy far more widely than we commonly understand. The leverage provided by the betting economy drives us inevitably toward a monoculture of big companies. The local bookstore trying to compete with Amazon, the local cab company competing with Uber, the neighborhood dry cleaner, shopkeeper, accountant, fitness studio owner, or any other local, privately held business gets exactly $1 for every dollar of profit it earns. Meanwhile, a dollar of Tesla profit turns into $600 of stock market value; a dollar of Amazon profit turns into $67 of stock market value; a dollar of Google profit turns into $34, and so on. A company and its owners can extract massive amounts of value despite having no profits—value that can be withdrawn by those who own shares—essentially getting something for nothing.

And that, it turns out, is one underappreciated reason why, in the modern economy, the rich get richer and the poor get poorer. Rich and poor are actually living in two different economies, which operate by different rules. Most ordinary people live in a world where a dollar is a dollar. Most rich people live in a world of what financial pundit Jerry Goodman, writing under the pseudonym Adam Smith, called “supermoney,” where assets have been “financialized” (that is, made able to participate in the betting economy) and are valued today as if they were already delivering the decades’ worth of future earnings reflected in their stock price.

Whether you are an hourly worker or a small business owner, you live in the dollar economy. If you’re a Wall Street investor, an executive at a public company compensated with stock grants or options, a venture capitalist, or an entrepreneur lucky enough to win, place, or show in the financial market horse race, you live in the supermoney economy. You get a huge interest-free loan from the future.

Elon Musk has built not one but two world-changing companies (Tesla and SpaceX). He clearly deserves to be wealthy. As does Jeff Bezos, who quickly regained his title as the world’s wealthiest person. Bill Gates, Steve Jobs, Larry Page and Sergey Brin, Mark Zuckerberg, and many other billionaires changed our world and have been paid handsomely for it.

But how much is too much? When Bernie Sanders said that billionaires shouldn’t exist, Mark Zuckerberg agreed, saying, “On some level, no one deserves to have that much money.” He added, “I think if you do something that’s good, you get rewarded. But I do think some of the wealth that can be accumulated is unreasonable.” Silicon Valley was founded by individuals for whom hundreds of millions provided plenty of incentive! The notion that entrepreneurs will stop innovating if they aren’t rewarded with billions is a pernicious fantasy.

What to do about it

Taxing the rich and redistributing the proceeds might seem like it would solve the problem. After all, during the 1950s, ’60s, and ’70s, progressive income tax rates as high as 90% did a good job of redistributing wealth and creating a broad-based middle class. But we also need to put a brake on the betting economy that is creating so much phantom wealth by essentially letting one segment of society borrow from the future while another is stuck in an increasingly impoverished present.

Until we recognize the systemic role that supermoney plays in our economy, we will never make much of a dent in inequality. Simply raising taxes is a bit like sending out firefighters with hoses spraying water while another team is spraying gasoline.

The problem is that government policy is biased in favor of supermoney. The mandate for central bankers around the world is to keep growth rates up without triggering inflation. Since the 2009 financial crisis, they have tried to do this by “quantitative easing,” that is, flooding the world with money created out of nothing. This has kept interest rates low, which in theory should have sparked investment in the operating economy, funding jobs, factories, and infrastructure. But far too much of it went instead to the betting economy.

Stock markets have become so central to our imagined view of how the economy is doing that keeping stock prices going up even when companies are overvalued has become a central political talking point. Any government official whose policies cause the stock market to go down is considered to have failed. This leads to poor public policy as well as poor investment decisions by companies and individuals.

As Steven Pearlstein, Washington Post columnist and author of the book Moral Capitalism, put it in a 2020 column:

When the markets are buoyant, Fed officials claim that central bankers should never second-guess markets by declaring that there are financial bubbles that might need to be deflated. Markets on their own, they assure, will correct whatever excesses may develop.

But when bubbles burst or markets spiral downward, the Fed suddenly comes around to the idea that markets aren’t so rational and self-correcting and that it is the Fed’s job to second-guess them by lending copiously when nobody else will.

In essence, the Fed has adopted a strategy that works like a one-way ratchet, providing a floor for stock and bond prices but never a ceiling.

That’s the fire hose spraying gasoline. To turn it off, central banks should:

  • Raise interest rates, modestly at first, and more aggressively over time. Yes, this would quite possibly puncture the stock market bubble, but that could well be a good thing. If people can no longer make fortunes simply by betting that stocks will go up and instead have to make more reasonable assessments of the underlying value of their investments, the market will become better at allocating capital.
  • Alternatively, accept much larger increases in inflation. As Thomas Piketty explained in Capital in the Twenty-First Century, inflation is one of the prime forces that decreases inequality, reducing the value of existing assets and more importantly for the poor, reducing the value of debt and the payments paid to service it.
  • Target small business creation, hiring, and profitability in the operating economy rather than phantom valuation increases for stocks.

Tax policy also fans the fire. Taxes shape the economy in much the same way as Facebook’s algorithms shape its news feed. The debate about whether taxes as a whole should be higher or lower completely lacks nuance and so misses the point, especially in the US, where elites use their financial and political power to get favored treatment. Here are some ideas:

In general, we should treat not just illegal evasion but tax loopholes the way software companies treat zero-day exploits, as something to be fixed as soon as they are recognized, not years or decades later. Even better, stop building them into the system in the first place! Most loopholes are backdoors installed knowingly by our representatives on behalf of their benefactors.

Perhaps the most radical idea is that the tax system could and should become more dynamic rather than more predictable. Imagine if Facebook or Google were to tell us that they couldn’t change their algorithms to address misinformation or spam without upsetting their market and so had to leave abuses in place for decades in the interest of maintaining stability—we’d think they were shirking their duty. So too our policy makers. It’s high time we all recognize the market-shaping role of tax and monetary policy. If we can hold Facebook’s algorithms to account, why can’t we do the same for our government?

Our society and markets are getting the results the algorithm was designed for. Are they the results we actually want?


Communal Computing

O'Reilly Radar - Tue, 2021/06/15 - 04:27

Home assistants and smart displays are being sold in record numbers, but they are built wrong. They are designed with one person in mind: the owner. These technologies need to fit into the communal spaces where they are placed, like homes and offices. If they don’t fit, they will be unplugged and put away due to lack of trust.

The problems are subtle at first. Your Spotify playlist starts to have recommendations for songs you don’t like. You might see a photo you took on someone else’s digital frame. An Apple TV reminds you of a new episode of a show your partner watches. Guests are asking you to turn on your IoT-enabled lights for them. The wrong person’s name shows up in the Zoom call. Reminders for medication aren’t heard by the person taking the medication. Bank account balances are announced during a gathering of friends.

Would you want your bank account balances announced during a dinner party?

This is the start of a series discussing the design of communal devices: devices designed to work in communal spaces. The series is a call to action for everyone developing communal devices, whether you are creating business cases, designing experiences, or building technology, to take a step back and consider what is really needed.

This first article discusses what communal devices are, and how the problems that appear result from our assumptions about how they’re used. Those assumptions were inherited from the world of PCs: the rules that apply to your laptop or your iPad just don’t apply to home assistants and other “smart devices,” from light bulbs to refrigerators. The fix isn’t just adding the ability for people to switch accounts. We need a new paradigm for the future of technical infrastructure for our homes and offices. In this series of articles we will tell you how we got here, why it is problematic, and where to go to enable communal computing.

The Wrong Model

Problems with communal devices arise because the industry has focused on a specific model for how these devices are used: a single person buys, sets up, and uses the device. If you bought one of these devices (for example, a smart speaker) recently, how many other people in your household did you involve in setting it up?

Smart screen makers like Amazon and Google continue to make small changes to try to fix the weirdness. They have recently added technology to automatically personalize based on someone’s face or voice. These are temporary fixes that will only be effective until the next special case reveals itself. Until the industry recognizes the communal nature of users’ needs, they will just be short-lived patches. We need to turn the model around and make the devices communal first, rather than communal as an afterthought.

I recently left Facebook Reality Labs, where I was working on the Facebook Portal identity platform, and realized that there was zero discourse about this problem in the wider world of technology. I’ve read through many articles on how to create Alexa skills and attended talks about the use of IoT, and I’ve even made my own voice skills. There was no discussion of the communal impacts of those technologies. If we don’t address the problems this creates, these devices will be relegated to a small number of uses, or unplugged to make room for the next one. The problems were there, just beneath the shiny veneer of new technologies.

Communal began at home

Our home infrastructure was originally communal. Consider a bookcase: someone may have bought it, but anyone in the household could update it with new books or tchotchkes. Guests could walk up to browse the books you had there. It was meant to be shared with the house and those that had access to it.

The old landline in your kitchen is the original communal device.

Same for the old landline that was in the kitchen. When you called, you were calling a household. You didn’t know specifically who would pick up. Anyone who was part of that household could answer. We had protocols for getting the phone from the person who answered the call to the intended recipient. Whoever answered could either yell for someone to pick up the phone elsewhere in the home, or take a message. If the person answering the phone wasn’t a member of the household, it would be odd, and you’d immediately think “wrong number.”

It wasn’t until we had the user model for mainframe time sharing that we started to consider who was using a computer. This evolved into full login systems with passwords, password reset, two-factor authentication, biometric authentication, and more. As computers became more common, what made sense inside research and academic institutions was repurposed for the office.

In the 1980s and 1990s a lot of homes got their first personal computer. These were shared, communal devices, though more by neglect than by intention. A parent would purchase it and then set it up in the living room so everyone could use it. The account switching model wasn’t added until visual systems like Windows arrived, but account management was poorly designed and rarely used. Everyone just piggybacked on each other’s access. If anyone wanted privacy, they had to lock folders with a password or hide them in an endless hierarchy.

Early Attempts at Communal Computing

Xerox PARC started to think about what more communal or ubiquitous computing would mean. However, they focused on fast account switching. They were answering the question: how could I get the personal context to this communal device as fast as possible? One project was digitizing the whiteboard, a fundamentally communal device. It was called the Colab and offered a way for anyone to capture content in a meeting room and then walk it around the office to other shared boards.

Not only did the researchers at PARC think about sharing computers for presentations, they also wondered how they could have someone walk up to a computer and have it be configured for them automatically. It was enabled by special cards called “Active Badges,” described in “A New Location Technique for the Active Office.” The paper starts with an important realization:

“…researchers have begun to examine computers that would autonomously change their functionality based on observations of who or what was around them. By determining their context, using input from sensor systems distributed throughout the environment, computing devices could personalize themselves to their current user, adapt their behaviour according to their location, or react to their surroundings.”

Understanding the context around the device is very important in building a system that adapts. At this point, however, researchers were still thinking about a ‘current user’ and their position relative to the system, rather than the many people who could be nearby.

Even Bill Gates had communal technology in his futuristic home back then. He would give every guest a pin to put on their person that would allow them to personalize the lighting, temperature, and music as they went from room to room. Most of these technologies didn’t go anywhere, but they were an attempt at making the infrastructure around us adapt to the people who were in the space.  The term “ubiquitous computing” (also known as “pervasive computing”) was coined to discuss the installation of sensors around a space; the ideas behind ubiquitous computing later led to the Internet of Things (IoT).

Communal Computing Comes Home

When the late 2000s rolled around, we found that everyone wanted their own personal computing device, most likely an iPhone. The prevalence of smartphones and personal laptops killed the need for shared home PCs. The goal was to provide information and communication services conveniently wherever users happened to be, even if they were sitting together on the couch.

When the Amazon Echo with Alexa was released, Echos were sold to individuals with Amazon accounts, but they were clearly communal devices. Anyone could ask their Echo a question, and it would answer. That’s where the problem starts. Although the Echo is a communal device, its user model wasn’t significantly different from that of the early PCs: one account, one user, shared by everyone in the household. As a result, children mistakenly ordering items led Amazon to pull back some features that were focused on shopping. The Echo’s usage ended up being driven by music and weather.

With the wild success of the Echo and the proliferation of Alexa-enabled devices, a new device market for home assistants appeared, some just for audio and others with screens. Products from Apple (HomePod with Siri), Google (Home Hub), and Facebook (Portal) followed, along with less interactive devices like digital picture frames from Nixplay, Skylight, and others.

Ambient Computing

“Ambient computing” is a term that has been coined to describe digital devices blending into the infrastructure of the environment. A recent paper by Map Project Office focused on how “ambient tech brings the outside world into your home in new ways, where information isn’t being channelled solely through your smartphone but rather a series of devices.” We take a step back from screens and ask how the system itself becomes the environment.

The concept of ambient computing is related to marketing organizations’ focus on omnichannel experiences. Omnichannel reflects the fact that people don’t want to start and end experiences on the same device. I might start looking for travel on a smartphone but won’t feel comfortable booking a trip until I’m on a laptop. Different devices call for different information and experiences. When I worked at KAYAK, some people were afraid of buying $1,000 plane tickets on a mobile device, even though they found the tickets there. The small screen made them feel uncomfortable because they didn’t have enough information to make a decision. We found that they wanted to finalize the plans on the desktop.

Ambient computing takes this concept and combines voice-controlled interfaces with sensor interfaces, for example in devices like automatic shades that close or open based on the temperature. These devices are finding traction, but we can’t forget all of the other communal experiences that already exist in the world:

Device or object, and why it is communal:

  • Home automation and IoT (like light bulbs and thermostats): anyone with home access can use controls on the device, home assistants, or personal apps.
  • iRobot’s Roomba: people walking by can start or stop a cleaning through the “clean” or “home” buttons.
  • Video displays in office meeting rooms: employees and guests can use the screens for sharing their laptops and the video conferencing systems for calling.
  • Digital whiteboards: anyone with access can walk up and start writing.
  • Ticketing machines for public transport: all commuters buy and refill stored value cards without logging into an account.
  • Car center screens for entertainment: drivers (owners or borrowers) and passengers can change what they are listening to.
  • Smartphone when two people are watching a video: anyone in arm’s reach can pause playback.
  • Group chat on Slack or Discord: people are exchanging information and ideas in a way that is seen by everyone.

Even public transportation ticketing machines are communal devices.

All of these devices have experience models built around a specific, personal context; they rarely consider everyone who could have access to them. To rethink the way we build communal devices, it is important that we understand this history and refocus the design on key problems that are not yet solved for communal devices.

Problems with single user devices in the home

After buying a communal device, people notice weirdness or annoyances. These are symptoms of something much larger: core problems and key questions whose answers should have considered the role of communities rather than individuals. Here are some of those questions:

  1. Identity: do we know all of the people who are using the device?
  2. Privacy: are we exposing (or hiding) the right content for all of the people with access?
  3. Security: are we allowing all of the people using the device to do or see what they should and are we protecting the content from people that shouldn’t?
  4. Experience: what is the contextually appropriate display or next action?
  5. Ownership: who owns all of the data and services attached to the device that multiple people are using?

If we don’t address these communal items, users will lose trust in their devices. They will be used for a few key things like checking the weather, but go unused for a majority of the day. They are eventually removed when another, newer device needs the plug. Then the cycle starts again. The problems keep happening and the devices keep getting recycled.

In the following articles we will dive into how these problems manifest themselves across these domains and reframe the system with dos and don’ts for building communal devices.

Thanks

Thanks to Adam Thomas, Mark McCoy, Hugo Bowne-Anderson, and Danny Nou for their thoughts and edits on the early draft of this. Also, from O’Reilly, Mike Loukides for being a great editor and Susan Thompson for the art.


Code as Infrastructure

O'Reilly Radar - Tue, 2021/06/08 - 06:22

A few months ago, I was asked if there were any older technologies other than COBOL where we were in serious danger of running out of talent. They wanted me to talk about Fortran, but I didn’t take the bait. I don’t think there will be a critical shortage of Fortran programmers now or at any time in the future. But there’s a bigger question lurking behind Fortran and COBOL: what are the ingredients of a technology shortage? Why is running out of COBOL programmers a problem?

The answer, I think, is fairly simple. We always hear about the millions (if not billions) of lines of COBOL code running financial and government institutions, in many cases code that was written in the 1960s or 70s and hasn’t been touched since. That means that COBOL code is infrastructure we rely on, like roads and bridges. If a bridge collapses, or an interstate highway falls into disrepair, that’s a big problem. The same is true of the software running banks.

Fortran isn’t the same. Yes, the language was invented in 1957, two years earlier than COBOL. Yes, millions of lines of code have been written in it. (Probably billions, maybe even trillions.) However, Fortran and COBOL are used in fundamentally different ways. While Fortran was used to create infrastructure, software written in Fortran isn’t itself infrastructure. (There are some exceptions, but not at the scale of COBOL.) Fortran is used to solve specific problems in engineering and science. Nobody cares anymore about the Fortran code written in the 60s, 70s, and 80s to design new bridges and cars. Fortran is still heavily used in engineering—but that old code has retired. Those older tools have been reworked and replaced.  Libraries for linear algebra are still important (LAPACK), some modeling applications are still in use (NEC4, used to design antennas), and even some important libraries used primarily by other languages (the Python machine learning library scikit-learn calls both NumPy and SciPy, which in turn call LAPACK and other low level mathematical libraries written in Fortran and C). But if all the world’s Fortran programmers were to magically disappear, these libraries and applications could be rebuilt fairly quickly in modern languages—many of which already have excellent libraries for linear algebra and machine learning. The continued maintenance of Fortran libraries that are used primarily by Fortran programmers is, almost by definition, not a problem.
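
As a small illustration of that layering, here is a minimal sketch, assuming NumPy and SciPy are installed: NumPy’s high-level solver and SciPy’s low-level LAPACK wrapper both end up in the same Fortran routine (dgesv). The matrix values are arbitrary.

    # Everyday Python numerics still bottom out in Fortran: numpy.linalg.solve
    # and scipy.linalg.lapack.dgesv both call LAPACK's dgesv routine.
    import numpy as np
    from scipy.linalg import lapack

    a = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([[9.0],
                  [8.0]])

    x_high_level = np.linalg.solve(a, b)             # LAPACK via NumPy's high-level API
    lu, piv, x_low_level, info = lapack.dgesv(a, b)  # the same LAPACK routine, called directly

    print(x_high_level.ravel())  # [2. 3.]
    print(x_low_level.ravel())   # [2. 3.]; info == 0 means dgesv succeeded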

If shortages of COBOL programmers are a problem because COBOL code is infrastructure, and if we don’t expect shortages of Fortran talent to be a problem because Fortran code isn’t infrastructure, where should we expect to find future crises? What other shortages might occur?

When you look at the problem this way, it’s a no-brainer. For the past 15 years or so, we’ve been using the slogan “infrastructure as code.” So what’s the code that creates the infrastructure? Some of it is written in languages like Python and Perl. I don’t think that’s where shortages will appear. But what about the configuration files for the systems that manage our complex distributed applications? Those configuration files are code, too, and should be managed as such.

Right now, companies are moving applications to the cloud en masse. In addition to simple lift and shift, they’re refactoring monolithic applications into systems of microservices, frequently orchestrated by Kubernetes. Microservices in some form will probably be the dominant architectural style for the foreseeable future (where “foreseeable” means at least 3 years, but probably not 20). The microservices themselves will be written in Java, Python, C++, Rust, whatever; these languages all have a lot of life left in them.

But it’s a safe bet that many of these systems will still be running 20 or 30 years from now; they’re the next generation’s “legacy apps.” The infrastructure they run on will be managed by Kubernetes—which may well be replaced by something simpler (or just more stylish). And that’s where I see the potential for a shortage—not now, but 10 or 20 years from now. Kubernetes configuration is complex, a distinct specialty in its own right. If Kubernetes is replaced by something simpler (which I think is inevitable), who will maintain the infrastructure that already relies on it? What happens when learning Kubernetes isn’t the ticket to the next job or promotion? The YAML files that configure Kubernetes aren’t a Turing-complete programming language like Python; but they are code. The number of people who understand how to work with that code will inevitably dwindle, and may eventually become a “dying breed.” When that happens, who will maintain the infrastructure? Programming languages have lifetimes measured in decades; popular infrastructure tools don’t stick around that long.

It’s not my intent to prophesy disaster or gloom. Nor is it my intention to critique Kubernetes; it’s just one example of a tool that has become critical infrastructure, and if we want to understand where talent shortages might arise, I’d look at critical infrastructure. Who’s maintaining the software we can’t afford not to run? If it’s not Kubernetes, it’s likely to be something else. Who maintains the CI/CD pipelines? What happens when Jenkins, CircleCI, and their relatives have been superseded? Who maintains the source archives?  What happens when git is a legacy technology?

Infrastructure as code: that’s a great way to build systems. It reflects a lot of hard lessons from the 1980s and 90s about how to build, deploy, and operate mission-critical software. But it’s also a warning: know where your infrastructure is, and ensure that you have the talent to maintain it.


June 10th Virtual Meeting

PLUG - Mon, 2021/06/07 - 09:28
We've got 2 presentations for you this month, on Containers and The Fediverse.

Attend the meeting by visiting: https://lufthans.bigbluemeeting.com/b/plu-yuk-7xx

Sebastian: Putting Containers on Waiters

Description:
Containers have been around for a long time, though they have recently become more popular. Sebastian will go into Fedora's main container management software, Podman; discuss the uses of containers and how they can improve an organization's security and reliability; downsides to using Docker containers; and where many security issues are found in the space.
Over the course of this presentation, nginx, apache, and nextcloud servers will be built for examples. Maybe others, depending on audience desires.
This is a demo presentation, so slides are written as notes for people afterward.

About Sebastian:
By day, Sebastian works for the Arizona Counter Terrorism Information Center, in which he uses Linux and Open-Source software to accomplish the organization's tasks.
By night, Sebastian... does the same thing. He works on servers for home use and researches what open-source software people can generally use, as well as how to improve the processes for their construction and maintenance.



Bob Murphy: An Introduction to the Fediverse

Description:
The Fediverse is a collection of communities that is a bit of a throwback to a smaller, more personal time on the internet. There are services for short messaging, audio and video sharing, and event organizing, among other things.

We'll talk a bit about the origin, and the present state of the Fediverse, and some of the services that you can use to have a more reasonable social media experience.

About Bob:
Linux sysadmin, long time desktop Linux user. Not the most social "social media" user.

Radar trends to watch: June 2021

O'Reilly Radar - Tue, 2021/06/01 - 06:45

The most fascinating idea this month is POET, a self-enclosed system in which bots that are part of the system overcome obstacles that are generated by the system. It’s a learning feedback loop that might conceivably be a route to much more powerful AI, if not general intelligence.

It’s also worth noting the large number of entries under security. Of course, security is a field lots of people talk about, but nobody ends up doing much. Is the attack against the Colonial pipeline going to change anything? We’ll see. And there’s one trend that’s notably absent. I didn’t include anything on cryptocurrency. That’s because, as far as I can tell, there’s no new technology; just a spike (and collapse) in the prices of the major currencies. If anything, it demonstrates how easily these currencies are manipulated.

AI
  • Using AI to create AI: POET is a completely automated virtual world in which software bots learn to navigate an obstacle course.  The navigation problems themselves are created by the world, in response to its evaluation of the robots’ performance. It’s a closed loop. Is it evolving towards general intelligence?
  • IBM is working on using AI to write software, focusing on code translation (e.g., COBOL to Java). They have released CodeNet, a database of 14 million samples of source code in many different programming languages. CodeNet is designed to train deep learning systems for software development tasks. Microsoft is getting into the game, too, with GPT-3.
  • Vertex AI is a “managed machine learning platform” that includes most of the tools developers need to train, deploy, and maintain models in an automated way. It claims to reduce the amount of code developers need to write by 80%.
  • Google announces LaMDA, a natural language model at GPT-3 scale that was trained specifically on dialog. Because it was trained on dialog rather than unrelated text, it can participate more naturally in conversations and appears to have a sense of context.
  • Automated data cleaning is a trend we started watching a few years ago with Snorkel. Now MIT has developed a tool that uses probabilistic programming to fix errors and omissions in data tables.
  • AI is becoming an important tool in product development, supplementing and extending the work of engineers designing complex systems. This may lead to a revolution in CAD tools that can predict and optimize a design’s performance.
  • Designing distrust into AI systems: Ayanna Howard is researching the trust people place in AI systems, and unsurprisingly, finding that people trust AI systems too much. Tesla accidents are only one symptom. How do you build systems that are designed not to be perceived as trustworthy?
  • Important lessons in language equity: While automated translation is often seen as a quick cure for supporting non-English speaking ethnic groups, low quality automated translations are a problem for medical care, voting, and many other systems. It is also hard to identify misinformation when posts are translated badly, leaving minorities vulnerable.
  • Andrew Ng has been talking about the difference between putting AI into production and getting it to work in the lab. That’s the biggest hurdle AI faces on the road to more widespread adoption. We’ve been saying for some time that it’s the unacknowledged elephant in the room.
  • According to The New Stack, the time needed to deploy a model has increased year over year, and at 38% of the companies surveyed, data scientists spend over half of their time in deployment. These numbers increase with the number of models.
Data
  • Collective data rights are central to privacy, and are rarely discussed. It’s easy, but misleading, to focus discussions on individual privacy, but the real problems and harms stem from group data. Whether Amazon knows your shoe size doesn’t really matter; what does matter is whether they can predict what large groups want, and force other vendors out of the market.
  • Mike Driscoll has been talking about the stack for Operational Intelligence. OI isn’t the same as BI; it’s about a real time understanding of the infrastructure that makes the business work, rather than day to day understanding of sales data and other financial metrics.
  • Deploying databases within containerized applications has long been difficult. DataStax and other companies have been evolving databases to work well inside containers. This article is  primarily about Cassandra and K8ssandra, but as applications move into the cloud, all databases will need to change.
Programming
  • Software developers are beginning to think seriously about making software sustainable. Microsoft, Accenture, Github, and Thoughtworks have created the Green Software Foundation, which is dedicated to reducing the carbon footprint required to build and run software. O’Reilly Media will be running an online conversation about cloud providers and sustainability.
  • Google has released a new open source operating system, Fuchsia, currently used only in its Home Hub.  Fuchsia is one of the few recent operating systems that isn’t Linux-based. Application programming is based on Flutter, and the OS is designed to be “invisible.”
  • A service mesh without proxies is a big step forward for building applications with microservices; it simplifies one of the most difficult aspects of coordinating services that are working together.
  • As much as they hate the term, unqork may be a serious contender for enterprise low-code. They are less interested in democratization and “citizen developers” than in making professional software developers more efficient.
  • The evolution of JAMstack: distributed rendering, streaming updates, and extending collaboration to non-developers.
  • Grain is a new programming language designed to target Web Assembly (wasm). It is strongly typed and, while not strictly functional, has a number of features from functional languages.
  • Grafar and Observable Plot are new JavaScript libraries for browser-based data visualization. Observable Plot was created by Mike Bostock, the author of the widely used D3 library.
Security
  • Morpheus is a microprocessor that randomly changes its architecture to foil attackers: This is a fascinating idea. In a 3-month long trial, 525 attackers were unable to crack it.
  • Self-sovereign identity combines decentralized identifiers with verifiable credentials that can be stored on devices. Credentials are answers to yes/no questions (for example, has the user been vaccinated for COVID-19).
  • A WiFi attack (now patched) against Teslas via the infotainment system doesn’t yield control of the car, but can take over everything that the infotainment system controls, including opening doors and changing seat positions. Clearly the infotainment system controls too much. Other auto makers are believed to use the same software in their cars.
  • Passphrases offer better protection than complex passwords with complex rules. This has been widely known for several years now. The important question is why companies aren’t doing anything about it. We know all too well that passwords are ineffective, and that forcing users to change passwords is an anti-pattern.
  • Fawkes and other tools for defeating face recognition work by adding small perturbations that confuse the algorithms. For the moment, at least. Face recognition systems already appear to be catching up.
  • Tracking phishing sites has always been a problem. Phish.report is a new service for reporting phishing sites, and notifying services that flag phishing sites.
Web and Social Media
  • Ben Evans has a great discussion of online advertising and customer acquisition in a post-Cookie world.
  • Models from epidemiology and the spread of viruses can be used to understand the spread of misinformation. The way disease spreads and the way misinformation spreads turn out to be surprisingly similar.
  • Google brings back RSS in Chrome? The implementation sounds awkward, and there have always been decent RSS readers around. But Google has clearly decided that they can’t kill it off–or that they don’t want web publishing to become even more centralized.
  • Video editing is exploding: YouTube has made that old news.  But it’s set to explode again, with new tools, new users, and increased desire for professional quality video on social media.
  • New York has passed a law requiring ISPs to provide broadband to poor families for $15/month. This provides 25 Mbps downloads; low-income households can get high-speed broadband for $20/month.
Hardware
  • Google, Apple, and Amazon back Matter, a standard for interoperability between smart home devices. A standard for interoperability is important, because nobody wants a “smart phone” where every appliance, from individual light bulbs to the media players, requires a separate app.
  • Moore’s law isn’t dead yet: IBM has developed 2 nanometer chip technology; the best widely used technology is currently 7nm. This technology promises lower power consumption and faster speeds.
  • Google plans to build a commercially viable error-corrected quantum computer by 2029. Error correction is the hard part. That will require on the order of 1 million physical qubits; current quantum computers have under 100 qubits.
Biology
  • The photo is really in bad taste, but researchers have developed a medical sensor chip so small that Bill Gates could actually put it into your vaccine! It’s powered by ultrasound, and uses ultrasound to transmit data.
  • With sensors implanted in his brain, a paralyzed man was able to “type” by imagining writing. AI decoded signals in his brain related to the intention to write (not the actual signals to his muscles). He was able to “type” at roughly 15 words per minute with a 5% error rate.