Nine secrets of statistics and a poke at web scale



My comment on ‘Nine secrets you should have been taught as part of your undergraduate statistics degree’ follows.

Since you asked, I have a few additional suggestions.

For statistics students

Even if your interest is in mathematical statistics, do take at least one course in observational methods. Statistics for sociologists might seem tedious; it did to me! But it is sufficiently different, e.g. Chi-squared tests, SPSS, that you’ll be glad, even years later, to have had some exposure to it.

Secret Six is excellent advice. Statistics and probability theory give you a cabinet of analytic tools. In the workplace, you’ll have the freedom and the responsibility to decide which inference test or model is best, given the problem and available data. It is fun and exciting!

While reading that entry from the Opinion section of StatsLife, a pleasingly casual publication of The Royal Statistical Society, I noticed that it referenced another helpful list, 10 Secrets You Should Have Learned with Your Software Engineering Degree – But Probably Didn’t. Given the spirited and seemingly interminable debate about NoSQL versus traditional relational database management systems, I found Secret Five of Software Engineering amusing and ironic.


‘All the SQL I know I learned on the job. Why are databases an elective? What doesn’t use a database?’

The era of storing data in flat files is over. Everything goes into and out of a database. SQL is used to retrieve it. SQL is also a declarative language, not a procedural language, and so requires learning a new way of thinking about problem solving. Every programmer should understand the basics of database normalization and SELECTs, including basic INNER and OUTER JOINs, INSERTs, UPDATEs and DELETEs.
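The basics that quote lists can be sketched with Python’s built-in sqlite3 module. The two-table schema and its rows here are invented purely for illustration:

```python
import sqlite3

# Hypothetical normalized schema: authors and posts in separate tables,
# linked by a foreign key rather than duplicating the author's name per post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        author_id INTEGER REFERENCES authors(id),
                        title TEXT);
    INSERT INTO authors VALUES (1, 'Ellie');
    INSERT INTO posts VALUES (10, 1, 'Nine secrets of statistics');
""")

# Declarative, not procedural: we describe WHAT rows we want,
# not HOW to loop over the tables to fetch them.
rows = conn.execute("""
    SELECT authors.name, posts.title
    FROM posts
    INNER JOIN authors ON posts.author_id = authors.id
""").fetchall()
print(rows)  # [('Ellie', 'Nine secrets of statistics')]

# UPDATE and DELETE round out the basics.
conn.execute("UPDATE posts SET title = 'Nine secrets' WHERE id = 10")
conn.execute("DELETE FROM posts WHERE id = 10")
conn.close()
```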


In the NoSQL ecosystem, MongoDB is particularly famous for its speed, and rather less famous for what it omits, e.g. no schema, no JOINs. In that spirit, I have included the infamous “MongoDB is Web Scale” video. It underscores the importance of several entries in the “secrets of software engineering education” post! To reinforce the concepts, the video’s creator devoted an entire website, mongodb-is-web-scale, to the transcript and back story.

Warning! Some NSFW language; visual is safe for all ages

What IS web scale?

There even seems to be some confusion over on Server Fault; see MongoDB: What does web scale mean?

I’ll try to be serious for a moment. MongoDB is touted as a document database, not a table database. Also, remember that NoSQL is often used as a general term for a non-relational database. NoSQL means “not only SQL” rather than “absolutely no SQL is allowed”!

Ignore data, focus on power


Ellie K:

Speaking truth to power is not what is going on with open data, although that was my understanding when I first learned about open data, about four years ago. I am more cynical now, apparently with good cause.

What IS open data?

Here’s a definition, straight from the source, which is ultimately the Open Knowledge Foundation (OKF), one of whose major funders is Pierre Omidyar of eBay, First Look/The Intercept and Glenn Greenwald fame. As defined by the Open Definition, open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and share alike. (What is the “Open Definition”? It is another OKF project.) The key tenets are as follows:

  1. Availability and Access: the data must be available in its entirety, in a “convenient” format and at minimal cost, e.g. internet download
  2. Reuse and Redistribution: users of the data must be permitted to reuse and redistribute it.
  3. Universal Participation: everyone must be able to use, reuse and redistribute it, e.g. ‘non-commercial’ restrictions preventing ‘commercial’ use, or for educational purposes only, are not allowed.

The higher level goal is interoperability. For open data, it mostly means being able to intermix data sets from multiple providers.
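A minimal sketch of what intermixing means in practice, assuming two hypothetical providers that publish CSV keyed on the same country code (the figures are invented). The shared identifier is what makes the sets interoperable:

```python
import csv
import io

# Two hypothetical open-data providers, each publishing an independent CSV
# keyed on the same ISO country code.
provider_a = io.StringIO("country,population\nNL,17500000\nFR,68000000\n")
provider_b = io.StringIO("country,gdp_usd_bn\nNL,1010\nFR,2780\n")

pop = {r["country"]: int(r["population"]) for r in csv.DictReader(provider_a)}
gdp = {r["country"]: float(r["gdp_usd_bn"]) for r in csv.DictReader(provider_b)}

# Intermix the two sets: GDP per capita, derivable only by joining
# both providers' data on the shared key.
gdp_per_capita = {c: gdp[c] * 1e9 / pop[c] for c in pop.keys() & gdp.keys()}
```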

The scope of the Open Knowledge Foundation’s openness initiative is quite broad. It includes, but is not limited to, open access, open data, open education, open science, open government, open licenses and open software.

Government? Licenses? Software?! What’s up with that? What about FOSS, GNU, FSF, GPL and Creative Commons? I don’t know. Maybe this OKF post about the Open Software Service Definition will be helpful, insofar as it pertains to online software.

The limits of transparency

Open data isn’t a direct cure for injustice. It was not intended to be subversive; for some kinds of government data, national security restrictions still apply. Similarly, the focus of open data was NOT meant to be personal information.

More data is not resulting in greater transparency nor any of the beneficial aspects of insight into the workings of power.

Originally posted on mathbabe:

I get asked pretty often whether I “believe” in open data. I tend to murmur a response along the lines of “it depends,” which doesn’t seem too satisfying to me or to the person I’m talking to. But this morning, I’m happy to say, I’ve finally come up with a kind of rule, which isn’t universal. It focuses on power.

Namely, I like data that shines light on powerful people. Like the Sunlight Foundation tracks money and politicians, and that’s good. But I tend to want to protect powerless people, like people who are being surveilled with sensors and their phones. And the thing is, most of the open data focuses on the latter. How people ride the subway or how they use the public park or where they shop.

Something in the middle is crime data, where you have compilation of people being stopped by the police (powerless) and…


Subverting computing research for fool’s gold


New bitcoins are released on a fixed schedule, no matter how much computing power is applied. Supply cannot expand to meet demand, so ever higher prices are paid, with ever more computing power, i.e. more electricity, chasing the same reward.
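That fixed schedule is easy to sketch: the block reward starts at 50 BTC and halves every 210,000 blocks, which caps total issuance just under 21 million coins. (The real protocol rounds the reward down to whole satoshis; this sketch uses floats for brevity.)

```python
def total_supply():
    """Sum the geometric issuance schedule: 210,000 blocks per reward era,
    with the per-block reward halving each era, starting at 50 BTC."""
    reward, supply = 50.0, 0.0
    while reward >= 1e-8:  # below one satoshi, issuance effectively ends
        supply += 210_000 * reward
        reward /= 2
    return supply

print(total_supply())  # just under 21 million BTC
```

Note that nothing in the schedule depends on hashing power; more miners only raise the difficulty (and the electricity bill), never the supply.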


Virtual money, real consumption

Chart via Virtual money, real consumption by Lucio BRAGAGNOLO. The following is a wretched, ad hoc Google Translate from Italian to English. Better to go read the original article, Denaro virtuale, consumi concreti.

We must solve two basic problems: who guarantees, in case of adverse events, since there is no central bank; and most importantly, how and why we should all pay the price of the financial adventures of a few. The really underrated aspect of Bitcoin is that the mechanism is designed to increase the computational power required as the value of the currency increases. Currently there are mechanisms that encourage excessive enthusiasm, but none to call out wrongdoers…

Next, via The cost of bitcoin:

Recall the magic that makes Bitcoin profound: scores of independent computers all over the world running at full speed in the hope of capturing new Bitcoin, and in the process verifying transactions for free. Those computers need power, and that power needs to be generated. True, whoever owns the servers is paying a huge electricity bill… Moreover, the design of Bitcoin guarantees that electrical consumption increases dramatically, indefinitely.

Perverse incentives: high performance computing

What makes Bitcoin so clever is how it assumes self-interest and uses incentives. All those virtuous, decentralized, distributed miners… Let’s put aside the fact that ASIC mining rigs with sufficient processing power to mine bitcoin (due to the more advanced state of the blockchain; first mover’s advantage) are now priced in the tens of thousands of dollars. Rather, the more general problem, one that always lurks in economic and mathematical models, is the negative externality. Fraud, as in the case of Mt. Gox, is one example.

Sadly, we now have another: US national agency computers misused to mine bitcoins, via the BBC.

Wisdom of the Cloud



Is it easier to secure the cloud?

On 7 Nov 2011, senior Defense Department officials and IT industry experts met in Arlington, VA to discuss how to better protect military and commercial cyberspace. At that time, the director of DARPA said that 2004 was the first year that proceeds from cyber crime activities were greater than those from illegal drug sales.

Army Gen. Keith Alexander, commander of U.S. Cyber Command and director of the National Security Agency, said that the Defense Department is looking at cloud computing platforms. In cloud computing, remote servers are used to store data. “It’s easier to secure the cloud and it’s cheaper,” Gen. Alexander said, noting potential savings of 30%.

 — DOD, Industry Address ‘Intense Challenge’ of Cyber Security

On the wisdom of a DoD transition to the cloud

The article said, “Another change that would upgrade the military’s cyber defense and save money is adopting cloud computing platforms. It’s easier to secure the cloud…”

Please be careful about reliance on cloud computing! The cloud is cheaper. That’s great. There are probably other benefits, for example, better performance and improved access in the field. The field could be any remote location, say, Antarctica, or underwater, not just the battle field! But there’s nothing as safe and secure as a server and processor accessed over dedicated lines, no internet connectivity, with people on location controlling physical access 24/7, and all of it ring-fenced by, well, fences! The Centers for Disease Control and Hoover Dam operate under that paradigm or similar, as stated on each entity’s public-facing website. Shouldn’t the NSA, CIA and DOD too?  If you transition to cloud computing, test it thoroughly. Thank you for allowing me to share my concerns and opinions.
— Ellie Kesselman, Arizona, U.S.A. 11/8/2011 5:48:26 AM


My hesitancy about the wisdom of relying on vendor cloud computing has increased since then. I am not certain that it is easier to secure the cloud. I fear that reliance on contractors, facilitated by FedRAMP, is likely to cost us dearly in the long run.

What you do not know is what matters



Uncertainty is when there is little consistent data, probability distributions are unknown and statistical analysis must be replaced by intuition and instinct.

Rumsfeld’s risk categorization grid

There are known knowns; there are things we know…there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know.

 — It’s What You Don’t Know That Matters, David Blitzer, S&P

Fed Reserve issues monthly update

FOMC tapers to $45 Bil per month

Former U.S. Secretary of Defense Donald Rumsfeld omitted one important case: the unknown knowns. These are boundary values that are believed to hold with absolute certainty. Sometimes they don’t. Economic models and financial markets do not operate under the same binding structural constraints as physical systems.

These are the truths that aren’t…


