Statistical Modeling, Causal Inference, and Social Science

In her book, Telling It Like It Wasn't: The Counterfactual Imagination in History and Fiction, Catherine Gallagher usefully distinguishes between three sorts of historical speculation:

1. Counterfactual histories, which are generally analytical rather than narrative and indicate multiple possibilities that went unrealized, rather than tracing out single historical alternative trajectories in detail.

2. Alternate histories, which describe one continuous sequence of departures from the historical record, thereby inventing a long counterfactual narrative with a correspondingly divergent fictional world, while drawing the dramatis personae exclusively from the actual historical record.

3. Alternate-history novels, which invent not only alternative-history trajectories but also fictional characters . . . presenting in detail the social, cultural, technological, psychological, and emotional totalities that result from the alterations.

A few years ago we considered (also here) Niall Ferguson (in his pre-John Yoo phase), who edited a book of counterfactual histories and wrote a thoughtful essay on the subject. And around that time I also expressed my view that a common feature of the best alternate-history novels is that the alternate world is not real, in the context of the stories themselves:

The Man in the High Castle by Philip K. Dick. In this book, which takes place in a world in which the Allies lost World War II, hints keep peeking through that the world inhabited by the characters is not reality. Our world is real, and the novel's characters are living in a fake world (which is imperfectly perceived by the title character, who is thus so dangerous to those in power). It's a more complex twist on the theme of Time Out of Joint, but ultimately the same idea: the people in the novel are living in a fake world which can come apart around them as they recognize that it is a shared illusion. Sort of like The Matrix in reverse.
It is a standard theme that our world is fake, that there is an underlying truth, etc. Dick turns this around. (Actually, I've never seen The Matrix, but this is what I'm imagining it's about.)

Pavane by Keith Roberts. In this classic, the Catholics regained control of England in the 1500s, leading to a much different twentieth-century world. The backstory, eventually revealed in the novel, is that the masters of our real world had seen the risks of nuclear weapons and had rerun history to give humankind an opportunity to develop without modern science and thus get some more time to figure things out before having to deal with potential species-ending warfare.

Bring the Jubilee by Ward Moore, which describes a United States in which the Confederates had won the Battle of Gettysburg and then the Civil War. In this one, the pattern of Pavane is reversed, sort of, in that the original world was the one described in most of the novel (the alternative history), but then, through some time-traveling mishaps, the story ends up in our reality.

In discussing these examples, I argued that the thrill or interest of alternate-history novels comes from playing off the fact that our world is the real one. We also discussed this point here and here.

Gallagher's book didn't bring up this particular issue of the role of the real world in alternate-history fiction, but it contained lots of other interesting ideas, and I recommend it. Among other things, Gallagher considers why certain alternative scenarios seem to have such strong appeal. She discusses how the scenario of Nazi-occupied Britain has been so popular, even though, according to historians, this was never gonna happen: apparently the Germans never even had a serious invasion plan.

In 2016, by coincidence (I assume), two books with similar titles and similar topics came out at the same time: The Underground Railroad by Colson Whitehead and Underground Airlines by Ben Winters.
I remember reading the reviews when they came out, but it's only recently that I read the books themselves. I liked them both a lot.

The two books are both alternate-history novels about American slavery, but they have some differences. As the titles suggest, Railroad takes place in the past (or, I guess I should say, the "past," as it's an alternative version with some fantastical or steampunk elements), whereas Airlines takes place in an alternative present. Also, Whitehead is a literary writer and Winters is a genre (mystery and science fiction) writer, and this is reflected in their styles. Both books have a lot of plot, but Airlines is more plot-driven.

Anyway, both these books, in addition to being readable, thought-provoking, memorable, and funny (they both had great deadpan humor), had this feature that the real world is what's real.

In Railroad this came about because the fantastical elements stood out from the rest of the story: the very ridiculousness of the railroad setup was a reminder of the groundedness of real life. Unlike the sort of alternate history that tries to convince you that, yes, it really could've happened this way, Railroad introduces an implausible foreground in order to render the background more plausible.

Airlines is different: there, the alternative world is treated more realistically, with all sorts of little details that both connect the story to the real world and emphasize the differences, but the running joke of the novel is how this alternative world has the same sort of racism and racial inequality we see in the modern-day United States. The message, then, is that this is who we are: this aspect of the real world is so real that even a massive change in purportedly pivotal historical events does not change it.

P.S. Here are good reviews of the three books mentioned above: Telling It Like It Wasn't, reviewed by Michael Wood; The Underground Railroad, reviewed by Jay Nordlinger; and Underground Airlines, reviewed by Laura Miller.
Two months ago, our Google Summer of Code (GSoC) intern Neel Shah wrote:

Over the summer, I [Shah] will add LambertW transforms to Stan which enable us to model skewed and heavy-tailed data as approximate normals. . . .

Though the normal distribution is one of our go-to tools for modeling, the real world often generates observations that are inconsistent with it. The data might appear asymmetric around a central value as opposed to bell-shaped, or have extreme values that would be discounted under a normality assumption. When we can't assume normality, we often have to roll up our sleeves and delve into a more complex model. But by using LambertW × Z random variables, it is possible for us to model the skewness and kurtosis from the data. Then we continue with our model as if we had a normal distribution. Later, we can back-transform predictions to account for our skewness and kurtosis.

In the first part, we introduce the LambertW function, also known as the product logarithm. Next, we discuss skewness and kurtosis (measures of asymmetry and heavy-tailedness), define the LambertW × Z random variables, and share our implementation plans. Finally, we demonstrate how LambertW transforms can be used for location-hypothesis testing with Cauchy-simulated data.

To simplify matters, we are focusing on the case of skewed and/or heavy-tailed probabilistic systems driven by Gaussian random variables. However, the LambertW function can also be used to back-transform non-Gaussian latent input. Because Stan allows us to sample from arbitrary distributions, we anticipate that LambertW transforms would naturally fit into many workflows.

And he did some stuff! Here's his progress report. Good to see work getting done.

A few months ago I (Jessica) wrote about how the Census Bureau is applying differential privacy (DP) to 2020 results, and has been sued for it by Alabama. Prior to this, data-swapping techniques were used.
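As an aside on the LambertW post above: the back-transform Shah describes, which treats heavy-tailed observations as a transformed Gaussian, can be sketched with SciPy. This is a minimal illustration of the standard heavy-tail Lambert W × Gaussian parameterization, not the eventual Stan implementation; the function name and parameter values are mine.

```python
import numpy as np
from scipy.special import lambertw

def gaussianize(y, mu, sigma, delta):
    """Invert the heavy-tail transform y = u * exp(delta/2 * u^2) * sigma + mu,
    u ~ N(0, 1), using the principal branch of the Lambert W function
    (W(x * e^x) = x), recovering the latent approximately-normal u."""
    z = (y - mu) / sigma
    return np.sign(z) * np.sqrt(lambertw(delta * z**2).real / delta)

rng = np.random.default_rng(0)
u = rng.standard_normal(10_000)                    # latent Gaussian input
delta, mu, sigma = 0.2, 3.0, 2.0                   # delta > 0 fattens the tails
y = u * np.exp(0.5 * delta * u**2) * sigma + mu    # simulated heavy-tailed data

u_hat = gaussianize(y, mu, sigma, delta)
print(np.allclose(u_hat, u))                       # exact inverse, up to rounding
```

After gaussianizing, one would fit a normal model to u_hat and back-transform predictions, as the quoted post describes.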
The (controversial) switch to using differential privacy was motivated by database reconstruction attacks that the Census simulated using published 2010 Census tables. I didn't go into detail about the attacks in my last post, but since their adoption of DP remains controversial and a new paper critiques the value of the information that their database reconstruction experiment provides, it's a good time to discuss.

For their reconstruction experiment (see Appendix B here), the Census took nine tables from 2010 results, reporting 6.2 billion of a total of 150 billion statistics in the 2010 Census. They used the tables to infer a system of equations for sex, age, race, Hispanic/Latino ethnicity, and Census block variables, then solved it using linear programming. Because the swapping method used in 2010 required the total and voting-age populations to not vary at the block level, they were able to reconstruct over 300 million records for block location and voting age (18+). Next they used the data on race (63 categories), Hispanic/Latino origin, sex, and age (in years) from the 2010 tables to reconstruct individual-level records containing those variables.

They calculated their accuracy using two internal (confidential) Census files: the Census Edited File (CEF, the confidential data) and the Hundred-percent Detail File (HDF, the confidential swapped individual-level data before tabulation). The table shows the results of the join, where "Exact" means an exact match on all five variables (block, race, Hispanic/Latino origin, sex, age in years), "Fuzzy age" means age agreed within one year, and "one error" means one variable other than block (which they started with) was wrong (for age, meaning off by more than one year).

From here they simulated a re-identification attack using commercial data containing name, address, sex, and birthdate that they purportedly acquired around the time of the 2010 Census.
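To make the tables-to-records step above concrete, here is a toy version of database reconstruction: given only published marginal tables for a tiny block, enumerate the record sets consistent with all of them. Everything here (the block of three people, the tables, and the brute-force search standing in for the Census' linear programming) is an invented simplification.

```python
from itertools import product

sexes = ["M", "F"]
age_groups = ["under18", "18plus"]

# Hypothetical published tables for one block of 3 people
published = {
    "total": 3,
    "sex": {"M": 2, "F": 1},
    "age": {"under18": 1, "18plus": 2},
    "sex_x_age": {("M", "18plus"): 2, ("F", "under18"): 1},
}

# Enumerate every possible assignment of (sex, age group) to the 3 residents
# and keep those consistent with every published marginal.
candidates = []
for records in product(product(sexes, age_groups), repeat=published["total"]):
    ok = (
        all(sum(r[0] == s for r in records) == n for s, n in published["sex"].items())
        and all(sum(r[1] == a for r in records) == n for a, n in published["age"].items())
        and all(sum(r == cell for r in records) == n
                for cell, n in published["sex_x_age"].items())
    )
    if ok:
        candidates.append(tuple(sorted(records)))

print(set(candidates))  # only one multiset survives: the block is exactly reconstructed
```

With enough tables, the feasible set collapses to a single solution; the real attack does this at scale with linear programming rather than enumeration.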
After converting the names and addresses in the commercial database to the corresponding Census key (PIK) and matching each address to a Census block, they do a form of greedy matching consisting of two loops through the reconstructed data: on a first pass they take the first record in the commercial data that matches exactly on block, sex, and age, and then on a second pass they try to match the remaining unmatched records in the reconstructed data, taking the first exact match on block and sex with age matching within +/-1 year. The successful matches from both passes are the putative re-identifications (not yet confirmed as correct), which link the reconstructed data on block, sex, age, race, and ethnicity to a name and address.

This process identified 138 million putative re-identifications (45% of the 2010 Census resident population of the U.S.), 52 million of which were confirmed to indeed have matches in the confidential Census data (17% of the 2010 Census resident population). These figures are suggested to be conservative because the commercial data used were what was obtainable at the time, so if an attacker had access to better-quality data that the Census wasn't able to purchase in 2010, they might do better. To get worst-case stats, the Census can pretend the attacker has the most accurate known information, namely the names and addresses from the confidential CEF file. Under this assumption, the reconstructed data would produce 238 million putative re-identifications and roughly 180 million confirmed re-identifications, or 58% of the 2010 Census resident population.

Is this alarming?
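The two-pass greedy match described above can be sketched as follows; the record layout and field names are invented, and a real implementation would index by block rather than scanning linearly.

```python
def greedy_match(reconstructed, commercial):
    """Pass 1: first commercial record matching exactly on (block, sex, age).
    Pass 2: for records still unmatched, first commercial record matching
    on (block, sex) with age within +/- 1 year."""
    used, matches = set(), {}

    # Pass 1: exact matches on block, sex, and age
    for i, r in enumerate(reconstructed):
        for j, c in enumerate(commercial):
            if j in used:
                continue
            if (r["block"], r["sex"], r["age"]) == (c["block"], c["sex"], c["age"]):
                matches[i] = j
                used.add(j)
                break

    # Pass 2: fuzzy age (+/- 1 year) for the remainder
    for i, r in enumerate(reconstructed):
        if i in matches:
            continue
        for j, c in enumerate(commercial):
            if j in used:
                continue
            if (r["block"], r["sex"]) == (c["block"], c["sex"]) and abs(r["age"] - c["age"]) <= 1:
                matches[i] = j
                used.add(j)
                break
    return matches

# Invented example: one exact hit, one fuzzy-age hit
recon = [{"block": 1, "sex": "F", "age": 34}, {"block": 1, "sex": "M", "age": 60}]
comm = [{"block": 1, "sex": "M", "age": 61}, {"block": 1, "sex": "F", "age": 34}]
print(greedy_match(recon, comm))  # {0: 1, 1: 0}
```

The returned matches are the "putative" re-identifications; confirming them requires comparing against the confidential data.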
John Abowd, Chief Scientist at the Census, who provided the official details on the reconstruction experiments in court, has called these results "conclusive, indisputable, and alarming." A recent paper, "The Role of Chance in the Census Bureau Database Reconstruction Experiment," published in Population Research and Policy Review by Ruggles and Van Riper, argues that they are not. The paper describes a simple simulation that does approximately as well, in aggregate, on matching rate (specifically, they are talking about the 46.5% exact matching rate calculated against the CEF in the first table I shared above). The authors conclude from this that "The database reconstruction experiment therefore fails to demonstrate a credible threat to confidentiality," and argue multiple times that what the Census has done is equivalent to a clinical trial without a control group. (In fact this is said five times; the writing style is a bit heavy-handed.)

Ruggles and Van Riper describe generating 10,000 simulated blocks and populating them with random draws from the 2010 single-year-of-age and sex distribution, accounting for block population size (at least I think that's what's meant by "the simulated blocks conformed to the population-weighted size distribution of blocks observed in the 2010 Census"). They then randomly drew 10,000 new age-sex combinations and searched for each of them in each of the 10,000 simulated blocks. In 52.6% of cases, they found someone in the simulated block who exactly matched the random age-sex combination.

At this point I have to stop and remark that despite how "simple" this procedure is, even writing the above description was annoying. The Ruggles and Van Riper paper is sloppily worded to the point where it's hard to say what they did (though they do provide code, and anyone with the inclination to look at it should feel free to help fill in the details, like which 2010 tables exactly they were using).
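As best I can tell, their procedure is something like the following toy version. The age range, the geometric block-size distribution, and the uniform draws are all stand-ins, since the paper's exact inputs are unclear to me:

```python
import numpy as np

rng = np.random.default_rng(1)

n_blocks = 10_000
ages = np.arange(0, 101)                # single year of age
sexes = np.array([0, 1])
block_sizes = rng.geometric(p=0.03, size=n_blocks)  # stand-in size distribution

hits = 0
for size in block_sizes:
    # Populate the block with random (age, sex) people, then check whether
    # a fresh random (age, sex) guess happens to match someone in it.
    block = set(zip(rng.choice(ages, size), rng.choice(sexes, size)))
    guess = (rng.choice(ages), rng.choice(sexes))
    hits += guess in block

print(hits / n_blocks)  # chance an arbitrary age-sex guess "matches" by luck
```

The point of the baseline is that random guesses match someone surprisingly often in aggregate; the dispute, discussed below, is over whether aggregate match rate is the right metric.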
Even to write the high-level description above, I had to search around online to disambiguate the matching, and ultimately found a description on Ruggles' website that clarifies they were searching for each of the 10,000 new random age-sex combinations in each of the 10,000 simulated blocks, which is not what the paper says. Argh. The authors did at least confirm that in terms of "population uniques" (individuals who are unique in their combination of Census block, sex, and age in years), their simulated population was similar to the actual uniques in 2010 (reported to be 44%). They then try assigning everyone on each simulated block the most frequent race and ethnicity in the block according to 2012 Census data, and find that race and ethnicity will be correct in 77.8% of cases (again, the language in the paper is a little loose, but I assume that means a match on both). They use "that method to adjust the random age-sex combinations" and find that 40.9% of cases would be expected to match on all four characteristics to a respondent on the same block, hence not so far off from the 46.5% reported by the Census.

So, is this alarming? Before getting into that, perhaps I should mention that I personally welcome all of the "attacks" on database reconstruction attacks, because it's the database reconstruction attack theorem and these experiments that are leading to big decisions like the Census deciding DP is the future for privacy protection. We do need to question all the assumptions made in motivating applications of DP, as it is certainly complex and unintuitive in various ways (my last post aired what I see as some of the major challenges). In other words, I see it as very important to understand the reasons we should be wary.

However, this paper does not in my opinion provide a good one, for several reasons.
Sloppiness of explication aside, the biggest is that there is a difference in the results of the Bureau's reconstruction attacks, and it's directly related to the Census' concerns about using the old data-swapping technique (which required block totals to be invariant). Here are two figures, one from Abowd's declaration and one from Ruggles and Van Riper. Abowd's reaction to Ruggles and Van Riper's analysis points out (as Ruggles and Van Riper do in their paper) that the Census reconstruction attack does considerably better at reconstructing records in low-population Census blocks. (Normally I might ask why Ruggles and Van Riper's figure omits any axis ticks, as this really doesn't aid in comparison, but that's a question for another day.) In the text they do acknowledge that their random simulation guessed age and sex correctly in only 2.6% of cases for blocks with fewer than ten people, while the Census' re-identification rate was just over 20%, a pretty big difference. Abowd also points out that while 2010 swapping focused primarily on the small blocks, "for the entire 2010 Census a full 57% of the persons are population uniques on the basis of block, sex, age (in years), race (OMB 63 categories), and ethnicity. Furthermore, 44% are population uniques on block, age and sex." So differential privacy is a way to attempt to address both simultaneously.

So is it all worth worrying about? On some level you could say that's a religious question. I personally don't worry about my data leaking from the Census, but some people do, and it's worth noting that the Census is mandated by law to not "disclose or publish any private information that identifies an individual or business such, including names, addresses (including GPS coordinates), Social Security Numbers, and telephone numbers." There's some room for interpretation here, and the database reconstruction theorem provides a perspective.
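As background for the epsilon values that come up in this debate: the canonical epsilon-DP mechanism adds Laplace noise scaled to a query's sensitivity, so epsilon directly controls the privacy-accuracy tradeoff. A minimal sketch with invented counts (this is the textbook Laplace mechanism, not the Census' TopDown Algorithm):

```python
import numpy as np

rng = np.random.default_rng(2)

def laplace_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy.
    A counting query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so Laplace(1/epsilon) noise suffices."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Smaller epsilon -> more noise -> stronger privacy, lower accuracy.
for eps in [0.1, 1.0, 17.0]:
    noisy = [laplace_count(100, eps) for _ in range(1_000)]
    print(eps, round(float(np.std(noisy)), 2))  # noise sd is sqrt(2)/epsilon
```

At epsilon near 17, the added noise on a single count is a small fraction of a person, which gives some intuition for the stakeholder pressure toward higher epsilon described below.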
My own conclusion is that when you start thinking about protecting privacy and you want to do so with maximum guarantees, you end up somewhere close to differential privacy. There's a lot of back and forth in the declarations and other proceeding descriptions I read from the trial implying that there were some misunderstandings on Ruggles' part about what aspects of the Bureau's interpretation of compromised privacy have changed versus stayed consistent; there's too much nuance to summarize concisely here.

My problems with the Ruggles and Van Riper paper aside, there are important questions one might ask of the actual privacy budget used by the Census, including how exactly accuracy and risk trade off across different subsets of the population with respect to common analytical applications of Census data. I don't know all of what's been discussed among privacy circles here, but I've heard that the Census 2020 epsilon is somewhere around 17 (looks like there are details here). Priyanka Nanayakkara, who provided comments as I was writing this post, tells me that it was initially much lower (around 4), but feedback from stakeholders about accuracy requirements led to the current relatively high level. I also don't know enough about the post-processing that comprises part of their TopDown Algorithm, but it seems Cynthia Dwork, one of the creators of DP, has commented that she doesn't really like some of those steps.

There is also a valid question of how an external attacker would verify that they got the re-identifications correct. (On Twitter, Ruggles states that the Census Bureau "has reluctantly acknowledged" that "an outside intruder would have no means of determining if any particular inference was true.") In his declaration, Abowd acknowledges that an attacker would "have to do extra field work to estimate the confirmation rate—the percentage of putative re-identifications that are correct.
An external attacker might estimate the confirmation rate by contacting a sample of the putative re-identifications to confirm the name and address. An external attacker might also perform more sophisticated verification using multiple source files to select the name and address most consistent with all source files and the reconstructed microdata." Assuming the worst about the attacker's capabilities aligns with a mindset that seems common in privacy/security research. At any rate, Ruggles' assertion that the Census has admitted there's no way to verify doesn't seem quite right.

Finally, I just don't get the reasoning in Ruggles and Van Riper's clinical-trial analogy. From the abstract: "To extend the metaphor of the clinical trial, the treatment and the placebo produced similar outcomes. The database reconstruction experiment therefore fails to demonstrate a credible threat to confidentiality." It's as though Ruggles and Van Riper want to be comparing the results of a reconstruction attack on differentially private versions of the same 2010 Census tables against non-differentially-private versions, and finding that there isn't a big difference. But that's not what their paper is about, and I haven't seen anyone attempt that. Priyanka had the same reaction, and informs me that the Census has released DP-noised 2010 Census "demonstration data" corresponding to different levels of privacy preservation, so such a comparison seems within the realm of possibility.

Bottom line: showing that a more random database reconstruction technique matches fairly well in aggregate does not invalidate the fact that the Census reconstructed 17% of the population's records.

P.S. I read through a lot of the relevant parts of Abowd's declarations in writing this (thankfully his descriptions of what the Census did were quite clear). Overall, some of it is pretty juicy!
Ruggles apparently testified in court, so there are pages of Abowd directly addressing Ruggles' criticisms, which appear to be based on the same simulated data reconstruction attack he and Van Riper report in the recent paper. At one point Abowd points out that Ruggles has conflated the implementation of DP that the Census uses (their TopDown Algorithm) with DP itself. (Differential privacy is not an algorithm; it's a mathematical definition of privacy that some algorithms satisfy.) I can sympathize: I recall making this mistake at least once when I first encountered DP. What I can't quite imagine is getting to the point where I'm making this mistake while publicly taking on the Census Bureau.

P.P.S. In response to Ruggles' response to my post, I crossed out part of my P.S. above. I can't easily verify whether what Abowd said about the misinterpretation is true, and it's irrelevant to my main points anyway. So I guess now my title is ironic!

As a blogger with a moderate-sized readership, sometimes I get books in the mail, or emails offering me books to review. Recently I was contacted by the publisher of Art Studio, a series of books on real art for real children. The author writes that his book is geared to showing children that art and math are complementary studies, where one helps you learn the other. I was curious, so I asked for a copy.

The book is really interesting! Just to situate myself here: I like to draw sometimes, but I'm bad at it. I think I can confidently say I'm in the lower half of the distribution of drawing ability. To put it another way: you don't want me on your Pictionary team. But I'd like to get better.

I flipped through the book, and I'm getting the impression I can learn a lot from it. Some of the basic ideas I've seen before, for example the idea of drawing a figure by first putting it together as a series of simple shapes, but I like how Watt presents this, not just as a trick but as a matter of underlying form.
And he has some good slogans, like "the right place at the right size." I guess I resonated with the mathematical principles. Also this, on drawing hair:

Usually children will scribble hair floating about the head. They wonder why it looks so awful. Well, so would your own hair if you scribbled it instead of combing it!

Art is the study of Universal Form. If you want to draw the forms of the Universe, you have to understand how they are actually formed and simply copy the same movements. There is no magic to art. It is simply seeking a clearer view of how the Universe really works.

That's very statistical: he's talking about generative modeling! It's the Bayesian way: you don't fit a curve through data, you construct a process that could create the data.

Also I like how he flat-out says that some ways of drawing are right and some are wrong. I understand that ultimately anything goes, but when I'm learning I'd like some guidelines, and a straight permissive approach doesn't really help. I like to be told what to do, not in detail but in general principles, such as to start from the center and go outward from there.

My plan now is to go through the book myself, doing all the exercises. I guess it will take a few weeks. I'll report back to let you know if my drawing has improved.

The target audience for this book is kids, so I showed it to a 10-year-old who likes to draw and asked for her thoughts:

Q: Tell us about yourself. What sorts of things do you like to draw?

A: No comment.

Q: Do you prefer to draw with a pencil? Pen? Does it matter?

A: It depends. Pencil is the obvious choice, since you can erase. Some pens have erasers; those work also.

Q: What was your first impression when opening this book on Basic Drawing?

A: When I first opened the book I thought it would be a journey-esque thing through drawing.

Q: Did you learn anything from the book? If so, tell us about what you learned.

A: It showed shapes and lines, then formed them into drawings.
I learned different techniques.

Q: Would you recommend this book to someone of your abilities and experience?

A: Yes, it takes you through drawing, helping you gain knowledge about drawing that you might not know.

Q: Would you recommend this book to a drawing newbie?

A: Yes. You learn all the starting things, plus things that experienced artists might not know.

Q: What is your favorite thing about the book?

A: I like how it goes through drawing like a journey.

Q: What is the most annoying thing about the book?

A: It is pretty good. What could be improved is that they have some long parts they could substitute with something shorter but still learn the same amount.

Q: If the book could have one more thing added to it, what would it be?

A: I honestly don't know.

Q: Thanks for answering all these questions. You now get a popsicle!

A: Thanks!

Q: One more question. The author writes, "This book is geared to showing children that art and math are complementary studies, where one helps you learn the other." Did you see this connection when reading the book?

A: Not really. I guess with shapes, but that's all.

P.S. The above drawings are not intended to represent great drawing; rather, they're examples of two of the early lessons in the book.

Jon Zelner, Julien Riou, Ruth Etzioni, and I write:

Just as war makes every citizen into an amateur geographer and tactician, a pandemic makes epidemiologists of us all. Instead of maps with colored pins, we have charts of exposure and death counts; people on the street argue about infection fatality rates and herd immunity the way they might have debated wartime strategies and alliances in the past. The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has brought statistics and uncertainty assessment into public discourse to an extent rarely seen except in election season and the occasional billion-dollar lottery jackpot.
In this paper, we reflect on our role as statisticians and epidemiologists and lay out some of the challenges that arise in measuring and communicating our uncertainty about the behavior of a never-before-seen infectious disease. We look at the problem from multiple directions, including the challenges of estimating the case fatality rate (i.e., the proportion of individuals who will die from the disease), the rate of transmission from person to person, and even the number of cases circulating in the population at any time. We advocate for an approach that is more transparent about the limitations of statistical and mathematical models as representations of reality, and suggest some ways to ensure better representation and communication of uncertainty in future public health emergencies.

We discuss several issues of statistical design, data collection, analysis, communication, and decision-making that have arisen in recent and ongoing coronavirus studies, focusing on tools for assessment and propagation of uncertainty. This paper does not purport to be a comprehensive survey of the research literature; rather, we use examples to illustrate statistical points that we think are important.

Here are the sections of the paper:

Statistics and uncertainty
Data and measurement quality
Design of clinical trials for treatments and vaccines
Disease transmission models
Multilevel statistical modeling
Communication
Information aggregation and decision-making

New issues keep coming up, and they didn't all make it into the article; for example, we didn't get into the confusions arising from aggregation bias. There's always more to be said.

I found this question and answer from a leading social science researcher at a trustworthy internet source:

Why do people prefer to share misinformation, even when they value accuracy and are able to identify falsehoods?

Researchers point to distraction and inattention.
When they prompted Twitter users, even subtly, to think about accuracy before sharing content, the quality of the postings improved. Now we just need to get social media platforms to do something that reminds us of our preference for accuracy before we share.

I'm thinking this is a more general problem. NPR and TED are more like traditional media than social media, but they exist within a social media environment, and both these media are notorious for (a) promoting junk science and (b) not coming to terms with the fact that they promote junk science. Much of the time, the junk science they promote stays up on their sites indefinitely (as with the notorious Why We Sleep), and when they do issue some sort of retraction, I don't feel that these organizations fully address the problem of their credulity with respect to all the bad studies they didn't get around to retracting.

I get it: when we produce reports, we can make errors. I've published errors myself, and I've issued corrections for four of my published papers. Errors are inevitable, and we need to think about the next step, which is how to learn from them. I don't know that subtle prompting to think about accuracy is enough.

I also don't think that lie detection tests are the answer. First, there's no such thing as a lie detector. Second, the big problem is not lying but the general free-lunch attitude that there is this large and readily accessible set of small painless interventions that can have large and consistent effects, which in a social science context violates the piranha principle, for reasons we've discussed in various places on this blog.
Ethical problems and statistical challenges are connected through the Armstrong principle and Clarke's law, so it's complicated. But I don't think that lie detection would help so much, even if it were possible, because it's my impression that lots of people who do junk science are sincere, and I bet that they can justify to themselves even the most extreme cases of data faking as just some jumping through hoops in order to satisfy the persnickety Stasi-type statisticians who go around insisting on statistical significance for everything.

Yes, I'm a statistician and I hate statistical significance, but I doubt these researchers distinguish between different statisticians. To them, we're all a single annoying mass, an institution that annoyingly sends mixed messages, sometimes requiring statistical significance and other times telling people not to compute p-values at all. To ordinary researchers who are just trying to do their job, get on NPR a lot, give TED talks, and get rich and famous, the entire field of statistics is like some sort of pesky IRB that keeps throwing up ever-changing roadblocks. So, to some of these researchers, I imagine that faking your data is kind of like backdating an expense account: a step that, sure, is ethically questionable, but really it's just a matter of fiddling with the paperwork in order to get to what is ultimately fair. And everybody does it, right? As the professor of marketing says, what separates honest people from not-honest people is not necessarily character, it's opportunity.
My point is, yes, this can be viewed as a moral issue, but ultimately I see it as a statistical issue, in that once you start with certain misconceptions (the expectation that effect sizes are huge, that statistical significance can and should be routinely found, that measurement doesn't really matter if you have causal identification, that everything you want to find is right there if you put in the effort to look hard enough, that social scientists are heroes who deserve to be celebrated in the national media) and then you operate by avoiding opportunities to learn from your mistakes, then all these other problems flow from that. If you get into the mindset that everything you do is fundamentally correct, then questionable research practices, misrepresentation of data and the literature, and out-and-out fraud won't seem so bad. We discussed this general attitude last year: the view that the scientific process is "red tape," just a bunch of hoops you need to jump through so you can move on with your life.

As I wrote a few years ago in the context of pizzagate, it's fine to shine a light on bad behavior, but I think it's a mistake to focus on that rather than on the larger problem of researchers being trained to expect routine discovery and then being rewarded for coming up with a stream of apparent discoveries in a scientist-as-hero narrative.

P.S. At the link two paragraphs above, commenter Sebastian draws a TV Tropes connection. I love TV Tropes.
Paul Buerkner writes: In Stuttgart, I have a new fully funded PhD student position on Meta-Uncertainty in Bayesian Model Comparison, with close connections to Bayesian workflow topics, especially simulation-based calibration. At the Cluster of Excellence SimTech, University of Stuttgart, Germany, I am currently looking for a PhD student (3 years; fully funded) to work with me on Meta-Uncertainty in Bayesian Model Comparison. And here is the description of the project: In experiments and observational studies, scientists gather data to learn more about the world. However, what we can learn from a single data set is always limited, and we are inevitably left with some remaining uncertainty. It is of utmost importance to take this uncertainty into account when drawing conclusions if we want to make real scientific progress. Formalizing and quantifying uncertainty is thus at the heart of statistical methods aiming to obtain insights from data. Numerous research questions in basic science are concerned with comparing multiple scientific theories to understand which of them is more likely to be true, or at least closer to the truth. To compare these theories, scientists translate them into statistical models and then investigate how well the models' predictions match the gathered real-world data. One widely applied approach to compare statistical models is Bayesian model comparison (BMC). Relying on BMC, researchers obtain the probability that each of the competing models is true (or is closest to the truth) given the data. These probabilities are measures of uncertainty and, yet, are also uncertain themselves. This is what we call meta-uncertainty (uncertainty over uncertainties). Meta-uncertainty affects the conclusions we can draw from model comparisons and, consequently, the conclusions we can draw about the underlying scientific theories. However, we have only just begun to unpack and to understand all of the implications.
This project contributes to this endeavor by developing and evaluating methods for quantifying meta-uncertainty in BMC. Building upon mathematical theory of meta-uncertainty, we will utilize extensive model simulations as an additional source of information, which enables us to quantify so-far implicit yet important assumptions of BMC. What is more, we will be able to differentiate between a closed world, where the true model is assumed to be within the set of considered models, and an open world, where the true model may not be within that set – a critical distinction in the context of model comparison procedures. For more details, please see the position announcement. Keep an M-open mind, is all I can say. Richard Juster points to this article, "Evaluation of Aducanumab for Alzheimer Disease: Scientific Evidence and Regulatory Review Involving Efficacy, Safety, and Futility." The article begins: On November 6, 2020, a US Food and Drug Administration (FDA) advisory committee reviewed issues related to the efficacy and safety of aducanumab, a human IgG1 anti-Aβ monoclonal antibody specific for β-amyloid oligomers and fibrils implicated in the pathogenesis of Alzheimer disease. . . . The primary evidence of efficacy for aducanumab was intended to be 2 nearly identically designed, phase 3, double-blind, placebo-controlled randomized clinical trials . . . The studies were initiated after a phase 1b safety and dose-finding study indicated suitable drug safety . . .
Approximately halfway through the phase 3 studies, a planned interim analysis met prespecified futility criteria and, in March 2019, the sponsor announced termination of the trials. However, following this decision, and augmenting the data set with additional trial information that had been gathered after the futility determination, conflicting evidence of efficacy was identified in the 2 studies. This sort of thing must happen all the time: decisions must be made based on partial evidence. But then things start getting weird: Study 301 (n = 1647 randomized patients) did not meet its primary end point of a reduction relative to placebo in the Clinical Dementia Rating–Sum of Boxes (CDR-SB) score. According to prespecified plans to protect against erroneous conclusions when performing multiple analyses, no statistically valid conclusions could therefore be made for any of the secondary end points in study 301. By contrast, study 302 (n = 1638 patients) reached statistical significance on its primary end point, estimating a high-dose treatment effect corresponding to a 22% relative reduction in the CDR-SB outcome compared with placebo (P = .01). In the low-dose aducanumab group in study 302, the effect was not statistically significant compared with placebo, and based on the prespecified analytic plan, this precluded the ability to assess efficacy with respect to secondary outcomes in both the high- and low-dose groups. . . . Lots of jargon here, but the message seems to be that the decisions are being made on some sort of house of cards built from various statistical significance statements. I feel like these authors are doing their best. It's just that they're working with very crude tools, trying to paint a picture using salad tongs. In a randomized experiment (i.e., RCT, A/B test, etc.) units are randomly assigned to treatments (i.e., conditions, variants, etc.).
Let's focus on Bernoulli randomized experiments for now, where each unit is independently assigned to treatment with probability q and to control otherwise. Thomas Aquinas argued that God's knowledge of the world upon creation of it is a kind of practical knowledge: knowing something is the case because you made it so. One might think that in randomized experiments we have a kind of practical knowledge: we know that treatment was randomized because we randomized it. But unlike Aquinas's God, we are not infallible, we often delegate, and often we are in the position of consuming reports on other people's experiments. So it is common to perform and report some tests of the null hypothesis that this process did indeed generate the data. For example, one can test that the sample sizes in treatment and control aren't inconsistent with this. This is common at least in the Internet industry (see, e.g., Kohavi, Tang, and Xu on "sample ratio mismatch"), where it is often particularly easy to automate. Perhaps more widespread is testing whether the means of pre-treatment covariates in treatment and control are distinguishable; these are often called balance tests. One can do per-covariate tests, but if there are a lot of covariates then this can generate confusing false positives, so often one might use some test for all the covariates jointly. Some experimentation systems in industry automate various of these tests and, if they reject at, say, p < 0.001, show prominent errors or even watermark results so that they are difficult to share with others without being warned. If we're good Bayesians, we probably shouldn't give up on our prior belief that treatment was indeed randomized just because some p-value is less than 0.05.
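As a concrete illustration, a sample-ratio-mismatch check of this kind takes only a few lines. The function name and the normal approximation below are my own choices, not taken from any particular experimentation system:

```python
# Sketch of a sample-ratio-mismatch (SRM) check for a Bernoulli
# randomized experiment: is the observed treatment count consistent
# with each unit being independently assigned to treatment w.p. q?
import math

def srm_z_test(n_treat, n_total, q=0.5):
    """Two-sided z-test of H0: units assigned to treatment w.p. q."""
    expected = n_total * q
    se = math.sqrt(n_total * q * (1 - q))
    z = (n_treat - expected) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # 2 * P(Z > |z|), normal approx.
    return z, p
```

For example, 5300 treated units out of 10000 under q = 0.5 gives z = 6 and a p-value far below the automated thresholds mentioned above, so one would investigate rather than trust the experiment's results.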
But if we've got p < 1e-6, then — for all but the most dogmatic prior beliefs that randomization occurred as planned — we're going to be doubtful that everything is all right and move to investigate. In my own digital field and survey experiments, we indeed run these tests. Some of my papers report the results, but I know there's at least one that doesn't (though we did the tests) and another where we just state that they were all not significant (and this can be verified with the replication materials). My sense is that reporting balance tests of covariate means is becoming even more of a norm in some areas, such as applied microeconomics and related areas. And I think that's a good thing. Interestingly, it seems that not everyone feels this way. In particular, methodologists working in epidemiology, medicine, and public health sometimes refer to a "Table 1 fallacy" and advocate against performing and/or reporting these statistical tests. Sometimes the argument is specifically about clinical trials, but often it applies to randomized experiments more generally. Stephen Senn argues in this influential 1994 paper: Indeed the practice [of statistical testing for baseline balance] can accord neither with the logic of significance tests nor with that of hypothesis tests for the following are two incontrovertible facts about a randomized clinical trial: 1. over all randomizations the groups are balanced; 2. for a particular randomization they are unbalanced. Now, no 'significant imbalance' can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself.
Thus a significant result should lead to the decision that the treatment groups have not been randomized, and hence either that the trialist has practised deception and has dishonestly manipulated the allocation or that some incompetence, such as not accounting for all patients, has occurred. In my opinion this is not the usual reason why such tests are carried out (I believe the reason is to make a statement about the observed allocation itself) and I suspect that the practice has originated through confused and false analogies with significance and hypothesis tests in general. This highlights precisely where my view diverges: indeed, the reason I think such tests should be performed is that they could lead to the conclusion that "the treatment groups have not been randomized." I wouldn't say this always rises to the level of "incompetence" or "deception," at least in the applications I'm familiar with. (Maybe I'll write about some of these reasons at another time — some involve interference, some are analogous to differential attrition.) It seems that experimenters and methodologists in social science and the Internet industry think that broken randomization is more likely, while methodologists mainly working on clinical trials put a very, very small prior probability on such events. Maybe this largely reflects the real probabilities in these areas, for various reasons. If so, part of the disagreement simply comes from cross-disciplinary diffusion of advice and overgeneralization. However, even some of the same researchers are sometimes involved in randomized experiments that aren't subject to all the same processes as clinical trials. Even if there is only a small prior probability of broken randomization, if it is very easy to test for it, we still should.
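And such a test is indeed cheap to run. Here is a rough sketch of my own (not from Senn or any particular package) of a joint balance test that sums squared per-covariate z-statistics and compares the total to a chi-square reference; it assumes large samples and roughly independent covariates, and the closed-form tail probability used here is valid only for an even number of covariates:

```python
# Crude joint balance test: sum of squared per-covariate z-statistics
# compared to a chi-square reference distribution. Assumes large
# samples and roughly independent covariates; the closed-form survival
# function below requires an even number of covariates.
import math

def chi2_sf_even_df(x, df):
    """Chi-square survival function, closed form for even df."""
    k = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(k))

def joint_balance_test(mean_diffs, std_errors):
    """mean_diffs: treatment-minus-control covariate means;
    std_errors: their standard errors.
    Returns (statistic, approximate p-value)."""
    zs = [d / s for d, s in zip(mean_diffs, std_errors)]
    stat = sum(z * z for z in zs)
    return stat, chi2_sf_even_df(stat, len(zs))
```

With balanced made-up covariates, say `joint_balance_test([0.01, -0.02], [0.05, 0.05])`, the p-value is large, and no alarm is raised; huge standardized differences drive it toward zero.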
One nice feature of balance tests compared with other ways of auditing a randomization and data-collection process is that they are pretty easy to take in as a reader. But maybe there are other costs of conducting and reporting balance tests? Indeed, this gets at other reasons some methodologists oppose balance testing. For example, they argue that it fits into an often vague process of choosing estimators in a data-dependent way: researchers run the balance tests and make decisions about how to estimate treatment effects as a result. This is articulated in a paper in The American Statistician by Mutz, Pemantle, and Pham, which highlights how discretion here creates a garden of forking paths. In my interpretation, the most considered and formalized arguments say that conducting balance tests and then using them to determine which covariates to include in the subsequent analysis of treatment effects in randomized experiments has bad properties and shouldn't be done. Here the idea is that when these tests provide some evidence against the null of randomization for some covariate, researchers sometimes then adjust for that covariate (when they wouldn't have otherwise); and when everything looks balanced, researchers use this as a justification for using simple unadjusted estimators of treatment effects. I agree with this, and typically one should already specify adjusting for relevant pre-treatment covariates in the pre-analysis plan; including them will increase precision. I've also heard the idea that these balance tests in Table 1 confuse readers, who see a single p < 0.05 — often uncorrected for multiple tests — and get worried that the trial isn't valid. More generally, we might think that Table 1 of a paper in a widely read medical journal isn't the right place for such information. This seems right to me.
There are important ingredients to good research that don't need to be presented prominently in a paper, though it is important to provide information about them somewhere readily inspectable in the package for both pre- and post-publication peer review. In light of all this, here is a proposal:
1. Papers on randomized experiments should report tests of the null hypothesis that treatment was randomized as specified. These will often include balance tests, but of course there are others.
2. These tests should follow the maxim "analyze as you randomize," both accounting for any clustering or blocking/stratification in the randomization and any particularly important subsetting of the data (e.g., removing units without outcome data).
3. Given a typically high prior belief that randomization occurred as planned, authors, reviewers, and readers should certainly not use p < 0.05 as a decision criterion here.
4. If there is evidence against randomization, authors should investigate, and may often be able to fully or partially fix the problem long before peer review (e.g., by including improperly discarded data) or in the paper (e.g., by identifying that the problem affected only some units' assignments and bounding the possible bias).
5. While it makes sense to mention these tests in the main text, there is typically little reason — if they don't reject with a tiny p-value — for them to appear in Table 1 or some other prominent position in the main text, particularly of a short article. Rather, they should typically appear in a supplement or appendix — perhaps as Table S1 or Table A1.
This recognizes both the value of checking implications of one of the most important assumptions in randomized experiments and the fact that most of the time this test shouldn't cause us to update our beliefs about randomization much. I wonder if any of this remains controversial and why. [This post is by Dean Eckles. This is my first post here.
Because this post discusses practices in the Internet industry, I note that my disclosures include related financial interests and that I've been involved in designing and building some of those experimentation systems.] Back in the 1970s, I remember occasionally reading a newspaper or magazine article about this mysterious thing called an HMO, a health maintenance organization. The idea was that the medical system as we knew it (you go to the doctor when you're sick and pay some money, or you go to the hospital if you're in really bad shape and pay some money) had a problem because it gave doctors and hospitals a motivation for people to be sick: as it's sometimes said today, "sick care, not health care." The idea is not that health care providers would want people to be sick, but that they'd have no economic incentive to increase the general health of the population. This seemed in contradiction to Deming's principles of quality control, in which the goal should be to improve the system rather than to react to local problems. In contrast, the way HMOs work is that you pay them a constant fee every month, whether or not you go to the doctor. So they are motivated to keep you healthy, not sick. Sounds like a great idea. But something happened between 1978 and today. Now we all have HMOs, but there's even more concern about screwed-up economic motivations in the health care system. This time the concern is not that they want us to go to the doctor too much; it's that they want to perform too many tests on us and overcharge us for ambulance rides, hospital stays, aspirins they give us while we're in the ambulance or the hospital, etc. I guess this arises from the fact that much of the profit for HMOs comes not from our monthly fees but from those extra charges. What's my point in writing about this? I'm not an expert in health care research, so I don't have much to add in that direction (see for example …).
Rather, I'm coming at this as an outsider. The simplest message here is: Ha! Unexpected consequences! Or, to get more granular, you could say that as long as there's loose money floating around, there will be operators figuring out how to grab it. Still, it's interesting to me how HMOs solved a problem of counterproductive incentives but then this led to a new problem of counterproductive incentives. And I don't think it's inevitable, as there are lots of other countries that don't have this particular set of problems with their health care systems. I was just listening to this This American Life show on Alex Jones. I knew he was a bad guy but I hadn't realized how horrible he was, a kind of supercharged Al Sharpton with a much higher level of hatred and lying. But that won't be news to many of you. The stunning part to me was near the end, when interviewer Jon Ronson talks to some people who personally knew Jones in high school, knew that Jones told lie after lie after lie, and yet they say things like: "Who's to say . . . some of the stuff he says could be true. It could be. I mean, Obama, he could be a Muslim. He could back them up, the radical Muslims. And he could have been giving them money behind, I mean, who knows? We don't know. I mean, we hear what they want us to hear. We see what they want us to see. I mean, anything could be anything." As Ronson puts it, "All these people who knew for sure that Alex had been a liar back in Rockwall. A lot of them believed that what he says on Infowars might be true." It struck me that this is an example of the fallacy of the one-sided bet, which is, in this case, to hear argument X and think that X might be true or maybe X isn't true, but not to consider the possibility that the opposite of X might be true. So, yeah, Alex Jones could be a pathological liar and still tell the truth sometimes on his show. Maybe Obama is a Muslim, despite there being no evidence for this. Maybe that school shooting never happened.
The probability is about 1 in a zillion of that, but no probability is exactly zero. But, once you open the door to these things, why not consider other equally unlikely possibilities? Pick a random celebrity and say that he did the school shootings. Pick another random Christian celebrity and say that she's a Muslim. Etc. Being open-minded is fine, but that's no reason to take as your default belief the statements of a person who's known to have lied repeatedly. The purpose of this post is not to convince any Alex Jones fans that they're wrong. This blog has 10,000 readers, so maybe there are some Alex Jones fans in the audience, and maybe there are some others who aren't Alex Jones fans, exactly, but see him as a fighter for their side; who knows. It's easier to talk about ovulation and voting or beauty and sex ratio than Holocaust deniers and school shooting deniers because . . . these latter topics are just more upsetting to think about, as they involve actual people being murdered and then disrespected. Election denial is somewhere in the middle: no dead bodies involved (except on January 6th), but it threatens democracy, so there's that. Anyway, if you're an Alex Jones fan, you might still be able to compartmentalize your views, and so you can still read this blog for the statistics advice. No, the point of this post is just to reflect on the persistence of the fallacy of the one-sided bet. Or, to put it another way, the power of defaults. My plea to you: whenever you hear the claim X, don't immediately frame this as "maybe X is true and maybe it's not," leading to some sort of belief that's a weighted average of X and what you thought before. That's not a generally appropriate mode of Bayesian analysis. This post is by Phil Price, not Andrew. A few weeks ago I posted about the claim by the company that made the running track for the Tokyo Olympics that the bounciness of the track makes it 1-2% faster than other professional tracks.
The claim isn't absurd: certainly the track surface can make a difference. If today's athletes had to run on the cinder tracks of yesteryear, their speed would surely be slower. At the time I wrote that post the 400m finals had not yet taken place, but of course they're done by now, so I went ahead and took another quick look at the whole issue. The bottom line is that I don't think the track in Tokyo let the runners run noticeably faster than the tracks used in recent Olympics and World Championships. Here's the story in four plots. All show average speed rather than time: the 200m takes about twice as long as the 100m, so they have comparable average speed. Men are faster, so in each panel (except the bottom right) the curves for men are closer to the top, and those for women are closer to the bottom. Andrew, thanks for pointing out that this is better than having separate rows of plots for women and men, which would add a lot of visual confusion to this display. The top left plot shows the average speed for the 1st-, 2nd-, 3rd-, and 4th-place finishers in the 100, 200, and 400m, for men and women. Each of the subsequent plots represents a further aggregation of these data. The upper right just adds the times together and the distances together, so, for instance, the top line is (100 + 200 + 400 meters) / (finishing time of the fastest man in the 100m + finishing time of the fastest man in the 200m + finishing time of the fastest man in the 400m). The bottom left aggregates even further: the total distance run by all of the male finishers divided by the total time of all of the male finishers, in all of the races; and the same for the women. And finally, taking it to an almost ludicrous level of aggregation, the bottom right shows the mean speed (the total distance run by all of the competitors in all of the races, divided by the total of all of the times) divided by the mean of all of the mean speeds, averaged over all of the years.
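The "total distance over total time" aggregation can be written out explicitly. The finishing times below are made up for illustration, not the actual Tokyo results:

```python
# Overall mean speed as total distance divided by total time, pooled
# over events and finishers (illustrative numbers, not real results).
events = [  # (distance in meters, finishing times in seconds)
    (100, [9.80, 9.89, 9.95, 10.00]),
    (200, [19.62, 19.68, 19.93, 20.20]),
    (400, [43.85, 44.08, 44.27, 44.94]),
]
total_dist = sum(d * len(times) for d, times in events)
total_time = sum(t for _, times in events for t in times)
mean_speed = total_dist / total_time  # meters per second
```

Computing this one number per year, then dividing each year's value by the across-year mean, gives the kind of single summary curve described above.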
A point at a y-value of 1.01 on this plot would mean that the athletes that year averaged 1% faster than in an average year. If someone wants to claim the track allows performances that are 1-2% faster than on previous tracks, they're going to have to explain why the competitors in the sprints this year were only about 0.4% faster than the average of the past several Olympics and World Championships. Even that 0.4% looks a bit iffy, considering the men weren't faster at all. You can make up a just-so story about the track being better tuned to women's lighter bodies and the lower forces they exert on the track, but I won't believe it. There's year-to-year and event-to-event variation in results, depending on exactly which athletes are competing, where they are in their careers, what performance-enhancing drugs they are taking (if any), and other factors too (wind, temperature on race day, etc.). It's not inconceivable that the sprint speeds would have been 1-2% slower this year if not for the magical track, which just happened to bring them back up to around the usual values. But that's sure not the way to bet. This post is by Phil. This is Jessica. Earlier this year, I was reading some of the emerging work on AI bias and fairness, and it got me thinking about the nature of claims (and common concerns about them) in AI/ML research compared to those in empirical social science research. In a social science like psych, one could characterize researchers as often aiming to learn a function from (often small) data so they can interpret parameter values to make claims about the real-world referents they are thought to summarize. In AI/ML, by contrast, the goal is more to learn a function from large data in order to make claims about its performance on unseen data. So how much do the types of reproducibility concerns that arise in AI/ML work resemble those in the social sciences?
Discussions of reproducibility, replicability, and robustness of claims made in ML-oriented research appear more nascent than in social science methods reform, but they are cropping up and (I hope) will eventually become more mainstream. I say ML-oriented research to include both research aimed at producing new ML techniques and more applied ML research where claims about performance tend to be more domain-specific. The idea of a reproducibility crisis affecting ML research has been around for a few years; Joelle Pineau started talking about it initially in the context of problems reproducing results in deep reinforcement learning (though there are earlier papers on reinforcement learning claims being vulnerable to overfitting to the evaluation environment). Pineau led the ICLR 2018 reproducibility challenge, and there have been subsequent ICLR challenges and NeurIPS reproducibility programs, among others. The initial ICLR challenge found that only 32.7% of reported attempts to reproduce 95 papers were successful in reproducing most of the work in the paper given whatever materials the authors provided. Most of the work I’ve seen focuses on reproducibility over replication or robustness, which makes sense: if you can’t reproduce the results in the paper on the same data, you shouldn’t expect that you’ll be able to replicate them on new data, or that they’ll be robust to different implementations of the learning methods. Some of the reasons for non-reproducibility are obvious, like not being able to obtain either the code or training data, which I would assume is less common than it is in social science, but still apparently happens at times.
And as described here, some of the problems may be due simply to bad or incomplete reporting, given that another study found that over half of 255 papers could be successfully independently reproduced ("successfully reproduced" meaning that 75%+ of the claims made could be reproduced), and this number increased when the original authors were able to help, unlike some examples of adding the original authors in psych reproduction attempts. But then there are the more interesting reasons. How different is “the art of tweaking” in AI/ML-oriented research relative to empirical psych, for instance? There’s the need to tune hyperparameters, the inherent randomness in probabilistic algorithms, the fact that the relationships between model parameters and performance are often pretty opaque, etc., all of which may contribute to a lack of awareness that one is doing the equivalent of cherrypicking. I’ve done only a little applied ML work, but the whole "let’s tweak this knob we don’t fully understand the implications of and try rerunning" approach, which Pineau implies in the talk linked above, seems like par for the course from what I've seen. Among researchers I would expect there is often some attempt to understand the knob first, but with deep models there are design choices where the explanations are still actively being worked out by the theorists. I guess in this way the process can be similar to the process of trying to get a complex Bayesian model to fit, where identifying the priors for certain parameters that will lead to convergence can seem opaque. In the conventional empirical psych study analyzed using NHST, I think of researcher “tweaking” as being more about tweaking aspects of the data collection (e.g., the set of stimuli studied), of the data included in analysis, and of how parameters of interest are defined.
There’s also an asymmetry in incentives in ML research: you hyper-tweak your own model but not the baseline models you’re benchmarking against, which you want to look bad. Maybe there's an analogue in psych experiments, where you want your control condition to perform badly. “Leakage” is another point of comparison. In reform discussions affecting empirical psych, pre-registration is framed as a way to prevent researchers from making analysis decisions conditional on the specific data that was collected. In applied ML, leakage can seem less human-mediated. In a recent paper, Kapoor and Narayanan describe a case study on applied ML in political science (see also this project page), where the claims are about out-of-sample performance in predicting civil war onset. They identified recent civil-war-related studies published in top poli sci journals claiming superior performance of ML models over logistic regression, and ultimately found 12 that provided code and data and made claims about ML model performance using a train-test split. Four papers in this set claimed an ML model outperformed logistic regression, and they tried to reproduce these results. For all four, they were able to reproduce the rankings of models, but they found errors in all four analyses stemming from leakage, none of which were discoverable from the paper text alone. When they corrected the errors, only one of the cases resulted in the ML model actually performing better than logistic regression models. The types of leakage Kapoor and Narayanan found included imputing test and training data together, such that the out-of-sample test set captures similar correlations between target and independent variables as observed in the training data. The paper that did this was not just imputing a few values, but 95% of the missing values in the out-of-sample test set. Yikes!
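The imputation form of leakage is easy to see in a toy example. This is a minimal sketch with made-up numbers, not the actual pipelines from the papers in question:

```python
# Mean-imputing train and test together leaks test-set information into
# the imputed values; the fix is to compute imputation statistics on
# the training data alone. Toy data, with None marking missing values.
train = [1.0, 2.0, None, 4.0]
test = [None, 100.0]  # test distribution differs from train

def mean_of(xs):
    vals = [x for x in xs if x is not None]
    return sum(vals) / len(vals)

# Leaky: pool train and test before imputing
pooled_mean = mean_of(train + test)  # pulled upward by the test data
leaky_test = [pooled_mean if x is None else x for x in test]

# Correct: impute the test set using training statistics only
train_mean = mean_of(train)
clean_test = [train_mean if x is None else x for x in test]
```

Here the leaky version fills the missing test value with a mean heavily influenced by the test set itself, so the "out-of-sample" evaluation is no longer truly out of sample.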
Also, interestingly, this paper had already been criticized (in political science venues), and its code revised, but the imputation leakage had not been pointed out. And then other papers were re-using the inappropriately imputed data, which, as Kapoor and Narayanan point out, can then “lend credibility to exaggerated claims about the performance of ML models by ‘independently’ validating their claims.” Other examples just in this set of four papers include leakage due to using variables that are proxies for the dependent variable; when they removed these and reran the models, they found that not only did the ML models no longer beat logistic regression, they also all failed to outperform a baseline that predicts war when war occurred in the previous year and peace otherwise. Finally, several of the papers did k-fold cross-validation on temporal data without using procedures like forward chaining to prevent leakage from future data. They also include a table of results from other systematic evaluations of applied ML in quantitative fields, as evidence that there’s potentially a reproducibility crisis in research fields that use ML methods. It almost makes you think watching a few random videos, finding the appropriate Python libraries, and pumping out some useful classification models isn’t as easy as thousands of tutorials, blog posts, certificate programs, etc. make it out to be. One other class of reproducibility issues in ML-oriented research that Kapoor and Narayanan address, which also comes up a little in some of the earlier reports, is insufficient uncertainty quantification in these model comparisons.
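One simple remedy, sketched here as my own construction (not a procedure from Kapoor and Narayanan): a percentile bootstrap interval for the difference in test-set accuracy between two models, resampling test examples:

```python
# Percentile bootstrap 95% CI for the accuracy difference between two
# models scored on the same test set (per-example 0/1 correctness).
import random

def bootstrap_ci_diff(correct_a, correct_b, n_boot=2000, seed=1):
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test examples
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

With a tiny test set (like 11 onset cases), an interval of this kind will be very wide, which is exactly the point: claims that one model "beats" another are hard to sustain without it.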
Perhaps my favorite sentence from the 2019 NeurIPS report is this one: “Thirdly, as opposed to most scientific disciplines where uncertainty of the observed effects are routinely quantified, it appears like statistical analysis is seldom conducted in ML research (Forde and Paganini, 2019; Henderson et al., 2018).” Nine of the 12 papers Kapoor and Narayanan obtained data and code for included no form of uncertainty quantification (including reporting any kind of significance testing or CIs on performance) for the out-of-sample model performance comparisons. The size of the test set in some of the papers they reviewed was a source of variance (one of the papers had only 11 instances of civil war onset). That same paper compared performance based on smoothed ROC curves rather than the empirical curves, and the difference in performance associated with this choice is as big as the best minus the worst performance metric reported. Again, yikes. I imagine there's probably lots more to say with regard to how unaddressed uncertainty affects claims made in ML papers. Natesh Pillai points us to this recent article, The misuse of colour in science communication, which begins: "The accurate representation of data is essential in science communication. However, colour maps that visually distort data through uneven colour gradients or are unreadable to those with colour-vision deficiency remain prevalent in science. These include, but are not limited to, rainbow-like and red–green colour maps." Yes, the rainbow color scheme is well known to be horrible, and there are some alternatives. I sent this to Jessica Hullman to get her thoughts, and she wrote: I've never been a big color perception person, but yes, that's generally what's been assumed and taught in visualization research. Though in the last few years there have been a few studies that looked at this, inspired in part by how many scientists refuse to give them up (maybe there's some utility we just haven't thought of yet?)
and found reasons to think rainbow color maps are not as awful as previously thought. I just last week saw this one presented at IEEE VIS, which varies the task (inference specifically, how well people can tell which visualizations were produced by the same model, rather than just perceiving data values through color), and finds: "Contrary to conventional guidelines, participants were more accurate when viewing colormaps that cross a variety of uniquely nameable colors." I haven't had time to read closely, but there was some discussion about whether the argument that crossing more nameable colors is helpful can really be made if they didn't control for the number of discriminable steps (steps for which there is a just-noticeable difference) in the color ramp.

Another one suggests they're not bad for judging differences in gradients of scalar fields. Some other work finds that people are consistent in how they implicitly discretize rainbow scales, but also finds some data-specific differences in implicit discretization, concluding it's more complicated than we thought to evaluate them.

While I'm still teaching students that they are generally a bad idea, I tell students this doesn't mean there won't be scientific applications where they do work ok. For instance, phase diagrams where you can look to see where all the colors meet (e.g., here). In general, there would seem to be few visualization guidelines that are always true.

(This post is by Kaiser, not Andrew.)

The Tokyo Olympics ended with the U.S. once again topping the medals table, whether you count golds or all medals. Boring! The journalist class moaned. So they found alternative ranking schemes. For example, the BBC elevated tiny San Marino to the top based on population. These articles inspired me to write this post.

As statisticians, we all have had snarky comments thrown at us, alleging that we will manufacture any story we like out of data. In a moment of self-reflection, I decided to test this accusation.
Is it possible to rank any country on top by inventing different metrics? I start from this official medals table:

China is #2. After adjusting for the number of athletes, China is #1 in winning golds.

ROC is #3. It is #1 in medals after adjusting for the number of athletes. Its female athletes were particularly successful.

Team GB is #4. I elevate them to #1 by counting the proportion of sports they entered in which they won golds.

The host nation, Japan, came in 5th place. It is #1 when counting the proportion of medals won that were golds.

Australia finished 6th. No worries. It is #1 if I look at how much better the Aussie women were at winning golds than their male compatriots.

Italy is #7. No nation has suffered as much from close calls: it had the highest proportion of medals won that were silvers or bronzes.

Germany is #8. It had the most disappointing campaign, with the biggest drop in golds won compared to Rio 2016.

The Netherlands is #9. Its Olympic athletes showed the largest improvement in total medals compared to Rio 2016.

Our next host nation, France, is #10. It's ranked #1 if I rank countries by how much their male athletes outperformed their female compatriots.

So I completed the challenge! It is indeed possible to anoint 10 different winners using 10 different ranking schemes. No country is left behind. I even limited myself to the Olympics dataset, and didn't have to merge in other data like population or GDP. Of course, the more variables we have, the easier it is to accomplish this feat.

For those teaching statistics, I recommend this as an exercise: pick some subset of countries, and ask students to come up with metrics that rank each country #1 within the subset, and write appropriate headlines. This exercise trains skills in exploring data and generating insights.

In the end, I plead guilty as a statistician. We indeed have some superpowers.
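Kaiser's exercise is easy to reproduce in a few lines. Here's a sketch with a toy slice of the medals table (the counts and team sizes below are approximate, for illustration only), showing how three different metrics crown three different winners:

```python
# Toy slice of the Tokyo medals table: country -> (golds, total medals, team size).
# Numbers are approximate and for illustration only.
table = {
    "USA":   (39, 113, 613),
    "China": (38,  88, 406),
    "Japan": (27,  58, 582),
    "GB":    (22,  65, 376),
}

# Three invented metrics, three different "#1" countries
by_golds             = max(table, key=lambda c: table[c][0])                # raw gold count
by_golds_per_athlete = max(table, key=lambda c: table[c][0] / table[c][2])  # efficiency
by_gold_share        = max(table, key=lambda c: table[c][0] / table[c][1])  # golds as share of medals

print(by_golds, by_golds_per_athlete, by_gold_share)  # USA China Japan
```

The same pattern extends to any metric a headline writer might invent: define the key function, take the argmax, write the story.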
About 80 people pointed me to this post by Uri Simonsohn, Joe Simmons, and Leif Nelson about a 2012 article, "Signing at the beginning makes ethics salient and decreases dishonest self-reports in comparison to signing at the end." Apparently some of the data in that paper were faked; see for example here.

Uri et al. report some fun sleuthing:

The analyses we have performed on these two fonts provide evidence of a rather specific form of data tampering. We believe the dataset began with the observations in Calibri font. Those were then duplicated using Cambria font. In that process, a random number from 0 to 1,000 (e.g., RANDBETWEEN(0,1000)) was added to the baseline (Time 1) mileage of each car, perhaps to mask the duplication. . . .

The evidence presented in this post indicates that the data underwent at least two forms of fabrication: (1) many Time 1 data points were duplicated and then slightly altered (using a random number generator) to create additional observations, and (2) all of the Time 2 data were created using a random number generator that capped miles driven, the key dependent variable, at 50,000 miles.

This is basically the Cornell Food and Brand Lab without the snow. Uri et al. summarize:

We have worked on enough fraud cases in the last decade to know that scientific fraud is more common than is convenient to believe, and that it does not happen only on the periphery of science. Addressing the problem of scientific fraud should not be left to a few anonymous (and fed up and frightened) whistleblowers and some (fed up and frightened) bloggers to root out. The consequences of fraud are experienced collectively, so eliminating it should be a collective endeavor. What can everyone do? There will never be a perfect solution, but there is an obvious step to take: Data should be posted. The fabrication in this paper was discovered because the data were posted. If more data were posted, fraud would be easier to catch.
And if fraud is easier to catch, some potential fraudsters may be more reluctant to do it. . . . Until that day comes, all of us have a role to play. As authors (and co-authors), we should always make all of our data publicly available. And as editors and reviewers, we can ask for data during the review process, or turn down requests to review papers that do not make their data available. A field that ignores the problem of fraud, or pretends that it does not exist, risks losing its credibility. And deservedly so.

Their post concludes with letters from four of the authors of the now-discredited 2012 paper. All four of these authors agree that Uri et al. presented unequivocal evidence of fraud. Only one of the authors handled the data; this was Dan Ariely, who writes:

I agree with the conclusions and I also fully agree that posting data sooner would help to improve data quality and scientific accuracy. . . . The work was conducted over ten years ago by an insurance company with whom I partnered on this study. The data were collected, entered, merged and anonymized by the company and then sent to me. . . . I was not involved in the data collection, data entry, or merging data with information from the insurance database for privacy reasons.

Some related material

Lots of people sent me the above-linked post by Uri, Joe, and Leif. Here are a few related things that some people sent in: Kevin Lewis pointed to the betting odds at (see image at top of post). Gary Smith pointed to this very recent article, "Insurance Company Gives Sour AI promises," about an insurance company called Lemonade:

In addition to raising hundreds of millions of dollars from eager investors, Lemonade quickly attracted more than a million customers with the premise that artificial intelligence (AI) algorithms can estimate risks accurately and that buying insurance and filing claims can be fun . . .
The company doesn't explain how its AI works, but there is this head-scratching boast:

A typical homeowners policy form has 20-40 fields (name, address, bday…), so traditional insurers collect 20-40 data points per user. AI Maya asks just 13 Q's but collects over 1,600 data points, producing nuanced profiles of our users and remarkably predictive insights.

This mysterious claim is, frankly, a bit creepy. How do they get 1,600 data points from 13 questions? Is their app using our phones and computers to track everywhere we go and everything we do? The company says that it collects data from every customer interaction but, unless it is collecting trivia, that hardly amounts to 1,600 data points. . . .

In May 2021 Lemonade posted a problematic thread to Twitter (which was later deleted):

When a user files a claim, they record a video on their phone and explain what happened. Our AI carefully analyzes these videos for signs of fraud. [AI Jim] can pick up non-verbal cues that traditional insurers can't, since they don't use a digital claims process. This ultimately helps us lower our loss ratios (aka how much we pay out in claims vs. how much we take in).

Are claims really being validated by non-verbal cues (like the color of a person's skin) that are being processed by black-box AI algorithms that the company does not understand? There was an understandable media uproar since AI algorithms for analyzing people's faces and emotions are notoriously unreliable and biased. Lemonade had to backtrack. A spokesperson said that Lemonade was only using facial recognition software for identifying people who file multiple claims using multiple names.

I agree with Smith that this sounds fishy. In short, it sounds like the Lemonade people are lying in one place or another. If they're really only using facial recognition software for identifying people who file multiple claims using multiple names, then that can't really be described as "pick[ing] up non-verbal cues."
But yeah, press releases. People lie in press releases all the time. There could've even been some confusion: maybe the nonverbal-cues thing was a research idea that they never implemented, but the public relations writer heard about it and thought it was already happening.

The connection to the earlier story is that Dan Ariely works at Lemonade; he's their Chief Behavioral Officer, or at least he had this position in 2016. I hope he's not the guy in charge of detecting fraudulent claims, as it seems that he's been fooled by fraudulent data from an insurance company at least once in the past.

A couple people also pointed me to this recent Retraction Watch article from Adam Marcus, "Prominent behavioral scientist's paper earns an expression of concern," about a 2004 article, "Effort for Payment: A Tale of Two Markets." There were inconsistencies in the analysis, and the original data could not be found. The author said, "It's a good thing for science to put a question mark [on] this. . . . Most of all, I wish I kept records of what statistical analysis I did. . . . That's the biggest fault of my own, that I just don't keep enough records of what I do." It actually sounds like the biggest fault was not a lack of records of the analysis, but rather no records of the original data.

Just don't tell me they're retracting the 2004 classic, Garfield: A Tail of Two Kitties. I can take a lot of bad news, but Bill Murray being involved in a retraction? That's a level of disillusionment I can't take right now.

Why such a big deal?

The one thing I don't quite understand is why this latest case got so much attention. It's an interesting case, but so were the Why We Sleep story and many others. Also notable is how this seems to be blowing up so fast, as compared with the Harvard primatologist or the Cornell Food and Brand episode, each of which took years to play out.
Maybe people are more willing to accept that there has been fraud, whereas in these earlier cases lots of people were bending over backward to give people the benefit of the doubt? Also there's the dramatic nature of this fraud, which is similar to that UCLA survey from a few years ago. The Food and Brand Lab data problems were so messy . . . it was clear that the data were nothing like what was claimed, but the setup was so sloppy that nobody could figure out what was going on (and the perp still seems to live in a funhouse world in which nothing went wrong). I'm glad that Uri et al. and Retraction Watch did these careful posts; I just don't quite follow why this story got such immediate interest. One person suggested that people were reacting to the irony of fraud in a study about dishonesty?

The other interesting thing is that, as reported by Uri et al., the results of the now-discredited 2012 article failed to show up in an independent replication. And nobody seems to even care.

Here's some further background:

Ariely is the author of the 2012 book, The Honest Truth About Dishonesty: How We Lie to Everyone, Especially Ourselves. A quick google search finds him featured in a recent Freakonomics radio show called "Is Everybody Cheating These Days?", and a 2020 NPR segment in which he says, "One of the frightening conclusions we have is that what separates honest people from not-honest people is not necessarily character, it's opportunity . . . the surprising thing for a rational economist would be: why don't we cheat more?"

But . . . wait a minute! The NPR segment, dated 17 Feb 2020, states:

That's why Ariely describes honesty as something of a state of mind. He thinks the IRS should have people sign a pledge committing to be honest when they start working on their taxes, not when they're done.
Setting the stage for honesty is more effective than asking someone after the fact whether or not they lied.

And that last sentence links directly to the 2012 paper; indeed, it links to a copy of the paper sitting at Ariely's website. But the new paper with the failed replications, "Signing at the beginning versus at the end does not decrease dishonesty," by Kristal, Whillans, Bazerman, Gino, Shu, Mazar, and Ariely, is dated 31 Mar 2020, and it was sent to the journal in mid-2019.

Ariely, as a coauthor of this article, had to have known for at least half a year before the NPR story that this finding didn't replicate. But in that NPR interview he wasn't able to spare even a moment to share this information with the credulous reporter? This seems bad, even aside from any fraud. If you have a highly publicized study, and it doesn't replicate, then I think you'd want to be damn clear with everyone that this happened. You wouldn't want the national news media to go around acting like your research claims held up, when they didn't.

I guess that PNAS might retract the paper (that's what the betting odds say!), NPR will eventually report on this story, and Ted might take down the talks (no over-under action on this one, unfortunately), but I don't know that they'll confront the underlying problem. What I'd like is not just an NPR story, "Fraud case rocks the insurance industry and academia," but something more along the lines of:

We at NPR were also fooled. Even before the fraud was revealed, this study which failed to replicate was reported without qualification. This is a problem. Science reporters rely on academic scientists. We can't vet all the claims told to us, but at the very least we need to increase the incentives for scientists to be open to us about their failures, and reduce the incentives for them to exaggerate. NPR and Ted can't get everything right, but going forward we endeavor to be part of the solution, not part of the problem.
As a first step, we're being open about how we were fooled. This is not just a story we are reporting; it's also a story about us.

P.S. Recall the Armstrong Principle.

P.P.S. More on the story from Stephanie Lee.

P.P.P.S. And more from Jonatan Pallesen, who concludes, "This is a case of fraud that is completely bungled by ineptitude. As a result it had signs of fraud that were obvious from just looking at the most basic summary statistics of the data. And still, it was only discovered after 9 years, after someone attempted a replication. . . . This makes it seem likely that there is a lot more fraud than most people expect."

Here's a post describing an informative idea that Erik van Zwet (with the collaboration of me and Andrew G.) came up with in response to my post, "Is an improper uniform prior informative? It isn't by any accepted measure of information I know of":

One feature (or annoyance) of Bayesian methodology over conventional frequentism comes from its ability (or requirement) to incorporate prior information, beyond the prior information that goes into the data model. A Bayesian procedure that does not include informative priors can be thought of as a frequentist procedure, the outputs of which become misleading when (as seems so common in practice) they are interpreted as posterior probability statements. Such interpretation is licensed only by uniform ("noninformative") priors, which at best leave the resulting posterior as an utterly hypothetical object that should be believed only when data information overwhelms all prior information. That situation may arise in some large experiments in physical sciences, but is far from reality in many fields such as medical research.

Credible posterior probabilities (that is, ones we can take seriously as bets about reality) need to incorporate accepted, established facts about the parameters.
For example, for ethical and practical reasons, human clinical trials are only conducted when previous observations have failed to demonstrate effects beyond a reasonable doubt. For medical treatments, that vague requirement (imposed by IRBs and funding agencies) comes down to a signal-to-noise ratio (the true effect divided by the standard error of its estimator) that rarely exceeds 3 and is often much smaller, as discussed here. Adding in more specific information may change this ground state, but even modestly well-informed priors often yield posterior intervals that are appreciably shifted from frequentist confidence intervals (which are better named compatibility or uncertainty intervals) with the posterior mean being closer to the null relative to the maximum likelihood estimate. In that sense, using the uniform prior without actually believing it leads to overestimation in clinical trials, although a more accurate description is that the overestimation arises from the fact that a uniform prior neglects important prior information about these experiments.“Information” is a complex, multifaceted topic about which much has been written. In standard information theories (e.g. of Shannon, Fisher, Kullback-Leibler), it is formalized as a property of a sample given a probability distribution on a fixed sample space S, or as an expectation of such a property over the distribution (information entropy). As useful as these measures can be in classical applications (in which the information in data is the sole focus and the space of possible samples is fixed), from an informative-Bayes perspective we find there are more dimensions to the concept of information that need to be captured. Here, we want to discuss a different way to think about information that seems to align better with the idea of empirical prior information in Bayesian analyses.Suppose we want to choose a prior for a treatment effect β in a particular trial. 
Consider the finite multi-set (allowing that the same value might occur multiple times) S1 of such treatment effects in all clinical trials that meet basic, general validity (or quality) considerations, together with the frequency distribution p1 of effects in S1. We consider subsets Sk of S1 that meet certain further conditions, and their frequency distributions pk. The distributions pk can be obtained by conditioning p1 on Sk. Examples of such reference sets are:

S1. The effects in all RCTs
S2. The effects in all RCTs in intensive care
S3. The effects in all RCTs in intensive care with a parallel design
S4. The effects in all RCTs in intensive care in elderly patients
S5. The effects in all RCTs in intensive care in elderly patients with a parallel design

Prior p1 (with reference set S1) represents the information that we are considering the treatment effect in an RCT that meets the general considerations used to define S1. Prior p2 (with reference set S2) represents the additional information that the trial concerns intensive care. Since the pair (p2,S2) represents more information than (p1,S1), we could say it is more informative. More generally, consider two potential priors pk and pj that are the frequency distributions of reference sets Sk and Sj, respectively. If Sk is a strict subset of Sj, then we call the pair (pk,Sk) more informative than the pair (pj,Sj).

To give another example, we would call (p3,S3) more informative than (p2,S2). We believe that this definition agrees well with the common usage of the term "information" because (p3,S3) incorporates additional information about the design of the trial. But p3 is not necessarily more informative than p2 in the sense of Shannon or Fisher or Kullback-Leibler. To say it even more simply, there is no requirement that the variances in S1, S2, S3 form a non-increasing sequence. Carlos Ungil gave a clear example here.
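Here's a toy numerical sketch of the conditioning step (all numbers invented): restricting the reference set adds information in the authors' sense, but, as the example shows, it need not shrink the variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Invented "S1": treatment effects from all qualifying RCTs, tagged by context
is_icu = rng.random(n) < 0.2
is_parallel = rng.random(n) < 0.7
# Suppose ICU trial effects happen to be more dispersed than effects overall
effects = np.where(is_icu, rng.normal(0, 0.8, n), rng.normal(0, 0.4, n))

# Conditioning p1 on a subset Sk just restricts the reference set
S1 = effects
S2 = effects[is_icu]                # RCTs in intensive care
S3 = effects[is_icu & is_parallel]  # ... with a parallel design

# S3 is a subset of S2, which is a subset of S1, so (p3,S3) is more informative
# than (p2,S2) in the reference-set sense -- yet the spread GROWS from S1 to S2
print(S1.std(), S2.std(), S3.std())
```

In this simulation the standard deviation of S2 exceeds that of S1, illustrating that the reference-set ordering and the Shannon/Fisher-style orderings can disagree.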
We have defined only a partial ordering of "informativeness" on pairs (pk,Sk); for example, the pairs (p3,S3) and (p4,S4) would not be comparable because S3 and S4 are not subsets of each other. Our usage of the word "information" in relation to reference sets Sk is very similar to how a filtration in stochastic process theory is called "information." This is very different from information theory, where information is (like the mean and variance) a property of p alone, or relative to another distribution on the same sample space S. Both p and S are relevant when we want to think about the information in the prior.

In certain applications it can make sense to start with the set S0 of all logically possible but otherwise unspecified effects at the top of the hierarchy, where p0 is a uniform distribution over S0 or satisfies some criterion for minimal informativeness (such as maximum entropy) within a specified model family or set of constraints. For example, this can be appropriate when the parameter is the angle of rotation of photon polarity (thanks to Daniel Lakeland). However, in most applications in the life sciences (p0,S0) is not a sensible starting point, because the context will almost always turn out to supply quite a bit more information than either S0 or p0 does. For example, clinical trials reporting hazard ratios for treatment effects of, say, HR 1/20 or HR 20 are incredibly rare and typically fraudulent or afflicted by severe protocol violations. And an HR of 100 could represent a treatment for which practically all the treated and none of the untreated respond, and thus is far beyond anything that would be uncertain enough to justify an RCT; we do not do randomized trials comparing jumping with and without a parachute from 1000m up. Yet typical weakly informative priors assign considerable prior probability to hazard ratios far below 1/20 or far above 20.
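To quantify that last point: under a normal prior on the log hazard ratio, the prior mass outside HR between 1/20 and 20 is easy to compute. The normal(0, 10) default below is my own example of a flat "weakly informative" choice, not something from the post:

```python
import math

def mass_beyond_hr20(sd):
    """Prior probability that |log HR| > log(20) under a normal(0, sd) prior."""
    z = math.log(20) / sd
    # Two-sided tail of a standard normal via the complementary error function
    return math.erfc(z / math.sqrt(2))

# A very flat prior puts most of its mass on absurd hazard ratios;
# a tighter one keeps such values appropriately rare
print(f"normal(0, 10): {mass_beyond_hr20(10):.2f}")
print(f"normal(0, 0.7): {mass_beyond_hr20(0.7):.5f}")
```

With sd = 10, well over half the prior mass sits on hazard ratios beyond 20 or below 1/20; with sd = 0.7 that mass is negligible, which is closer in spirit to the reference priors the authors describe.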
More sensible yet still weak reference priors are available; for log hazard ratios (and log odds ratios) the simplest choices are in the conjugate family, which includes the logistic distribution and its log-F generalizations.

Biostatistician Jeff Morris writes:

I [Morris] have downloaded and evaluated the recent Israeli data in detail to explore what it tells us about efficacy vs. severe disease with the Delta variant. In spite of the fact that ~60% of those with severe infections are vaccinated (as emphasized by anti-vaxxers as well as people pushing third-dose boosters), the data clearly show the efficacy vs. severe disease is 85-90% in both younger and older age groups. I have written an article on my blog clearly explaining this paradoxical result step by step. My explanation illustrates the erroneous arguments made by people who just compare raw counts in discussing vaccine efficacy, and also highlights how Simpson's paradox rears its ugly head here: older people are both more vaccinated and at inherently higher risk of hospitalization, so any overall efficacy result is misleading if not stratified by age.

Morris's one-sentence summary:

Many are confused by results that >1/2 of hospitalized in Israel are vaccinated, thinking this means vaccines don't work. I [Morris] downloaded actual Israeli data and show why these data provide strong evidence vaccines strongly protect vs. serious disease.

I agree with Morris that this is a policy-relevant example of a general statistics principle.

They don't admit their mistakes. In particular, they don't admit when they've been conned.

1. Freakonomics

From 2009:

A Headline That Will Make Global-Warming Activists Apoplectic

The BBC is responsible. The article, by the climate correspondent Paul Hudson, is called "What Happened to Global Warming?" Highlights:

For the last 11 years we have not observed any increase in global temperatures.
And our climate models did not forecast it, even though man-made carbon dioxide, the gas thought to be responsible for warming our planet, has continued to rise. So what on Earth is going on?

And:

According to research conducted by Professor Don Easterbrook from Western Washington University last November, the oceans and global temperatures are correlated. . . . Professor Easterbrook says: "The PDO cool mode has replaced the warm mode in the Pacific Ocean, virtually assuring us of about 30 years of global cooling."

Let the shouting begin. Will Paul Hudson be drummed out of the circle of environmental journalists? Look what happened here, when Al Gore was challenged by a particularly feisty questioner at a conference of environmental journalists. We have a chapter in SuperFreakonomics about global warming and it too will likely produce a lot of shouting, name-calling, and accusations ranging from idiocy to venality. It is curious that the global-warming arena is so rife with shrillness and ridicule. Where does this shrillness come from? . . .

Ahhh, 2009. We were all so much younger then! We thought global warming was an open question. We used the word "shrill" unironically. We can't be blamed for our youthful follies.

Sure, back in 2009 when Dubner was writing about "A Headline That Will Make Global-Warming Activists Apoplectic," and Don Easterbrook was "virtually assuring us of about 30 years of global cooling," the actual climate-science experts were telling us that things would be getting hotter. The experts were pointing out that oft-repeated claims such as "For the last 11 years we have not observed any increase in global temperatures . . ." were pivoting off the single data point of 1998, but Dubner and Levitt didn't want to hear it.
Fiddling while the planet burns, one might say.

It's not that the experts are always right, but it can make sense to listen to their reasoning instead of going on about apoplectic activists, feisty questioners, and shrillness. But everyone makes mistakes. What bothers me about the Freakonomics team is not that they screwed up in 2009 but that they never seemed to have corrected themselves, or even realized how screwed up they were.

I found this interview from 2015 where one of the Freakonomics authors said:

I tell you what we were guilty of . . . We made fun of the environmentalists for getting upset about some other problem that turned out not to be true. But we didn't do it with enough reverence, or enough shame and guilt. And I think we pointed out that it's completely totally and actually much more religion than science. I mean what are you going to do about that? I think that's just a fact.

Typical nudgelord behavior. Yammering on about how rational they are, how special it is to think like an economist, but not willing to come to terms with their own mistakes. Best defense is a good offense, don't give an inch, etc. I hate that crap.

Just to be clear, I don't think that the Freakonomics authors are currently pushing any climate change denial (sorry, heresy). They appear to have been convinced by all the evidence that's convinced everyone else (for example, the rise in temperatures that contradicts the earlier claim they were pushing, about some climate pattern "virtually assuring us of about 30 years of global cooling"). Their problem is not hanging on to an earlier mistake but not acknowledging it, not wrestling with it.

We learn from our mistakes. But only when we're willing to learn. Or, to put it another way: we are all sinners. But we can only be redeemed when we confront the sins within ourselves.

2. Nudge

From the celebrated book from 2008:

Brian Wansink . . . hmmm, where have we heard that name before? But, sure, everybody makes mistakes.
The Nudge authors were fooled by Wansink and his masterpieces of science fiction, but so were NPR, the New York Times, Ted, and lots of other institutions. The Bush administration hired Wansink, but then again the Obama administration hired one of the Nudge authors. Getting conned was a bipartisan thing.

I assume the Nudge authors don't believe Wansink now. But my problem with them is the same as my problem with the Freakonomists: no reckoning with the past. Given the hype they showered on the now-disgraced food researcher (they described one of his experiments as "fiendish," which I guess is more accurate than they realized), and given that they have had the time to smear their critics by analogizing them to the former East German secret police, you'd think they could've taken a few hours, sometime in the past couple of years, to come to terms with the fact that . . . they. got. conned. By an Ivy League professor. How embarrassing. Best not to talk about it.

But we should talk about it. We can learn from our mistakes, if we're willing to do so.

Look. My point is not that everything in Nudge is wrong, or even that most of the things in Nudge are wrong. As the joke goes, all we know is that at least one sheep in Scotland is black on at least one side. That's not the point. The point is that, if the Nudge recommendations are based on evidence, and some of the star evidence has been faked, maybe it's worth reassessing your standards of evidence.

3. The Nudgelords

It's embarrassing to admit you've been conned. I get it. But . . . get over it!

Or, you might think: this is yesterday's news. The Freakonomics authors have moved on from climate change denial and the Nudge authors have left the school lunchroom behind. So why can't we? The reason we can't move on, why we shouldn't move on, is because of the next time.
And there will be a next time, when these Nudgelords are swept up in enthusiasm for some idea promoted by a suave storyteller who's unconstrained by the rules of scientific truth. I can't trust the Nudgelords because, if they can't come to terms with how they got fooled last time, why should I think they won't get fooled next time, in the very same way?

Avram Altaras writes:

This study is being quoted to justify the need for booster shots for age > 60 and the immunocompromised. From what I learned in the regression class, they should have added the prevalence of the Delta variant at the time of the PCR test as a variable in the regression, no?

Yes, I think so. Whenever you're trying to estimate a causal effect when comparing two groups, you should adjust for relevant pre-treatment variables.
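A toy simulation of the adjustment (everything here is invented for illustration): if, say, the vaccinated tend to be tested when Delta prevalence is higher, the raw group comparison mixes the treatment effect with the prevalence effect, while including the pre-treatment variable in the regression recovers the right sign and size.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Invented setup: the vaccinated happen to be tested when Delta
# prevalence is higher, so prevalence confounds the comparison
vaccinated = rng.integers(0, 2, n).astype(float)
delta_prev = np.clip(0.3 + 0.4 * vaccinated + rng.normal(0, 0.15, n), 0, 1)
# True model: vaccination lowers the outcome by 1, prevalence raises it by 3
outcome = 2.0 - 1.0 * vaccinated + 3.0 * delta_prev + rng.normal(0, 0.5, n)

# Unadjusted comparison of group means: badly biased (wrong sign here)
unadjusted = outcome[vaccinated == 1].mean() - outcome[vaccinated == 0].mean()

# Adjusted: regress the outcome on vaccination AND the pre-treatment variable
X = np.column_stack([np.ones(n), vaccinated, delta_prev])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"unadjusted: {unadjusted:+.2f}, adjusted: {beta[1]:+.2f}")
```

The unadjusted difference comes out positive (vaccination looks harmful), while the adjusted coefficient is close to the true value of -1: the point of Altaras's suggestion.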
