Mark Van Der Laan is a Professor of Biostatistics and Statistics at the University of California, Berkeley. In 2005 he won the COPSS Presidents' Award for his work. In 2006 he wrote a number of articles about the two surveys on deaths in Iraq since the 2003 invasion that became publicly known as the Lancet reports. The first Lancet paper was published in October 2004 and estimated 98,000 excess deaths in the 18 months following the overthrow of Saddam, excluding the province of Anbar. The second one argued there were 654,965 killed from March 2003 to July 2006. Van Der Laan was one of many who questioned the reliability of these surveys. Unfortunately, those critiques remained mostly academic, and most of the public never heard of them. Today, as violence is increasing in Iraq and the insurgency is making a comeback, the Lancet studies are being brought up again even though they have major flaws. Here is Prof. Van Der Laan explaining his views of the Lancet reports.
The two Lancet papers on deaths in Iraq created huge controversies with their large estimates of 98,000 killed from 2003-2004 and 654,965 from 2003-2006 (BBC)
1. Both Lancet reports used a very small sample size. The 1st Lancet went to 33 clusters of 30 households each, with each cluster representing roughly 739,000 people. The 2nd Lancet included 47 clusters of 40 households, for an average of 577,000 people per cluster. What can happen if only a few people are polled in a rather large population?
A statistical procedure maps the sample into an estimate of the number of deaths and a 95% confidence interval that is constructed in such a way that it will contain the true number of deaths (approximately) 95% of the time. This confidence interval takes into account the uncertainty in the estimate, and is therefore by far the most important output of a study. When the sample size (33 and 47 clusters) is small, the estimates of the number of deaths have a large standard error, and as a consequence the confidence intervals will be wide. In these cases the study might provide little information. This is exactly what happened in the first Lancet study, in which the confidence interval was given as 8,000-194,000, showing that the study could only claim with 95% confidence that there were more than 8,000 deaths, while the Iraq Body Count at that time was around 25,000. So the only sound conclusion of this first Lancet study is that it failed to provide any new information, even if we ignore the potential biases of this study.
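To illustrate how the width of such an interval depends on the number of clusters, here is a minimal Python sketch of a normal-approximation confidence interval for a mean. The per-cluster counts are invented for illustration and are not data from either Lancet survey; the function name `normal_ci` is likewise hypothetical.

```python
import math
import statistics

def normal_ci(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)           # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    return mean - z * se, mean + z * se

# Hypothetical deaths counted per cluster (NOT data from either survey).
clusters_33 = [0, 0, 1, 0, 2, 0, 0, 5, 0, 1, 0] * 3   # 33 clusters
clusters_330 = clusters_33 * 10                        # same spread, ten times the clusters

low_small, high_small = normal_ci(clusters_33)
low_large, high_large = normal_ci(clusters_330)

width_small = high_small - low_small
width_large = high_large - low_large
print(f"33 clusters:  interval width {width_small:.2f}")
print(f"330 clusters: interval width {width_large:.2f}")
```

With the same spread in the data, ten times as many clusters shrinks the interval to roughly a third of its width, since the standard error falls with the square root of the sample size.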
Because of its large range, the first Lancet study could only claim that 8,000 or more Iraqis were killed from 2003-2004. Van Der Laan’s work with the second Lancet data found a lower bound of around 290,000 dead, rather than the reported range of 426,369-793,663. (Reuters)
2. You mentioned the very wide ranges for the possible number of excess deaths in Iraq after 2003 that the Lancet papers came up with. The first one had a range of 8,000-194,000 killed, while the second one was from 426,369-793,663. You received some data on the 2006 Lancet from a few of the authors, and did your own statistical analysis. First, are there problems with having such a wide range, and second, what did you find from your study of the 2nd Lancet numbers?
These ranges of excess deaths are so-called confidence intervals, and in any scientific journal these represent the only reliable output of a study: i.e. the estimate of the number of excess deaths is itself not meaningful and can only be interpreted in combination with an estimated standard error of this estimate. So the scientific value of the results of the first Lancet study for the scientific community is that the number of deaths was somewhere between 8,000 and 194,000, and therefore this study had no news to report. To prevent such studies, which represent a waste of resources, one typically first carries out so-called sample-size calculations that determine what sample size is needed to end up with confidence intervals of small enough width so that the study can provide some valuable information to the larger scientific community. Clearly, that was not done.
However, as long as the study, including the statistical procedures employed, is scientifically sound, at least we can trust the confidence intervals. Our biggest concern with both studies was that they were suffering from seriously biased sampling or measurement error causing these out of whack estimates and biased confidence intervals, as was confirmed by the much larger and more reliable Iraq Living Conditions study in 2004 that contradicted the first Lancet study, and the WHO study in 2008 that contradicted the second Lancet study.
Having said this, as a statistician I was interested in the development of statistical methods for construction of confidence intervals that can be trusted when the sample size is so small, assuming a reliable random sample. The confidence intervals reported in the study rely on the sample averages across the 33 or 47 clusters having a so-called normal distribution. Therefore, I presumed that, even if we would ignore the potential biases of these studies, the reported confidence intervals should still be quite unreliable due to the small sample size and the highly non-normal distribution of the number of deaths across the clusters of households: the number of deaths across clusters had many zeros and many large numbers, clearly showing that this normality assumption is unreasonable. Therefore, I became interested as a statistician in developing more robust confidence intervals that could be used in future studies of this type. For a robust confidence interval method I found a lower bound of around 100,000 for the second Lancet study. Subsequently, with a post-doctoral student of mine at the time, Michael Rosenblum, we wrote an article, “Confidence Intervals for the Population Mean Tailored to Small Sample Sizes, with Applications to Survey Sampling,” also developing semi-robust confidence intervals that rely less on the normality assumption, but are not assumption free either, and applied them to this data set from the second Lancet study. Using this semi-robust method we found a lower bound for the confidence interval of around 290,000. Again, these modifications of the confidence intervals assume that the sample is a random sample and are thus by no means meant to correct for the other potential biases of the studies.
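As a rough illustration of a sample-size calculation, the textbook formula for estimating a mean to within a chosen margin of error is n = (z * sd / margin)^2. The sketch below applies it in Python; the planning values (a guessed standard deviation of 2.0 deaths per cluster and the two margins) are purely hypothetical, and the formula ignores design effects that a real cluster survey would need to account for.

```python
import math
import statistics

def clusters_needed(sd, margin, level=0.95):
    """Smallest n for which z * sd / sqrt(n) is at most `margin`."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return math.ceil((z * sd / margin) ** 2)

# Hypothetical planning values, not taken from either Lancet survey.
sd_guess = 2.0   # guessed standard deviation of deaths per cluster

n_wide = clusters_needed(sd_guess, margin=0.5)
n_narrow = clusters_needed(sd_guess, margin=0.25)
print(n_wide, n_narrow)
```

Halving the acceptable margin of error roughly quadruples the number of clusters required, which is why this calculation is normally done before any fieldwork begins.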
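The concern about normality can be illustrated with a small simulation. The Python sketch below is not the robust or semi-robust method described above; it only checks how often a standard normal-approximation 95% interval contains the true mean when cluster counts are mostly zero with occasional large values. The distribution in `draw_cluster` is invented for illustration and does not model the Lancet data.

```python
import math
import random
import statistics

def draw_cluster(rng):
    """One invented cluster count: zero 90% of the time, otherwise large."""
    return rng.expovariate(1.0) * 10 if rng.random() < 0.1 else 0.0

def coverage(n_clusters=47, n_sims=2000, seed=0):
    """Share of simulated normal-approximation 95% CIs containing the true mean."""
    rng = random.Random(seed)
    true_mean = 0.1 * 10   # P(nonzero) times the mean of the nonzero part
    z = statistics.NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(n_sims):
        sample = [draw_cluster(rng) for _ in range(n_clusters)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n_clusters)
        if mean - z * se <= true_mean <= mean + z * se:
            hits += 1
    return hits / n_sims

estimated_coverage = coverage()
print(f"Nominal 95% interval covered the true mean in {estimated_coverage:.1%} of runs")
```

Under these assumptions the interval should fall short of its nominal 95% coverage, which is the kind of unreliability that motivates more robust interval methods for small, skewed samples.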
3. You had problems with some of the fieldwork. Many studies explain how they do their survey work. The Lancet papers did not. Later, the authors said that their teams were able to complete a cluster of 40 houses in one day in 2006. Was that enough time to question a household?
This is hard to judge without an explanation of exactly how this study was run. Clearly, this is not enough time if it was done by one single team: as we wrote, “They moved from one household to the other within the context of tribal communities brimming with distrust, explained their mission and succeeded in gaining access to the living quarters, got into a person’s confidence, asked for intimate experiences, listened to personal stories of loss and grief – and all this within 18 minutes per household, assuming a 12 hour workday.” I read in your blog that the authors have been giving contradictory statements about this simple fact (i.e., single or multiple teams) as well.
The biggest concern we had at the time was that we needed to know how the experiment was carried out in order to judge whether the sample was a reliable random sample. For example, what questionnaire was used, how were the interviews conducted, were the interviewers supervised, were the houses a priori randomly sampled with no preference for areas with high violence, were the counted deaths independently verified, when visiting a village, how did one arrange to only count the a priori specified households without upsetting the local population due to having to skip households in which violent deaths occurred, and so on. Many have tried to get answers, but crucial information was simply not provided.
It is the responsibility of the designers of the study to document the operations of the study and to be completely open about it so that the scientific community is able to assess the scientific validity of the study. To make a long story short, Professor Spagat has investigated the lack of scientific validity of these Lancet studies in detail and published on it. Eventually, after many years of pointing out the issues, and pressure from various journalists, the American Association for Public Opinion Research conducted an 8-month study and stated that the main author of the Lancet study repeatedly refused to make public essential facts about his research on civilian deaths in Iraq. Shortly thereafter Johns Hopkins University itself, where the main author is a faculty member, publicly stated that this study had violated scientific standards, and the main author was censured accordingly by his own university.
Indeed, when Leon de Winter asked me to evaluate these studies from a statistical perspective, I wanted to get answers about how the fieldwork was conducted. Not getting these answers, and the manner in which the articles were written, made me uncomfortable about the validity of these studies. Hearing a radio interview in which one of the authors stated that he was afraid due to tensions in the village, lying low in his car to stay out of sight while the field workers were doing their interviews without supervision, only made me more suspicious about the scientific validity of the operations.
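The 18-minute figure in the quote follows directly from the stated assumptions, as this short Python check shows. It assumes a single team and leaves no time for travel between households, which is the reading the quoted article takes.

```python
HOUSEHOLDS_PER_CLUSTER = 40   # second Lancet survey, as stated in the question
WORKDAY_HOURS = 12            # the workday assumed in the quoted article

minutes_per_household = WORKDAY_HOURS * 60 / HOUSEHOLDS_PER_CLUSTER
print(minutes_per_household)  # 18.0
```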
4. The authors claimed that violence was equally spread throughout Iraq, and that they would cover all areas of the country as a result. What are some problems with this argument?
I do not believe they claimed that violence was equally spread throughout the country, since everybody knew that was not the case. However, regarding the design of the study, we found it remarkable that a priori knowledge about the places in which violence has been prevalent was not used in the design of the analysis: a better design would have been a stratified sample that samples frequently in areas in which violence was known to be prevalent and samples less in areas in which it has been relatively calm. Especially, given that the design sampled few clusters, such considerations are extremely important and would have resulted in smaller confidence intervals.
As we wrote in our article, “Instead the authors advertise the selected design of the study as the accepted standard, which would then wrongly imply that it makes sense to sample as many clusters of households in areas in which violence is non-existent as in violent areas, as long as these areas have the same population size.” Their design should have used this available information.
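A stratified design of this kind can be sketched with a standard textbook technique, Neyman allocation, which samples each stratum in proportion to its population share times its variability. This is offered only as one common way to use prior knowledge, not as the specific design the article proposed. The two strata, their population shares and their standard deviations below are invented for illustration, and allocations are left fractional for simplicity.

```python
def stratified_variance(weights, sds, allocation):
    """Variance of a stratified mean: sum of W_h^2 * S_h^2 / n_h over strata."""
    return sum(w ** 2 * s ** 2 / n for w, s, n in zip(weights, sds, allocation))

# Hypothetical strata: [calm areas, violent areas].
weights = [0.7, 0.3]   # assumed population shares
sds = [0.5, 5.0]       # assumed standard deviations of deaths per cluster
total_clusters = 47

# Proportional allocation: clusters follow population size only.
proportional = [total_clusters * w for w in weights]

# Neyman allocation: clusters follow population share times variability.
products = [w * s for w, s in zip(weights, sds)]
neyman = [total_clusters * p / sum(products) for p in products]

var_proportional = stratified_variance(weights, sds, proportional)
var_neyman = stratified_variance(weights, sds, neyman)
print(f"proportional: {var_proportional:.3f}, Neyman: {var_neyman:.3f}")
```

With these made-up inputs, shifting clusters toward the more variable stratum cuts the variance of the estimate by more than half for the same total of 47 clusters, which translates into a narrower confidence interval.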
5. The Lancet authors wrote that 80% of the deaths recorded in 2004 had death certificates. You didn’t think that was a believable figure, and even if it was it offered a missed opportunity for the researchers. Can you explain those two points?
Firstly, in order to trust the deaths that were counted in these Lancet studies it is important that these were verified in some reliable way. The authors reported that 80% of the counted deaths were verified with a death certificate. If we accept that number, and we also accept their predicted number of 650,000 “excess deaths” in Iraq in the post-invasion period, as reported by the second Lancet survey, then we should also accept that more than 500,000 additional death certificates have been issued by government organizations. As Leon de Winter and I wrote in our article: “In other words, these government organizations are somehow hiding all of these certificates from the public.” In addition, if 80% of the deaths are actually reported to the government and result in an official death certificate, why are we then not just counting the death certificates, which would be a much better study than trying to obtain insight through random sampling of relatively few households?
6. You broke down one 12-month period from the 2nd Lancet. There were an estimated 330,000 people killed from June 2005-June 2006. That breaks down to 27,500 per month, 6,875 per week, 982 per day. According to the authors that would also mean 40,000 died from Coalition air strikes, 60,000 from car bombs, 40,000 from explosions, and 174,000 from gunshots. Based upon press reports, Iraq Body Count only had 22,030 deaths for that period, or an average of 60 per day. The media can never cover all casualties in a country, especially in a war zone, but do you think that it could miss that magnitude of violence?
As Leon and I noted in our article: “These are extremely large numbers – and we would have to believe that the hundreds of independent radio stations, TV-stations, newspapers and magazines that operated in Iraq did not notice these massacres. According to this survey, American air strikes must have erased whole neighborhoods without the press noticing.” As you can imagine, the Iraq Body Count was not happy with these claims by the Lancet study. As we now know, the much larger and better designed WHO study in 2008 came up with a much smaller estimate of around 150,000 instead of 650,000, even though it used a larger population and a significant ad hoc upward adjustment that was not used in the Lancet study. As Professor Spagat shows, the two central estimates of these two studies actually differ by a factor of 6.6 when put on a comparable basis.
7. Most studies take months of peer review before they are published. The first Lancet was finished in September 2004, and published the next month. The second Lancet seemed to have a quick turnaround from its completion to its appearance in The Lancet as well. What was happening in the U.S. when these papers were published, and do you think that was a coincidence?
It is unheard of to publish a paper within a month, let alone such a high profile paper that is known to have such an impact on the country. And one should seriously wonder if it was a coincidence that four days after The Lancet took the world’s headlines with this survey, the American people would vote for their new president. We also have to keep in mind that both the editor of the Lancet and one of the authors were politically active. It makes one wonder if the study was done under enormous time pressure and the publication was pushed out that fast to get it out before the election. The second Lancet study was again published in October, right before the 2006 elections. Both publications made national headlines, and had an enormous impact on society, while both publications should have raised serious warnings. For example, the first Lancet study contradicted the much larger Iraq Living Conditions study that was occurring at the same time. The Lancet should have been alarmed, and in that manner they could have prevented the second Lancet study, which made even more dramatic claims, and again was heavily contradicted by a much larger reliable study.
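The per-month, per-week and per-day figures in the question can be reproduced with a few lines of Python. The variable names are illustrative only; the inputs are the numbers quoted in the question.

```python
lancet_12_months = 330_000    # estimated deaths, June 2005-June 2006
ibc_12_months = 22_030        # Iraq Body Count total for the same period

per_month = lancet_12_months / 12   # 27,500
per_week = per_month / 4            # 6,875 (treats a month as four weeks)
per_day = per_week / 7              # about 982

# Dividing by 365 directly gives a somewhat lower daily figure, about 904.
per_day_calendar = lancet_12_months / 365

ibc_per_day = ibc_12_months / 365               # about 60
ratio = lancet_12_months / ibc_12_months        # about 15

print(f"{per_month:.0f} per month, {per_week:.0f} per week, {per_day:.0f} per day")
print(f"IBC: {ibc_per_day:.0f} per day; Lancet estimate is {ratio:.0f} times the IBC count")
```

Whichever daily figure is used, the survey estimate is about fifteen times the number of deaths recorded from press reports for the same twelve months.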
8. You mentioned a few other surveys conducted in Iraq during this time period that also tried to estimate the number of killed during the war. Can you explain some of their findings, and how they compared to the Lancet in terms of their fieldwork?
Right before the first Lancet study, there had been a much larger Iraq Living Conditions 2004 survey sampling 22,000 households instead of fewer than 1,000. It reported a predicted death count of 24,000 with a 95 percent confidence interval from 18,000-29,000, completely aligned with the Iraq Body Count death toll. These results were published shortly after the publication of the first Lancet study. That is, a much more reliable study yields a predicted death count a factor of 4 smaller than the predicted death count in The Lancet’s 2004 article. Of course, one should wonder why the 100,000 from the Lancet study became world news, in spite of the confidence interval 8,000-194,000, while this large study was conveniently ignored by the media and the authors of the Lancet study.
Regarding the WHO study that contradicted the second Lancet study, I quote from a letter by Neil Johnson, David Kane, Seppo Laaksonen, Mark van der Laan, Peter Lynn, Fritz Scheuren, and Michael Spagat that we had submitted to Science magazine, asking for an independent investigation of the second Lancet study:
“John Bohannon’s article “Calculating Iraq’s Death Toll: WHO Study Backs Lower Estimate” (18 January 2008, p. 273) exposes some serious weaknesses in the second (2006) Lancet study (1) of Iraq mortality (L2). The WHO study (2) has a much larger sample, is much better supervised and uses sampling methods that are greatly superior to the published methods of the L2 survey (3). The WHO team even seems to bend over backwards to minimize the distance between the two surveys. Yet even WHO’s apples-with-oranges comparison, 151,000 estimated violent deaths for WHO versus 601,000 for L2, leaves L2 exceeding WHO by 450,000 violent deaths.
Actually the factor-of-four difference greatly understates the true discrepancy between WHO and L2 for two main reasons. First, the WHO study applies quite a substantial upward adjustment to its estimate to compensate for an assumed reporting bias that is “common in household surveys”. However, if this generic argument applies to the WHO survey then it applies with equal force to the L2 survey. Second, WHO applies its estimated violent mortality rate to a larger population estimate than does L2. We address these distortions by comparing estimated violent-death rates, rather than totals, in the two surveys. L2 reports a post-war violent death rate of 7.2 per 1,000 per year compared to 1.09 per 1,000 per year for WHO. The two central estimates differ by a factor of 6.6 when put on a comparable basis.”
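The figures in the letter can be checked with simple arithmetic. The Python sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
who_violent_deaths = 151_000
l2_violent_deaths = 601_000

gap = l2_violent_deaths - who_violent_deaths           # 450,000
ratio_of_totals = l2_violent_deaths / who_violent_deaths

# Violent-death rates per 1,000 people per year, as quoted in the letter.
l2_rate = 7.2
who_rate = 1.09
ratio_of_rates = l2_rate / who_rate

print(f"gap: {gap:,}")
print(f"ratio of totals: {ratio_of_totals:.2f}")   # close to four
print(f"ratio of rates:  {ratio_of_rates:.1f}")    # 6.6
```

Comparing rates instead of totals removes the effect of the different population estimates, which is why the ratio rises from about four to 6.6.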
9. The Lancet papers received a huge amount of press when they were released. They are still mentioned to the present day even though you and others have found major flaws with them. How can you explain the staying power of the two papers?
The Lancet should not have published these articles. The fact that these articles were published in the Lancet suggests enormous credibility and allows people to refer to them as if they represent scientific truth. Combined with the enormous publicity these articles received in the media, it makes a powerful story out there for people to use.
Another issue is that countering a dramatic story does not receive much support from the media. At the time, Leon de Winter and I submitted our article to lots of newspapers right after the publication of the second Lancet study, but none of them were willing to publish our response. For example, The New York Times had asked me to comment on the second Lancet study the day before it would reach the national headlines, and they were grateful for my comments, which made them decide to move it away from the headlines, in contrast to most other newspapers. However, they were not willing to publish our response. Part of the problem might have been that it did not represent what people wanted to hear at the time, and newspapers apparently need to be sensitive to that. Interestingly, our article somehow reached people through the Internet and that is probably how you ran into it. So it found its own life. Similarly, when a few years later we submitted a letter to Science to ask for an independent investigation of the second Lancet study, it was again rejected. Discrediting what once was headline news is not an easy task and takes time and effort from many people who care about truth.
Van Der Laan, Mark, “‘Mortality after the 2003 invasion of Iraq: A cross-sectional cluster sample survey’, by Burnham et al (2006, Lancet, www.thelancet.com): An Approximate Confidence Interval for Total Number of Violent Deaths in the Post Invasion Period,” Division of Biostatistics, University of California, Berkeley, 10/26/06
Van der Laan, Mark, de Winter, Leon, “Lancet,” November 2006
- “Statistical Illusionism,” U.C. Berkeley, 2006
Spagat, Michael, “Ethical and Data-Integrity Problems In The Second Lancet Survey of Mortality in Iraq,” Defense and Peace Economics, February 2010
- “Mainstreaming an Outlier: The Quest to Corroborate the Second Lancet Survey of Mortality in Iraq,” Department of Economics, University of London, February 2009
Rosenblum, Michael, van der Laan, Mark J., “Confidence Intervals for the Population Mean Tailored to Small Sample Sizes, with Applications to Survey Sampling,” The International Journal of Biostatistics, 2009