Statistical Bullshit

Excel and SPSS Help!

It is becoming that time of the semester where my classes are more time-consuming than I expect. For that reason, I am shifting gears (slightly), and I am going to focus more on developing statistical guides for Excel and SPSS on MattCHoward.com. Fortunately, I already have a few guides made. So, if you need help with calculating statistics in Excel and SPSS…

Click here for the Excel help page.

And click here for the SPSS help page.

As always, if you have any questions, comments, stories, or anything else, feel free to contact me at MHoward@SouthAlabama.edu. I always have time to write about Statistical Bullshit stories, even if I am focused on creating statistical guides!

Bullshit Outliers

Once upon a time, I asked a colleague for their dataset featured in a published article. The article produced a significant correlation between two variables, and I wanted to reproduce the findings using their dataset. Lo and behold, I was able to successfully replicate the correlation! However, when I further inspected their data, I noticed a particular concern. It seemed that the significance of the relationship hinged on a single value that could be considered an outlier. I struggled with whether I should bring this up to my colleague, but I finally did after much thought. Their response educated me on outliers, but it also exposed me to a new type of Statistical Bullshit – and how reasonable statistical practices could be confused for Statistical Bullshit.

Today’s post first discusses Statistical Bullshit surrounding outliers. Then, I summarize the discussion that I had with my colleague, and why things are not always as they appear when it comes to outliers and Statistical Bullshit!

While many different types of outliers can be classified, today’s post discusses three. The first uses the following dataset: Click Here for Dataset. In this dataset, we have job satisfaction and job performance recorded for 29 employees. Each scale ranges from 0 to 100. When we calculate a correlation between these two variables, we get a perfect and statistically significant relationship (r = 1.00, p < .01). But let’s look at a scatterplot of this data.

Hmm, clearly something is wrong! This is because one employee had missing data for both variables, and the missing data was coded as 9999; however, when the analyses were performed, the 9999 was not properly removed and/or the program was not told that 9999 represented missing data. When we run the analyses again with the outlier removed, the correlation is small and not statistically significant (r = .08, p > .05).

I label this type of outlier as a researcher error outlier. Numerically it is an outlier, but it does not represent actual data. In all cases, this outlier should certainly be removed.
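
To make the researcher error outlier concrete, here is a minimal sketch in Python. The scores are made up for illustration (they are not the dataset linked above), and the helper names pearson_r and drop_missing are my own. The only detail taken from the example above is that missing data was coded as 9999.

```python
import math

MISSING_CODE = 9999  # missing-data code used in the example above


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def drop_missing(xs, ys, code=MISSING_CODE):
    """Keep only the pairs where neither value equals the missing-data code."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x != code and y != code]
    return [x for x, _ in pairs], [y for _, y in pairs]


# Made-up scores; the last employee is missing on both variables.
satisfaction = [12, 25, 7, 30, 18, 22, 9, 15, MISSING_CODE]
performance = [20, 11, 28, 16, 5, 27, 14, 23, MISSING_CODE]

r_raw = pearson_r(satisfaction, performance)
clean_sat, clean_perf = drop_missing(satisfaction, performance)
r_clean = pearson_r(clean_sat, clean_perf)

print(f"with 9999 left in: r = {r_raw:.3f}")
print(f"with 9999 removed: r = {r_clean:.3f}")
```

With the code left in, the single 9999 pair dominates everything else and the correlation is pushed to nearly 1. Once it is treated as missing, the correlation reflects only the real scores.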

Next, let’s use the following dataset to discuss a second type of outlier: Click Here for Dataset. Again, we have job satisfaction and job performance recorded for 29 employees. Each scale ranges from 0 to 100. When we calculate a correlation between these two variables, we get a very strong and statistically significant relationship (r = .76, p < .01). But let’s look at a scatterplot of this data.

Interesting. We certainly have an outlier, but it is not clearly “wrong.” Instead, it seems that most of the sample falls within the range of 0-30 for each variable, but one person had a value of 100 for both. When we run the analyses again with the outlier removed, the correlation is small and not statistically significant (r = .01, p > .05).

But, before being satisfied with our removal, we should strongly consider what this means for our data. The occurrence of the one person certainly throws off our results, but this one person does indeed represent actual, meaningful data. So, can we justify removing this person? This question can be partially answered by determining whether we are interested in all employees or typical employees. If we are interested in all employees, then the outlier should certainly stay in the dataset. If we are interested in typical employees, then the outlier could reasonably be removed. No matter the decision, however, researchers and practitioners should report all of their analytical decisions, so that readers are aware of any changes made to the data before further analyses were conducted.

I label this type of outlier as an extreme outlier.
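
One way to report the decision openly is to present the result both ways. The sketch below is only an illustration: the scores are made up, the function name sensitivity_report is my own, and the rule for who counts as "atypical" (above 30 on both scales) is a hypothetical choice that a real analysis would need to justify.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sensitivity_report(xs, ys, is_atypical):
    """Return n and r for all cases and for the cases not flagged as atypical."""
    kept = [(x, y) for x, y in zip(xs, ys) if not is_atypical(x, y)]
    return {
        "n_all": len(xs),
        "r_all": pearson_r(xs, ys),
        "n_typical": len(kept),
        "r_typical": pearson_r([x for x, _ in kept], [y for _, y in kept]),
    }


# Made-up scores: eight employees between 0 and 30, plus one at 100 on both.
satisfaction = [12, 25, 7, 30, 18, 22, 9, 15, 100]
performance = [20, 11, 28, 16, 5, 27, 14, 23, 100]

# Hypothetical rule: above 30 on both scales counts as atypical.
report = sensitivity_report(
    satisfaction, performance, lambda x, y: x > 30 and y > 30
)

print(f"all employees:     n = {report['n_all']}, r = {report['r_all']:.2f}")
print(f"typical employees: n = {report['n_typical']}, r = {report['r_typical']:.2f}")
```

Reporting both rows lets readers see how much the conclusion depends on the one extreme person, whichever version the authors prefer.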

Lastly, let’s use the following dataset to discuss a third type of outlier: Click Here for Dataset. Again, we have job satisfaction and job performance recorded for 29 employees. Each scale ranges from 0 to 100. When we calculate a correlation between these two variables, we get a moderate and statistically significant relationship (r = .32, p < .01). But let’s look at a scatterplot of this data.

Now, we certainly have an outlier, but it is much closer to the other values. Most of the sample falls within the range of 0-30 for each variable, but one person had a value of 43 for both. When we run the analyses again with the outlier removed, the correlation is small and not statistically significant (r = .07, p > .05).

Like the prior case, we need to strongly consider whether we should remove this participant. It would be much more difficult to argue that this person represents unreasonable data, and it may even be difficult to argue that this person represents data that deviates from the typical population. Yes, the person is a little extreme, but they are not drastically different. For this reason, it is likely that we want to keep this observation within our sample. I label this type of outlier as a reasonable outlier.
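
If you want to check whether a correlation hinges on a single observation, one simple diagnostic is to recompute it with each person left out in turn. This is a rough sketch with made-up scores and my own function names. It only shows how influential each person is; it does not answer whether that person should be removed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def leave_one_out_r(xs, ys):
    """Return (index, r computed without that observation) for every case."""
    results = []
    for i in range(len(xs)):
        xs_rest = xs[:i] + xs[i + 1:]
        ys_rest = ys[:i] + ys[i + 1:]
        results.append((i, pearson_r(xs_rest, ys_rest)))
    return results


# Made-up scores: eight employees between 0 and 30, plus one at 43 on both.
satisfaction = [12, 25, 7, 30, 18, 22, 9, 15, 43]
performance = [20, 11, 28, 16, 5, 27, 14, 23, 43]

r_full = pearson_r(satisfaction, performance)
loo = leave_one_out_r(satisfaction, performance)

# The case whose removal moves r the most is the most influential one.
most_influential = max(loo, key=lambda item: abs(item[1] - r_full))

print(f"full sample: r = {r_full:.2f}")
for index, r_without in loo:
    print(f"without case {index}: r = {r_without:.2f}")
print(f"most influential case: {most_influential[0]}")
```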

So, how is this relevant to Statistical Bullshit? Well, for researcher error outliers, the entire significance of a relationship could be built on a single mistake. Large decisions could be made based on nothing factual at all. Similarly, for extreme outliers, our relationship could largely be driven by a single person, and our decisions could be overly influenced by this single person. Lastly, for reasonable outliers and some extreme outliers, we could choose to remove these observations, which could result in a very different relationship on which we could base our decisions. However, our decisions would be based on only a portion of the sample, and we could be missing out on very important aspects of the population. Thus, both not removing and removing outliers could result in Statistical Bullshit!

To bring this post full-circle, what happened when I chatted with my colleague? As you probably guessed, the outlier was determined to be a reasonable outlier. Certainly an outlier, but not enough to be confidently considered outside the typical population range. After our conversation, I certainly saw their point, and felt that they made the correct decision with their analyses – and it helped me understand how to conduct my analyses in the future.

Well, that’s all for today! If you have any questions, comments, or stories, please email me at MHoward@SouthAlabama.edu. Until next time, watch out for Statistical Bullshit!

More Issues with P-Values

What did Cohen (1994) have to say in The Earth is Round (p < .05)?

Few statistical topics have spurred as much controversy as p-values. For this reason, I felt that another post on StatisticalBullshit.com about p-values could be helpful to all readers, and myself! I always learn new things about Statistical Bullshit when I review p-values, and I hope you learn a little bit from reading this post. If you have any questions or comments about this post (or anything else), please email me at MHoward@SouthAlabama.edu. I’ll do my best to reply ASAP.

An overview of Statistical Bullshit associated with p-values can start with Cohen’s (1994) classic article, “The Earth is Round (p < .05).” The title is meant to be a pun on p-values. Given that we cannot obtain multiple measurements about the roundness of the earth and apply statistical tests (under normal approaches), it is impossible to determine the statistical significance of the earth’s roundness – but we know it is round. So, does that mean that there are certain important conclusions that p-values cannot derive? Probably so!

The first line of Cohen’s abstract describes null hypothesis significance testing (NHST), which makes research inferences based on p-values, as a “ritual” which makes a “mechanical dichotomous decision around a sacred .05 criterion.” Cohen makes his disdain for p-values and NHST obvious from the beginning. To immediately support his claims, in the introduction, Cohen also cites previous authors who state that “everybody knows” the concerns of p-values, and they are “hardly original.” My favorite quotation is from Meehl, who claims that NHST is “a potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring” (1967, p. 265). Quite the statement! And the earliest of these citations is from 1938! For a long time, researchers have been foaming at the mouth over p-values.

After the initial damnation of p-values, Cohen gets into the actual concerns. First, he notes that people often misunderstand p-values. He notes the logic of NHST, when p-values are significant, is:

If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.

Cohen dislikes this logic, and he notes that it can lead to problematic conclusions. In the article, Cohen provides a slightly unusual situation that illustrates the problem in the above reasoning. I am going to provide a similar situation which results in the same conclusions, but it (hopefully) will make more sense to readers than Cohen’s example.

Most employees will not be on the Board of Directors in a company. But, let’s say that we sampled a random person, and they were on the Board of Directors within a company. Therefore, we can state the following, using the logic of NHST.

If a person is employed, then (s)he is probably not on the Board of Directors.
This person is on the Board of Directors.
Therefore, (s)he is probably not employed.

We know that, if an individual is on the Board of Directors, then they are employed. So, the logic of NHST led us to an inappropriate conclusion.

Second, Cohen notes that the meaning of p-values, “the probability that the data could have arisen if the null hypothesis is true,” is not the same as “the probability that the null hypothesis is true given the data.” Unfortunately, researchers often mistake p-values for the latter, which leads to some problematic inferences.

To demonstrate the problem with this thinking, Cohen provides an example with schizophrenia. Assume that schizophrenia arises in two percent of the population, and assume we have a tool that has a 95% accuracy in diagnosing schizophrenia and even a 97% accuracy in declaring normality. Not bad! When we use the tool, our null hypothesis is that an individual is normal, and the research hypothesis is that the individual is schizophrenic. So, the (problematic) logic is: when the test is significant, the null hypothesis is not true.

However, when we work through the math for a sample of 1000, some problems arise. In a sample of 1000, the number of schizophrenics is most likely 20. The tool would correctly identify 19 of them and label one as normal. This result is not overly concerning. Alternatively, in the sample of 1000, the number of normal individuals is most likely 980. The tool would, most likely, correctly identify 951 normal individuals as normal (980 * .97); however, 29 individuals would be labeled as schizophrenic! So, of the 48 people identified as schizophrenic (19 + 29), about 60 percent of them would actually be normal! The problem in this example is that we assume the null hypothesis is false given the data (a significant result), when the test only tells us the probability that the data could have arisen given that the null hypothesis is true (a three percent chance, because the tool mislabels three percent of normal individuals).
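
The arithmetic in this example can be checked with a few lines of Python. This is a sketch using the numbers given above (a two percent base rate, 95% accuracy for schizophrenics, 97% accuracy for normal individuals). The names sensitivity and specificity are the usual labels for those two accuracies and are my addition, as is the function name. The code keeps expected counts as decimals rather than rounding to whole people.

```python
def positive_test_breakdown(n, base_rate, sensitivity, specificity):
    """Expected counts among the people the tool labels as schizophrenic."""
    cases = n * base_rate                            # people with schizophrenia
    normals = n - cases                              # people without it
    true_positives = cases * sensitivity             # cases correctly flagged
    false_positives = normals * (1 - specificity)    # normals wrongly flagged
    flagged = true_positives + false_positives
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "flagged": flagged,
        "share_normal_among_flagged": false_positives / flagged,
    }


result = positive_test_breakdown(
    n=1000, base_rate=0.02, sensitivity=0.95, specificity=0.97
)

print(f"correctly flagged: {result['true_positives']:.1f}")
print(f"wrongly flagged:   {result['false_positives']:.1f}")
print(f"share of flagged people who are normal: "
      f"{result['share_normal_among_flagged']:.1%}")
```

The tool is wrong about a normal person only three percent of the time, yet roughly six in ten of the people it flags are normal, because normal individuals vastly outnumber schizophrenics in the sample.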

Third, Cohen notes that the null hypothesis is almost always that no effect exists. But is this stringent enough to test hypotheses? Cohen argues that it is not. He notes that, given enough participants, anything can be significantly different from nothing. Also, given enough participants, the relationship of anything with anything else is greater than nothing. So, if authors find (p > .05), they can just add more participants until (p < .05).
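
To see how sample size alone can produce significance, here is a small sketch that holds a trivial correlation fixed and only increases the number of participants. The correlation of .05 and the sample sizes are hypothetical, and the p-value comes from the Fisher z approximation rather than the exact t-test, so treat the output as approximate.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0 (Fisher z method)."""
    z = math.atanh(r) * math.sqrt(n - 3)     # Fisher z divided by its standard error
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area


tiny_r = 0.05  # hypothetical, deliberately trivial correlation

for n in (30, 300, 3000, 30000):
    print(f"n = {n:>6}: r = {tiny_r:.2f}, p = {approx_p_value(tiny_r, n):.4f}")
```

The effect size never changes in this loop; only the sample size does. That is also why a smaller p-value does not, by itself, indicate a larger effect.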

Fourth, Cohen notes that, when only observing p-values, results cannot be “very statistically significant” or “strongly statistically significant,” although authors like to make similar claims when the p-value is less than 0.01 or 0.001. Instead, results can only be statistically significant or not when analyzing p-values. This is because p-values are derived from magnitude, variance, and sample size. If one p-value is significant at a 0.05 level and another at a 0.001 level, the results do not necessarily mean that the magnitude of the second effect was larger. Instead, the magnitude of the effect can be identical, with the second result simply based on a larger sample. As the sample size does not reflect the effect itself, it is inappropriate to say that the former is only “statistically significant” and the latter is “very statistically significant.” This is a mistake that I still see in many articles.

These four aspects represent the primary concerns of Cohen. In his article, he also suggests some modest alternatives to p-values and NHST. Probably the most adopted is the use of confidence intervals. Confidence intervals indicate a range of values that we can expect the effect to fall within, based upon the magnitude of the effect and standard error. If a confidence interval contains zero, then the effect is not significantly different from zero. In my experience, most researchers view confidence intervals the same as p-values, and they treat whether the interval contains zero as the on-off switch instead of (p < .05).
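
As an illustration of what a confidence interval adds, the sketch below computes an approximate 95% interval for a correlation using the Fisher z method. The correlation of .30 and the two sample sizes are hypothetical, and the function name fisher_ci is my own.

```python
import math

Z_95 = 1.959964  # standard normal critical value for a 95% interval


def fisher_ci(r, n, z_crit=Z_95):
    """Approximate confidence interval for a correlation (Fisher z method)."""
    center = math.atanh(r)                    # correlation on the Fisher z scale
    half_width = z_crit / math.sqrt(n - 3)    # critical value times standard error
    return math.tanh(center - half_width), math.tanh(center + half_width)


for r, n in ((0.30, 29), (0.30, 100)):
    low, high = fisher_ci(r, n)
    contains_zero = low <= 0 <= high
    print(f"r = {r:.2f}, n = {n:>3}: 95% CI [{low:.2f}, {high:.2f}], "
          f"contains zero: {contains_zero}")
```

Both rows describe the same observed correlation. The interval shows more than the on-off switch does: with the smaller sample it is wide enough to include zero, and with the larger sample it narrows around the same estimate.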

Once Cohen’s article was published, discussions of p-values grew even further. Several authors published responses. I won’t summarize all of them, but Cohen’s rejoinder (1995) downplayed most of the criticisms in a single-page article; however, one of the few responses that Cohen seemed to ponder was that of Baril and Cannon (1995). The authors criticized Cohen’s NHST examples and considered them “inappropriate” and “irrelevant.” Cohen replied that his examples were not intended to model NHST as used in the real world, but only to demonstrate how wrong conclusions can be when the logic of NHST is violated. Further, in response to a different article, Cohen claims that he does not question the validity of NHST but only its widespread misinterpretation.

My Two Cents

In general, I think p-values are alright. Before you gather your torches and pitchforks, let me explain. I see p-values as a quick sketch of studies’ results. A p-value can provide quick, easy information, but any researcher should know to look at effect sizes and the actual study results to fully understand the data. No one should see (p > .05) or (p < .05) and move on. There is so much more to know.

Also, while confidence intervals provide more information, their application usually provides the same results as p-values. Researchers often just look for confidence intervals that exclude zero and move on.

Lastly, p-values are a great first-step into understanding statistics. They aren’t too scary, and understanding p-values can lead to understanding more complex topics.

Summary

A p-value indicates the probability that data like ours would occur due to random chance alone, that is, if the null hypothesis were true. Usually, if the p-value is above 0.05, we do not accept the result as being different from random chance. If it is below 0.05, we accept the result as being more than random chance. While p-values are good first-looks at data, everyone should know how to interpret other statistical results, as they are much better descriptions of the data.

Keep looking out for Statistical Bullshit! And let me know if you have any questions or topics that you’d like to see on StatisticalBullshit.com by emailing me at MHoward@SouthAlabama.edu!

Note: This was originally posted to MattCHoward.com, but I felt it was particularly relevant to StatisticalBullshit.com.