What I’ve Been Doing

Hi all!  No Statistical Bullshit this week.  Instead, I wanted to share what I’ve been working on recently…in addition to my academic research and teaching, of course.

I’ve been developing my LLC!  Its name is MCHoward Consulting, LLC, and it is focused on four primary services: selection, assessment, training, and analytics.  However, I can provide several other business solutions by partnering with other consulting firms.

Many of the issues on StatisticalBullshit.com are centered on these four business domains.  If you’ve ever read a post on this website and noticed that your business may be committing Statistical Bullshit, feel free to email me about my consulting services (Contact@MCHowardConsulting.com).  I would be happy to provide solutions to your business needs, especially when it comes to preventing Statistical Bullshit!

For more information about MCHoward Consulting, LLC, please use the following link: https://MCHowardConsulting.com/

If you have any questions or stories about Statistical Bullshit, please email me at MHoward@SouthAlabama.edu.  Until next time, watch out for Statistical Bullshit!

More Issues with P-Values

What did Cohen (1994) have to say in The Earth is Round (p < .05)?

Few statistical topics have spurred as much controversy as p-values.  For this reason, I felt that another post on StatisticalBullshit.com about p-values could be helpful to all readers – and myself!  I always learn new things about Statistical Bullshit when I review p-values, and I hope you learn a little bit from reading this post.  If you have any questions or comments about this post (or anything else), please email me at MHoward@SouthAlabama.edu.  I’ll do my best to reply ASAP.


An overview of Statistical Bullshit associated with p-values can start with Cohen’s (1994) classic article, “The Earth is Round (p < .05).” The title is meant to be a pun on p-values. Given that we cannot obtain multiple measurements about the roundness of the earth and apply statistical tests (under normal approaches), it is impossible to determine the statistical significance of the earth’s roundness – but we know it is round. So, does that mean that there are certain important conclusions that p-values cannot derive? Probably so!

The first line of Cohen’s abstract describes null hypothesis significance testing (NHST), which makes research inferences based on p-values, as a “ritual” which makes a “mechanical dichotomous decision around a sacred .05 criterion.” Cohen makes his disdain for p-values and NHST obvious from the beginning. To immediately support his claims, in the introduction, Cohen also cites previous authors who state that “everybody knows” the concerns with p-values, and that they are “hardly original.” My favorite quotation is from Meehl, who claims that NHST is “a potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring” (1967, p. 265). Quite the statement! And the earliest of these citations is from 1938! Researchers have been foaming at the mouth over p-values for a long time.

After the initial damnation of p-values, Cohen gets into the actual concerns. First, he notes that people often misunderstand p-values. The logic of NHST, when a p-value is significant, runs as follows:

If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.

Cohen dislikes this logic and notes that it can lead to problematic conclusions. In the article, Cohen provides a slightly unusual situation that illustrates the problem in the above reasoning. I am going to provide a similar situation that results in the same conclusion, but it (hopefully) will make more sense to readers than Cohen’s example.

Most employees of a company are not on its Board of Directors. But let’s say that we sampled a random person, and they were on the Board of Directors within a company. Using the logic of NHST, we can state the following:

If a person is employed, then (s)he is probably not on the Board of Directors.
This person is on the Board of Directors.
Therefore, (s)he is probably not employed.

We know that, if an individual is on the Board of Directors, then they are employed. So, the logic of NHST led us to an inappropriate conclusion.

Second, Cohen notes that the meaning of a p-value, “the probability that the data could have arisen if the null hypothesis is true,” is not the same as “the probability that the null hypothesis is true given the data.” Unfortunately, researchers often mistake p-values for the latter, which leads to some problematic inferences.

To demonstrate the problem with this thinking, Cohen provides an example with schizophrenia. Assume that schizophrenia arises in two percent of the population, and assume we have a tool that has a 95% accuracy in diagnosing schizophrenia and even a 97% accuracy in declaring normality. Not bad! When we use the tool, our null hypothesis is that an individual is normal, and the research hypothesis is that the individual is schizophrenic. So, the (problematic) logic is: when the test is significant, the null hypothesis is not true.

However, when we calculate the math for a sample of 1000, some problems arise. Particularly, in a sample of 1000, the number of schizophrenics is most likely 20. The tool would correctly identify 19 of them and label one as normal. This result is not overly concerning. Alternatively, in the sample of 1000, the number of normal individuals is most likely 980. The tool would, most likely, correctly identify 951 normal individuals as normal (980 * .97); however, 29 individuals would be labeled as schizophrenic! So, of the 48 people identified as schizophrenic, about 60 percent would actually be normal! The problem in this example is that we assume the null hypothesis is false given the data (a significant result), when we should instead think about the probability that the data could have arisen (a five percent chance) given that the null hypothesis is true.
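If you would like to check the arithmetic yourself, here is a quick Python sketch of Cohen’s example. The base rate and accuracy figures are the ones given above; the rounding to whole people is mine.

```python
# Reproducing Cohen's (1994) schizophrenia arithmetic for a sample of 1,000 people.
n = 1000
base_rate = 0.02      # 2% of the population is schizophrenic
sensitivity = 0.95    # accuracy in diagnosing schizophrenia
specificity = 0.97    # accuracy in declaring normality

schizophrenic = round(n * base_rate)                  # 20 people
normal = n - schizophrenic                            # 980 people

true_positives = round(schizophrenic * sensitivity)   # 19 correctly flagged
false_positives = round(normal * (1 - specificity))   # 29 normals incorrectly flagged

flagged = true_positives + false_positives            # 48 people flagged as schizophrenic
print(flagged)                                        # 48
print(false_positives / flagged)                      # ~0.60: most flagged people are actually normal
```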

Third, Cohen notes that the null hypothesis is almost always that no effect exists. But is this stringent enough to test hypotheses? Cohen argues that it is not. He notes that, given enough participants, anything can be significantly different from nothing. Also, given enough participants, the relationship of anything with anything else is greater than nothing.  So, if authors find (p > .05), they can just add more participants until (p < .05).
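To see this concern in action, here is a minimal Python sketch using scipy. The numbers are made up: the observed effect is held at a trivial 0.05 standard deviations, and only the sample size grows.

```python
# A tiny, fixed effect becomes "significant" once the sample is large enough.
from scipy import stats

effect = 0.05  # observed mean difference in standard-deviation units (practically nothing)

for n in [100, 1_000, 10_000]:
    t, p = stats.ttest_ind_from_stats(mean1=effect, std1=1.0, nobs1=n,
                                      mean2=0.0, std2=1.0, nobs2=n)
    print(f"n per group = {n:>6,}: p = {p:.4f}")

# n per group =    100: p ~ .72
# n per group =  1,000: p ~ .26
# n per group = 10,000: p ~ .0004  <- same effect, now "significant"
```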

Fourth, Cohen notes that, when only observing p-values, results cannot be “very statistically significant” or “strongly statistically significant,” although authors like to make such claims when the p-value is less than 0.01, 0.001, or 0.0001. Instead, results can only be statistically significant when analyzing p-values. This is because p-values are derived from magnitude, variance, and sample size. If one p-value is significant at the 0.05 level and another at the 0.001 level, it does not necessarily mean that the magnitude of the second effect was larger. Instead, the magnitude of the effect can be identical, and the second sample may simply be larger. As the sample size does not reflect the effect itself, it is inappropriate to say that the former is only “statistically significant” while the latter is “very statistically significant.” This is a mistake that I still see in many articles.
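Here is a small illustration of that fourth point in Python: two hypothetical studies observe the exact same correlation (r = .10), but the larger one earns the smaller p-value. The sample sizes are invented for illustration.

```python
# Identical effect size, different sample sizes, different "levels" of significance.
from scipy import stats

r = 0.10  # the observed correlation in both hypothetical studies

for n in [400, 1_500]:
    t = r * ((n - 2) / (1 - r**2)) ** 0.5    # t statistic for a correlation coefficient
    p = 2 * stats.t.sf(abs(t), df=n - 2)     # two-tailed p-value
    print(f"n = {n:>5,}: p = {p:.4f}")

# n =   400: p ~ .046  ("statistically significant")
# n = 1,500: p < .001  (not "very significant" -- the effect is exactly the same)
```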

These four aspects represent Cohen’s primary concerns. In his article, he also suggests some modest alternatives to p-values and NHST. Probably the most widely adopted is the use of confidence intervals. Confidence intervals indicate a range of values that we can expect the effect to fall within, based upon the magnitude of the effect and its standard error. If a confidence interval contains zero, then the effect is not significantly different from random chance. In my experience, most researchers treat confidence intervals the same as p-values, using “contains zero” as the on-off switch instead of (p < .05).
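As a rough sketch of what that looks like in practice, here is a 95% confidence interval for a made-up mean difference. The effect, standard error, and degrees of freedom are all invented for illustration.

```python
# A 95% confidence interval conveys the effect's magnitude and its precision,
# which a bare p-value does not.
from scipy import stats

effect = 0.30      # observed mean difference (made up)
std_error = 0.12   # standard error of that difference (made up)
df = 98            # degrees of freedom, e.g., two groups of 50

t_crit = stats.t.ppf(0.975, df)               # critical t for a 95% interval
lower = effect - t_crit * std_error
upper = effect + t_crit * std_error
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")  # roughly [0.06, 0.54]

# The interval excludes zero (so p < .05), but it also shows how imprecise the estimate is.
```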

Once Cohen’s article was published, discussions of p-values grew even further. Several authors published responses. I won’t summarize all of them, but Cohen’s rejoinder (1995) downplayed most of the criticisms in a single-page article; however, one of the few responses that Cohen seemed to ponder was that of Baril and Cannon (1995).   The authors criticized Cohen’s NHST examples and considered them “inappropriate” and “irrelevant.” Cohen replied that his examples were not intended to model NHST as used in the real world, but only to demonstrate how wrong conclusions can be when the logic of NHST is violated. Further, in response to a different article, Cohen claims that he does not question the validity of NHST but only its widespread misinterpretation.

My Two Cents

In general, I think p-values are alright. Before you gather your torches and pitchforks, let me explain. I see p-values as a quick sketch of studies’ results. A p-value can provide quick, easy information, but any researcher should know to look at effect sizes and the actual study results to fully understand the data. No one should see (p > .05) or (p < .05) and move on. There is so much more to know.

Also, while confidence intervals provide more information, their application usually provides the same results as p-values. Researchers often just look for confidence intervals outside of zero and move on.

Lastly, p-values are a great first-step into understanding statistics. They aren’t too scary, and understanding p-values can lead to understanding more complex topics.

Summary

A p-value indicates the probability that the observed data (or data even more extreme) would occur due to random chance alone. Usually, if the p-value is above 0.05, we do not accept the result as being different from random chance. If it is below 0.05, we accept the result as being more than random chance. While p-values are a good first look at data, everyone should know how to interpret other statistical results, as they are much better descriptions of the data.

Keep looking out for Statistical Bullshit!  And let me know if you have any questions or topics that you’d like to see on StatisticalBullshit.com by emailing me at MHoward@SouthAlabama.edu!

Note:  This was originally posted to MattCHoward.com, but I felt it was particularly relevant to StatisticalBullshit.com.

Is Big Worse than Bad?

Today’s post is about the concept of how being big could be worse than being bad with regard to Equal Employment Opportunity (EEO) enforcement policies.

So far on StatisticalBullshit.com, I’ve written about general Statistical Bullshit, Statistical Bullshit that I’ve come across in consulting, and Statistical Bullshit that readers have sent in to the website.  I don’t *think* that I’ve written about Statistical Bullshit that was pointed out by an academic article.  For this reason, today’s post is about the concept of how being big could be worse than being bad with regard to Equal Employment Opportunity (EEO) enforcement policies.  Most of the material for this post comes from Jacobs, Murphy, and Silva’s (2012) article, entitled “Unintended Consequences of EEO Enforcement Policies: Being Big is Worse than Being Bad,” which was published in the Journal of Business and Psychology.  Rick Jacobs was my advisor at Penn State, so I am happy to have his article discussed on StatisticalBullshit.com.  For more information about this concept, please email me at MHoward@SouthAlabama.edu or check out Jacobs et al. (2012).


As stated by Jacobs et al. (2012), “The Equal Employment Opportunity Commission (EEOC) is the chief Federal agency charged with enforcing the Civil Rights Acts of 1964 and 1991 and other federal laws that forbid discrimination against a job applicant or an employee because of the person’s race, color, religion, sex (including pregnancy), national origin, age (40 or older), disability, or genetic information” (p. 467).  In other words, the EEOC ensures that businesses do not discriminate against protected classes, and this includes in business employment practices.

When a disproportionately low number of people from a protected class are hired, most often relative to the majority class, this is called disparate impact.  But how do we know whether a “disproportionately low number” has occurred?  In EEO cases, there are many methods, but the 80% rule and statistical significance testing are among the most popular.

The 80% rule specifies that disparate impact occurs when members of a protected class are hired at a rate that is less than 80% of the majority class’s rate.  Let’s take a look at the following example to figure out what this means:

                    Hired    Applied    Ratio        80% Rule
Caucasian             20       40       1:2 (.50)    .50 * .80 = .40
African American       5       20       1:4 (.25)    .40 > .25

In this example, 40 Caucasian people and 20 African Americans applied for the same job.  The organization selected 20 Caucasians and 5 African Americans for the job.  This results in 50% of the Caucasians being hired, but only 25% of the African Americans being hired.  To determine whether disparate impact occurred, we take .50 (ratio for Caucasians) and multiply it by .80 (80% rule).  This results in .40.  We then compare this number to the ratio of African Americans hired, .25.  Since .40 is greater than .25, we can determine that disparate impact occurred on the basis of the 80% rule.
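For readers who prefer code, here is a minimal Python helper for the 80% rule check, run on the example above:

```python
# The 80% (four-fifths) rule: flag disparate impact when the protected class's
# selection rate falls below 80% of the majority class's selection rate.
def eighty_percent_rule(hired_protected, applied_protected, hired_majority, applied_majority):
    protected_rate = hired_protected / applied_protected
    majority_rate = hired_majority / applied_majority
    return protected_rate < 0.80 * majority_rate

# The example above: 5 of 20 African American applicants vs. 20 of 40 Caucasian applicants.
print(eighty_percent_rule(5, 20, 20, 40))   # True: .25 < .40, so disparate impact is flagged
```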

Although it may seem relatively simple, the 80% rule works quite well across most situations.  But what about the other method – statistical significance testing?

Many different tests could be used to identify disparate impact, but the chi-square test may be the simplest.  The chi-square test can be used to determine whether the association between two categorical variables is significant, such as whether the association between race and hiring decisions is significant.  So, we can use this to test whether disparate impact may have occurred.

To do so, you can use the following calculator: https://www.graphpad.com/quickcalcs/contingency2/ .  Let’s enter the data above, which would look like the following in a chi-square calculator:

                    Hired    Not Hired
Caucasian             20        20
African American       5        15

The resultant p-value is .06, which is greater than .05.  Not statistically significant!  Although this is the exact same data as the 80% rule example, the chi-square test determined that it was not a case of disparate impact.  Interesting!

But what happens when we double the size of each group?  The 80% rule table would look like this:

                    Hired    Applied    Ratio        80% Rule
Caucasian             40       80       1:2 (.50)    .50 * .80 = .40
African American      10       40       1:4 (.25)    .40 > .25

Again, the resultant ratio for Caucasians is .50, which is .40 when multiplied by .80 (80% rule).  The resultant ratio for African Americans is .25, which is smaller than .40.  This suggests that disparate impact occurred on the basis of the 80% rule.

On the other hand, let’s enter this data into the chi-square calculator, which would look like this:

                    Hired    Not Hired
Caucasian             40        40
African American      10        30

The resultant p-value is .009, which is much less than .05.  Statistically significant!  While the ratios are identical for the two examples, the latter chi-square test determines that disparate impact occurred.

This is the idea behind “Being Big is Worse than Being Bad.”  Although both examples had the same ratio, and thereby were just as bad, the chi-square test indicated that disparate impact only occurred in the latter example, which was bigger.  Thus, significance testing has concerns when identifying disparate impact, because the sample size strongly influences whether a result is significant or not.
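Here is a short Python sketch of that comparison using scipy’s chi-square test (a Pearson chi-square without the Yates correction, which matches the p-values reported above; the online calculator linked earlier may apply corrections that shift the numbers slightly):

```python
# Same selection ratios, different sample sizes, different chi-square verdicts.
from scipy.stats import chi2_contingency

small = [[20, 20],   # Caucasian: hired, not hired
         [ 5, 15]]   # African American: hired, not hired
big   = [[40, 40],   # identical ratios, but every cell doubled
         [10, 30]]

for label, table in [("small", small), ("big", big)]:
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(f"{label} sample: chi-square = {chi2:.2f}, p = {p:.3f}")

# small sample: p ~ .064 (not significant)
# big sample:   p ~ .009 (significant) -- being big is worse than being bad
```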

So, do we just apply the 80% rule?  Not necessarily.  Jacobs et al. (2012) call for “a more dynamic definition of adverse impact, one that considers sample size in light of other important factors in the specific selection situation” (p. 470), and they also call for a less-simplified view of disparate and adverse impact.  While the 80% rule can address some of the problems that significance testing encounters, it cannot satisfy all the needs within this call.

Issues like these are why we need more statistics-savvy people in the world.  Disparate and adverse impact are huge issues that affect millions of people.  And many of these decisions are not made by statisticians.  Instead, they are made by courts and companies.  Even if you aren’t interested in becoming a statistician, the world still needs people who understand statistics – and know how to watch out for Statistical Bullshit in significance testing!

That’s all for today.  If you have any questions, please email me at MHoward@SouthAlabama.edu.  Until next time, watch out for Statistical Bullshit!

Excel Statistics Help

Sorry, but there is no Statistical Bullshit this week!  No, the world did not run out of it – trust me, there is still plenty.  Instead, I’ve been developing a section of another one of my websites: MattCHoward.com.  The Statistics Help Page of my academic website has been getting lengthier and (fortunately) more popular.  One of the most common requests that I receive is for guides on calculating statistics in Excel.  This is understandable.  Other stats programs can be expensive, whereas almost everyone has access to Excel in their workplace or home.  So, I’ve been spending time writing short guides on my Excel Statistics Help Page.

A primary method to avoid Statistical Bullshit is to understand statistics yourself.  If you are unsure about calculating statistics in Excel, be sure to check out this page.  I’ll be updating it regularly throughout the current academic semester.  So, if you need a guide on a certain topic, let me know by emailing MHoward@SouthAlabama.edu.  I’d be happy to create a guide sooner rather than later!

Correlation Does NOT Equal Causation

Your variables may be related, but does one really cause the other?

Most readers have probably heard the phrase, “correlation does not equal causation.”  Recently, however, I heard someone confess that they’ve always pretended to know the significance of this phrase, but they truly didn’t know what it meant.  So, I thought that it’d be a good idea to make a post on the meaning behind “correlation ≠ causation.”

Imagine that you are the president of your own company.  You notice one day that your highly paid employees perform much better than your lower-paid employees.  To test whether this is true, you create a database that includes employee salaries and their performance ratings.  What do you find?  There is a strong correlation between employee pay and their performance ratings.  Success!  Based on this information, you decide to improve your employees’ performance by increasing their pay.  You’re certain that this will improve their performance. . .right?

Not so fast.  While there is a correlation between pay and performance, there may not be a causal relationship between the two – or, at least, not one in which pay directly influences performance.  It is fully possible that increasing pay has little effect on performance.  But why is there a correlation?  Well, it is also possible that employees get raises due to their prior performance, as the organization has to provide benefits in order to keep good employees.  Because of this, an employee’s high performance may not be due to their salary; rather, their salary is due to their prior high performance.  This results in current performance and pay being strongly correlated, but not causally related such that pay predicts performance.  In other words, current performance and pay may be correlated because they have a common antecedent (past performance).

This is the idea behind the phrase, “correlation does not equal causation.”  Variables do not necessarily have a causal relationship just because they are correlated.  Instead, many other types of underlying relationships could exist, such as both having a common antecedent.

Still don’t quite get it?  Let’s use a different example.  Prior research has shown that ice-cream sales and murder rates are strongly correlated, but does that mean that ice cream causes people to murder each other?  Hopefully not.  Instead, warm weather (i.e., the summer) causes people to (a) buy ice cream and (b) be more aggressive.  This drives up both ice-cream sales and murder rates.  Once again, these two variables are correlated because they have a common antecedent – not because there is a causal relationship between the two.
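If you want to see the mechanism yourself, here is a made-up simulation in Python: warm weather drives both ice-cream sales and aggression, and the two outcomes end up strongly correlated even though neither causes the other.  Every number below is invented.

```python
# A common antecedent (temperature) produces a correlation between two variables
# that have no causal relationship with each other.
import numpy as np

rng = np.random.default_rng(42)
n_days = 365

temperature = rng.normal(20, 8, n_days)                              # the common antecedent
ice_cream_sales = 50 + 3.0 * temperature + rng.normal(0, 10, n_days)
murders         = 2 + 0.1 * temperature + rng.normal(0, 1, n_days)

r = np.corrcoef(ice_cream_sales, murders)[0, 1]
print(f"Correlation between ice-cream sales and murders: r = {r:.2f}")
# A sizeable positive correlation, produced entirely by the shared cause.
```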

Correlation does not imply causation

Hopefully you now understand why correlation does not equal causation.  If you don’t, please check out one of my favorite websites:  Spurious Correlations.  This website is a collection of very significant correlations that almost assuredly do not have a causal relationship – thereby providing repeated examples of why correlation does not equal causation.  If you do understand, beware of this fallacy in the future!  Organizations can make disastrous decisions based on falsely assuming causality.  Make sure that you are not one of these organizations!

Until next time, watch out for Statistical Bullshit!  And email me at MHoward@SouthAlabama.edu with any questions, comments, or stories.  I’d love to include your content on the website!

Bullshit Measurement

Are you measuring what you think you’re measuring? Could you be measuring something else entirely?

Accurate measurement of variables is essential for business success.  Sometimes, it’s fairly easy to record these variables – sales, revenue, profit.  Other times, it can be very, very difficult.  For example, let’s say that you want to hire employees who are smart and conscientious.  How can we measure intelligence and conscientiousness?

Well, a good starting point is to develop a test or survey.  Many intelligence tests exist with varying levels of sophistication and accuracy, and you could pay to give these tests to applicants.  Many self-report surveys also exist that can measure conscientiousness, and you could pay to give these tests to applicants, too.  But what if you don’t want to use one of these existing measures?  What’s the worst that could happen?

In this post, we won’t talk about the worst that could happen, but we’ll discuss a pretty bad outcome: when your measure inadvertently gauges the wrong construct, which could result in a lawsuit.

I should also note that this example comes from an actual consulting experience that I encountered.  The names have been changed, but remember that these things actually happen in industry!


I was once hired along with a full team to review the new selection system of a trendy company.  Let’s call them X-Corp.  X-Corp wanted their selection system to measure a construct that they invented: “the ideal X-Corp employee.”  They made a list of the ideal X-Corp employee characteristics.  It included common constructs like intelligence and conscientiousness, but it also included some unorthodox constructs: hip, stylish, savvy, sleek, and so forth.  X-Corp argued that the ideal employee needed to appeal to any potential customer, and therefore needed to have these characteristics; however, my team was already doubtful about the business relevance of these constructs.

Even more concerning, X-Corp felt that their survey had to attract people to work for X-Corp.  For this reason, it couldn’t be a traditional survey.  It had to be different and exciting.  Once again, we were doubtful about how exciting a selection survey could be.

When we saw the survey to measure “the ideal X-Corp employee,” we began to worry even more.  The first question looked something like this:

Bullshit Measurement 1

What?

The text of the item read, “Using the scale, please indicate whether you are more like a sports car or a hybrid/electric car.”

…What?

Immediately, we asked X-Corp what this item was meant to measure.  Sure enough, they just said “the ideal X-Corp employee.”  We asked which subdimension, specifically, the item was meant to measure.  When they couldn’t respond, we realized that they didn’t really have an idea.  It seemed that they just put things in their survey that they thought would be a good idea, without really thinking about the ramifications.

Do you think this item would help identify good employees?  Well, we first have to ask what the “correct” answer is.  According to X-Corp, the correct answer was being more like a hybrid/electric car.  So, anyone who indicated that they were more like a sports car got the item wrong.  Do you think this is fair?  More importantly, do you think those who feel more like a “hybrid/electric car” are necessarily better employees than those who feel more like a “sports car”?  I would guess that the answer is probably not.  There are probably many sports car people who are more intelligent, conscientious, hip, savvy, and so on when compared to hybrid/electric car people.  Thus, this item probably fails to measure “the ideal X-Corp employee.”

That item was bad, but it wasn’t the worst.  The worst was probably the following item:

Bullshit Measurement 2

Once again, what?

The text of the item read, “Using the scale, please indicate whether you are more or less like Kanye West.”

Once again…what?

X-Corp claimed that Kanye West was too narcissistic, and anyone who felt that they were like Kanye was not welcome at X-Corp.  Do you think that Kanye people are inherently worse than non-Kanye people?  Once again, I am guessing that the answer is probably not.  Kanye people are probably just as good as non-Kanye people, and perhaps even better in some regards (e.g., more creative or hip).  But can you think of anything else that this item might inadvertently measure?  Let’s look at the graph below, which is similar to the actual results.

Bullshit Measurement Graph

As some of you may have guessed, African Americans were much more likely to see themselves as similar to Kanye than Caucasians were.  This makes sense, as Kanye himself is African American.  Thus, this item partially measures the applicant’s ethnicity.

Remember when I said that those responding that they were more like Kanye were rated as worse applicants?  If this survey went live, that would mean that African Americans would automatically be penalized, thereby resulting in adverse impact.  This would almost assuredly result in a lawsuit that X-Corp could not justifiably defend against – or, at the very least, X-Corp would have a very hard time showing that the Kanye question actually reflected job performance.  This could have cost the company millions of dollars!
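To make the adverse impact concrete, here is a hypothetical back-of-the-envelope calculation in Python.  The response rates are invented, but they mirror the pattern in the graph above: if “more like Kanye” is scored as wrong and African American applicants choose that answer far more often, the 80% rule is violated almost automatically.

```python
# Hypothetical pass rates if the Kanye item screens out "more like Kanye" responses.
kanye_rate = {"Caucasian": 0.20, "African American": 0.60}     # invented response rates
pass_rate = {group: 1 - rate for group, rate in kanye_rate.items()}

majority_pass = pass_rate["Caucasian"]            # 0.80
protected_pass = pass_rate["African American"]    # 0.40

print(f"Selection ratio: {protected_pass / majority_pass:.2f}")       # 0.50
print(f"80% rule violated: {protected_pass < 0.80 * majority_pass}")  # True -> adverse impact
```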

In the end, my team strongly recommended that the company not use its selection survey and instead use a traditional survey.  The company wasn’t happy, and we were never asked to work with them again.  But they did guarantee that they would not use their selection system.  While it wasn’t the most satisfying result, I was happy that we were able to stop another case of Statistical Bullshit!

If you have any questions or comments about this story, feel free to contact me at MHoward@SouthAlabama.edu .  Also, feel free to contact me if you have any Statistical Bullshit stories of your own.  I’d love to include them on StatisticalBullshit.com!

Bullshit Charts

Is Statistical Bullshit possible when no numbers are involved?

Possibly the most widespread form of Statistical Bullshit is Bullshit Charts.  Charts are meant to provide clear and easy-to-read information, but Bullshit Charts are designed to mislead the reader – whether intentionally or unintentionally.  Often, these charts will alter the common cues that the reader expects, in the hope that the reader will not notice the subtle changes.  In doing so, the chart is not “lying” per se, but it is certainly Statistical Bullshit!

Bullshit Charts are common in situations with little time to process all relevant information, such as during a commercial or a business meeting.  And I’m sure you’ve experienced this before.  Maybe a commercial presented a chart for a split second, showing that the product is superior to its competitors.  It may have looked reasonable, but if you had paused the TV, you would have seen that the x- or y-axis was mislabeled.  In other words, it was indeed a misleading Bullshit Chart.
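As a simple illustration of one of the most common tricks, here is a short matplotlib sketch (with made-up ratings): truncating the y-axis makes a one-point difference look like a landslide.

```python
# The same data plotted twice: once with a truncated y-axis, once with a full one.
import matplotlib.pyplot as plt

products = ["Our product", "Competitor"]
ratings = [91, 90]   # nearly identical (made-up numbers)

fig, (misleading, honest) = plt.subplots(1, 2, figsize=(8, 3))

misleading.bar(products, ratings)
misleading.set_ylim(89.5, 91.5)          # truncated axis: the 1-point gap fills the chart
misleading.set_title("Bullshit version")

honest.bar(products, ratings)
honest.set_ylim(0, 100)                  # full axis: the difference nearly disappears
honest.set_title("Honest version")

plt.tight_layout()
plt.show()
```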

Below are some of my favorite examples of Bullshit Charts.  The Statistical Bullshit should be apparent in each of these charts, but please email me at MHoward@SouthAlabama.edu if you have any questions or comments about these charts.  Until next time, watch out for Statistical Bullshit!


1.  Need to make your argument seem more convincing?  Just give yourself a bigger slice of the pie no matter what the data shows…

Bullshit Charts 1

2.  Again, just change the distribution of the pie to help your case!  Or make up the data, as these labels and percentages seem to not make any sense at all…

Bullshit Charts 2

3.  Or, just ignore the size of the bars.

Bullshit Charts 3

4.  Does the data disprove your claim?  Just flip the chart upside down to make it seem like you’re correct!

Bullshit Charts 4

5.  Although those in the Philippines may only be ~.2 meters shorter than those in The Netherlands, you can always draw them as about 1/3rd the size to prove a point…

Bullshit Charts 5

6.  Again, you could just ignore the size of the bars altogether.

Bullshit Charts 6

7.  I’ve seen this trend catching on more recently.  Three-dimensional charts are often difficult to read.  If you want to prove a point, it is rarely a good idea to use 3-D charts.

Bullshit Charts 7

8.  Then again, some two-dimensional charts aren’t much better…

Bullshit Charts 8

9.  So, sometimes it’s just easiest to go back to giving yourself a bigger slice of the pie.

Bullshit Charts 9

10.  If all else fails, just give your chart nonsense labels and hope for the best!

Bullshit Charts 10

Sources for these and other Bullshit Charts:

https://www.reddit.com/r/dataisugly/

https://www.reddit.com/r/shittydataisbeautiful/

https://www.reddit.com/r/badstats/