Data dredging

11/26/2023

As a general rule, when we "adjust for multiple comparisons" we are essentially adjusting the null distribution of a statistical test to condition on all the testing that comes before or concurrently with that test. That is certainly going to be required in this case, and it will not be easy. The second test you mention sounds very suspicious in this context, and my view is that it would not be appropriate to run it without a further adjustment for multiple comparisons (which would be extremely complicated and possibly prohibitive).

There is good reason to believe that the presence or absence of antibodies to the bacteria would affect the association between the bacteria variable and the sickness outcome in the first test. (Indeed, there is a plausible causal relationship between these variables, which could be quite strong.) Consequently, conditional on the result of the first test, the null distribution of the second test would not be the same as if the first test had not been performed to get there. In practice, the conditional null distribution for the second test would be complicated, because it is conditional on the outcome of an optimisation in the first test involving multiple variables that are related to the variables in the second test.

In such circumstances, it is difficult to "adjust" the second test to take account of the first test, and doing so would require some heavy theoretical development. I wouldn't rule out the possibility that some clever statistician could come up with a testing sequence that properly adjusts for this, but it would be a difficult theoretical exercise that would probably constitute a publishable paper in its own right.

Because the concurrence of variables does not constitute information about their relationship (which could, after all, be merely coincidental), further analysis is required to yield any useful conclusions.
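The point about the second test's null distribution can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the analysis discussed above: every variable is pure noise, the "first test" is reduced to picking whichever candidate correlates most strongly with the outcome, and the "second test" is an unadjusted correlation test on a companion variable related to the selected candidate. All function names and parameters are hypothetical, and the p-value uses the Fisher z normal approximation.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def two_sided_p(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def second_test_false_positive_rate(n_candidates, n_obs=50, n_sims=300,
                                    alpha=0.05, seed=1):
    """Share of simulations in which the unadjusted second test rejects,
    even though the outcome is independent of every variable."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
        best_r, best_companion = 0.0, None
        for _ in range(n_candidates):
            candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
            # Companion variable: related to the candidate, unrelated to the outcome.
            companion = [c + rng.gauss(0, 0.5) for c in candidate]
            r = pearson_r(candidate, outcome)
            # First stage: keep the candidate with the strongest sample correlation.
            if best_companion is None or abs(r) > abs(best_r):
                best_r, best_companion = r, companion
        # Second stage: test the selected candidate's companion with no adjustment.
        p = two_sided_p(pearson_r(best_companion, outcome), n_obs)
        rejections += p < alpha
    return rejections / n_sims


if __name__ == "__main__":
    print("no selection (1 candidate):", second_test_false_positive_rate(1))
    print("selection among 10 candidates:", second_test_false_positive_rate(10))
```

With a single candidate there is no selection, so the rejection rate should sit near the nominal level. With selection among several candidates the companion variable inherits part of the selected correlation, so the unadjusted second test should reject more often than the nominal level suggests.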

Although data dredging is often used improperly, it can be a useful means of finding surprising relationships that might not otherwise have been discovered. Many variables may be related through chance alone; others may be related through some unknown factor. To make a valid assessment of the relationship between any two variables, further study is required in which isolated variables are contrasted with a control group. Data dredging is sometimes used to present an unexamined concurrence of variables as if it led to a valid conclusion, prior to any such study.

Data dredging is sometimes described as "seeking more information from a data set than it actually contains." It sometimes results in relationships between variables being announced as significant when, in fact, the data require more study before such an association can legitimately be determined. Sometimes conducted for unethical purposes, data dredging often circumvents traditional data mining techniques and may lead to premature conclusions. The traditional scientific method, in contrast, begins with a hypothesis and follows with an examination of the data.
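How chance alone produces "significant" relationships can be sketched with a short simulation. This is an illustrative assumption-laden example rather than a recommended procedure: the data are pure random noise, the function names are hypothetical, p-values come from the Fisher z normal approximation, and the Bonferroni threshold is shown only as one simple, conservative correction.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def dredge(n_obs=200, n_vars=100, alpha=0.05, seed=0):
    """Correlate many unrelated noise variables with a noise outcome.

    Returns (hits at the naive threshold, hits at the Bonferroni threshold).
    Every hit is a false positive, because no real relationship exists.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    p_values = []
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        p_values.append(approx_p_value(pearson_r(candidate, outcome), n_obs))
    naive_hits = sum(p < alpha for p in p_values)
    bonferroni_hits = sum(p < alpha / n_vars for p in p_values)
    return naive_hits, bonferroni_hits


if __name__ == "__main__":
    naive, corrected = dredge()
    print("unadjusted 'discoveries':", naive)
    print("after Bonferroni correction:", corrected)
```

At a 0.05 threshold, roughly one in twenty unrelated variables is expected to look significant, so scanning a hundred of them will usually turn up a handful of spurious findings. Dividing the threshold by the number of tests removes most of them, at the cost of reduced power to detect real effects.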

Data dredging (data fishing) - Data dredging, sometimes referred to as "data fishing," is a data mining practice in which large volumes of data are analyzed in search of any possible relationships between variables.
