I met with a psychologist and professor I knew in high school the other day, and one of the topics that came up in conversation was the attempt to gender-neutralize standardized psychological tests. This is particularly an issue with politically charged tests like the IQ test. I hadn't been aware this was done, but apparently standard IQ tests are constructed so that on no question does one gender perform substantially better than the other (although I'm not sure whether the cutoff is statistical significance or something else). I forget exactly what this procedure is called, something like "item-pairing." The thinking is that if you start with a large pool of questions that purport to test for general intelligence, and some of these questions are answered correctly much more often by people of one gender, then those questions must be biased in some way, so you reject them.

There are a lot of issues here. One preliminary question you might ask is: whom does this procedure help? I don't know the data, but my suspicion is that men probably score better on some types of questions and women score better on other types. Overall, men score slightly better on the SAT, for instance, so one might infer that more of the questions showing a gender disparity favor men. This is obviously not necessarily true, though: men may show slightly stronger performance across most questions, while women are the ones showing clearly superior performance on certain types of questions. In that case the rejected questions would mostly be the ones favoring women. Therefore this protocol doesn't necessarily serve a moderating function; it could potentially exaggerate already existing performance gaps.
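To make that last point concrete, here is a toy calculation. The percent-correct figures and the 10-point cutoff are numbers I made up for illustration, not real test data, and the screening rule is only my guess at how such a procedure might work. Men lead by a small margin on three items, women lead by a wide margin on one, and only the wide-margin item gets rejected.

```python
# Hypothetical percent-correct figures per item as (men, women); not real data.
items = {
    "A": (66, 60),
    "B": (61, 55),
    "C": (76, 70),
    "D": (50, 62),  # the one item where women do clearly better
}
THRESHOLD = 10  # assumed cutoff: reject items whose gap exceeds 10 points


def mean_gap(pool):
    """Average (men - women) gap in percent correct across the items in pool."""
    return sum(men - women for men, women in pool.values()) / len(pool)


def screen(pool, threshold):
    """Keep only the items whose absolute gender gap is within the threshold."""
    return {
        name: (men, women)
        for name, (men, women) in pool.items()
        if abs(men - women) <= threshold
    }


kept = screen(items, THRESHOLD)
gap_before = mean_gap(items)  # gap on the full pool
gap_after = mean_gap(kept)    # gap after rejecting the "biased" item
print(sorted(kept), gap_before, gap_after)
```

With these made-up numbers the average gap in favor of men goes from 1.5 points per item to 6 points per item once item D is thrown out, so the screening widens the overall gap instead of narrowing it.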

The real question is not what the statistical basis for carrying out such a protocol is, but what the conceptual basis is. You could always form arbitrary groups of people who perform differently on any given question, and then argue that the test must be biased against them in some way. If you carried this to its logical conclusion, you would end up in a situation where all individual variability in performance is banished - and this is obviously absurd. Of course, the argument will be made that genders are not an arbitrarily drawn group. But a group is only significant to the extent that it is correlated with other, confounding factors. The fact that members of a group perform worse on a certain question is not definite evidence that bias exists; it may just as well indicate that the question tests for something that members of that group are less able in.

The permissibility of this analysis depends on your fundamental philosophical view of what the test is testing. If you thought there was a single quantity that all test questions were designed to measure, then it would be logical to accept only a test whose gender answering patterns show no significant variation across questions, because presumably all questions are testing for the same quantity, whose relationship with the genders should be static. Note that this doesn't necessarily mean that the set point of correct responses should be 50-50; what if one gender really does have slightly more of the quantity that the test measures? A "fair" test would take into account that this gender will consistently score a given amount better on any question that validly measures that quantity. But in reality it's hard to know what this set point is without the input of a test, so you're back to square one. On the other hand, if you thought the test was measuring multiple discrete skills, you'd expect these to vary between genders. Hopefully, when it comes to IQ tests, political correctness isn't intervening and saying, "well, let's just assume that men and women are absolutely the same cognitively, and therefore we'll reject any questions (or groups of questions) that show one gender performing better than the other as either ineffective or flawed," or holding to some presumptuous, anti-scientific "intelligence is gender-neutral" dictum. This would be a good way to create a test that is gender-equalized in terms of scores, if that was your goal for some reason, but it wouldn't necessarily create gender-impartial questions. Interestingly, the people who would want to make each question as gender-neutral as possible would probably also tend to be the same people who adhere to the theory of multiple intelligences - which is logically inconsistent.
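The difference between a 50-50 baseline and a nonzero set point can be shown with a small sketch. The per-question gaps and the tolerance are invented for illustration, and the set point here is simply the average gap across the same questions, which is exactly the circularity I mentioned: you need the test to estimate it.

```python
# Hypothetical (men - women) gaps in percent correct per question; not real data.
item_gaps = {"Q1": 7, "Q2": 8, "Q3": 6, "Q4": 7, "Q5": -3}
TOLERANCE = 5  # assumed allowable deviation, in percentage points


def set_point(gaps):
    """Estimate the common gap as the plain average over all questions."""
    return sum(gaps.values()) / len(gaps)


def off_zero(gaps, tolerance):
    """Questions flagged when the baseline is assumed to be 50-50 (gap of 0)."""
    return sorted(q for q, gap in gaps.items() if abs(gap) > tolerance)


def off_set_point(gaps, tolerance):
    """Questions flagged when the baseline is the estimated set point."""
    centre = set_point(gaps)
    return sorted(q for q, gap in gaps.items() if abs(gap - centre) > tolerance)


flagged_vs_zero = off_zero(item_gaps, TOLERANCE)
flagged_vs_set_point = off_set_point(item_gaps, TOLERANCE)
print(set_point(item_gaps), flagged_vs_zero, flagged_vs_set_point)
```

With these numbers the 50-50 baseline rejects the four questions that agree with one another and keeps the odd one out, while the set-point baseline does the reverse. Which answer is "right" depends entirely on what you believe the test measures.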

My general conclusion is that it seems really stupid to gender-neutralize a test by rejecting individual questions that show a gender disparity. Probably a smarter way to do it would be a kind of latent variable analysis: that is, consider GROUPS of questions that show a pattern of one gender answering correctly more often, infer some kind of latent common cause, and try to identify (through non-statistical analysis) whether it's due to something the questions are TESTING or to something about the way the questions are worded, expressed, or presented. You can't just summarily reject a group of questions because one gender answers them correctly more often without first determining the probable cause, since the disparity may very well indicate a difference in whatever the test is supposed to measure, rather than an unfair bias in the questions.
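As a rough sketch of the first step of that idea, the snippet below groups questions and flags whole groups for human review instead of rejecting single questions. To be clear, this is not a real latent variable analysis, just a simple per-group average standing in for one. The group labels, the percent-correct figures, and the review threshold are all hypothetical.

```python
from collections import defaultdict

# Hypothetical questions: (group label, men % correct, women % correct).
questions = [
    ("type_1", 70, 58),
    ("type_1", 66, 55),
    ("type_2", 60, 68),
    ("type_2", 57, 66),
    ("type_3", 64, 63),
    ("type_3", 61, 62),
]
REVIEW_THRESHOLD = 5  # assumed: average gaps at least this large get a closer look


def gaps_by_group(rows):
    """Average (men - women) gap in percent correct for each group of questions."""
    grouped = defaultdict(list)
    for label, men, women in rows:
        grouped[label].append(men - women)
    return {label: sum(gaps) / len(gaps) for label, gaps in grouped.items()}


def groups_to_review(rows, threshold):
    """Groups whose average gap is large enough to warrant reading the questions."""
    return sorted(
        label
        for label, gap in gaps_by_group(rows).items()
        if abs(gap) >= threshold
    )


group_gaps = gaps_by_group(questions)
to_review = groups_to_review(questions, REVIEW_THRESHOLD)
print(group_gaps, to_review)
```

The output is only a list of groups to read through. Deciding whether a flagged group reflects the skill being tested or the way the questions are worded is still the non-statistical part of the job.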

