Interview with Paul Cooijmans on High-range Test Construction, High-range Tests, and Statistics

2025-01-01

Author(s): Paul Cooijmans and Scott Douglas Jacobsen

Publication (Outlet/Website): Noesis: The Journal of the Mega Society

Publication Date (yyyy/mm/dd): 2024/12

Abstract 

Paul Cooijmans founded GliaWebNews, Order of Thoth, Giga Society, Order of Imhotep, The Glia Society, and The Grail Society. His main high-IQ societies remain the Giga Society and The Glia Society, both devoted to the high-IQ world. The Giga Society, founded in 1996, remains among the world’s most exclusive high-IQ societies, with a theoretical cutoff of one-in-a-billion individuals. The Glia Society, founded in 1997, is a “forum for the intelligent” meant to “encourage and facilitate research related to high mental ability.” Cooijmans earned two bachelor’s degrees, in composition and in guitar, from the Brabants Conservatorium. His interests lie in human “evolution, eugenics, exact sciences (theoretical physics, cosmology, artificial intelligence).” He continues to administer numerous societies, such as those mentioned above, to compose musical works for online consumption, to publish intelligence tests and associated statistics, and to write and publish on topics of interest to him. Cooijmans discusses: 1994; the realizations about the tests; g; common mistakes in trying to make high-range tests valid, reliable, and robust; the counterintuitive findings in the study of the high-range; the core abilities measured at the higher ranges of intelligence; skills and considerations; proposals for dynamic or adaptive tests; remove or minimize test constructor bias; listed norms; the most appropriate means by which to norm and re-norm a test; the structure of the data in high-range test results; homogeneous and heterogeneous tests; “real I.Q.” computable from multiple tests; English-based bias; questions capable of tapping a deeper reservoir of general cognitive ability; roadblocks test-takers tend to make in terms of thought processes and assumptions around time commitments; the intended age-range for high-range tests; sex differences; frauds and cheaters; identity verification; the level of the least intelligent high-range test-taker; ballpark the general factor loading of a high-range test; precise or comprehensive method to measure the general factor loading of a high-range test; appropriate places for people to start; test constructors Paul considers good; learned from making these tests and their variants; Mahir Wu; test item answers with ambiguity; sufficient clues for discovery and solution; a mere guessing logic; a test’s quality; the reduction of the references to specific test items used by other test authors; issue of test logic and design schema close-but-imperfect replication from one author by another; scale and norm; Matthew Scillitani; a stigma around high-range tests; test construction and norming processes; the easiest and hardest parts of norming and constructing of a test; tests – 51 in-use & 57 retired, which ones are special; articles in Netherlandic on test design; some submitted questions anonymously; geniuses; yourself as a genius; others who you see like yourself in studying high ranges of intelligence; most common mistake people make when submitting feedback; aspects of people’s test feedback seem confusing; Marathon Test Numeric Section; creating high-range questions; books or literature, even individual articles or academic papers, on psychometrics. 

Keywords: Cooijmans intelligence tests development, counterintuitive findings in IQ testing, difficult intelligence tests creation, high-range intelligence measurement, early IQ test construction insights, intelligence scale development, guitarist talent assessment, high-range IQ test insights, IQ testing beyond mainstream limits, high-range IQ testing, IQ tests for Giga Society. 

Scott Douglas Jacobsen: You have written high-range tests for a long time. You are thorough regarding high-range tests: a warning, the reasons to take them or not, the goals, psychologists’ access to test answers, test protection, what high-range tests measure, insights from 25 years in I.Q. testing, hypothesizing on an extended intelligence scale, humor, negative reactions, potential fraud, megalomania, and terminology. Your first test conception began in 1994, tests spread in 1995, the Giga Society was founded in 1996, and the Glia Society was founded in 1997. When in 1994, or earlier if it was earlier, did this interest in test construction truly come forward for you? 

Paul Cooijmans: I have examined the sheets of paper on which I created the first test, as well as my agendas from that period, and it appears the interest started in the spring of 1994, like April or May. 

Jacobsen: At the time, what were the realizations about the tests and the need to develop yours? 

Cooijmans: The first test was meant to assess the progress of guitarists, and I had many guitar students then, even over a hundred, including those from jobs as a replacement teacher. I was astounded at how well a guitarist’s level could be graded on this scale, and also noted that guitarists were not necessarily advancing, and that beginners were sometimes way ahead of some long-term students, which made me realize there was something like talent, and that only limited progress within one’s range of talent was possible. And I observed that the level of a guitarist on this scale seemed to reflect a more general property than just musicality or guitar-technical ability, which is why I called this instrument “Graduator for human and guitarist”. Later I realized that this general property was mostly intelligence, and that when you measure specific skills or abilities, you also catch general intelligence, often even primarily so. 

In this period (the 1990s) I was taking some mainstream intelligence tests myself. I tended to get the maximum scores they could (or would) report on tests like Cattell Culture Fair, the Netherlandic WAIS, and the entire Drenth test series (the last were the hardest and highest-level tests available in the Netherlands). When I asked what my real level was and how far I was above the reported maximum, I was told it was not possible to measure intelligence beyond about the 99th centile and that they had no tests that gave meaningful scores in that range. I also asked a few giftedness researchers about this, with the same answer. This, and the success of the Graduator, gave me the idea to create difficult intelligence tests to find out whether it was possible after all to measure intelligence at those higher levels. 

[Editors’ Note: https://en.wikipedia.org/wiki/Cattell_Culture_Fair_Intelligence_Test]

Jacobsen: You found g does not diminish, or not much, at the high range. Why? 

Cooijmans: For a large number of my tests, I computed the estimated g loadings separately for the bottom half and the upper half of scores, the separation point being the median of scores. The upper half loadings were not generally much lower than the bottom half ones, although they were somewhat lower. This is reported in more detail at: 

https://iq-tests-for-the-high-range.com/statistics/differentiation_hypothesis.html 

If the question is for the real reason behind this, I suppose it is so that when a test contains sufficiently difficult problems and is not purposely neutered to hide differences between people, it will not lose g loading in the high range as much as mainstream psychological I.Q. tests do. 

And, the limited amount of loading it does lose may be due to the statistical phenomenon of attenuation by restriction of range, in other words may be an artefact and not a real loss. 

I should explain that g loading is computed from correlations, and that correlations rely on variance. If you consider a restricted range (like the high range, or even the upper half of it as meant above) you are obviously restricting the variance compared to the full range, and therefore you are restricting the possible correlations you may find, and thus also restricting the possible g loading. This is a statistical artefact, not a real decrease of g. There may be a real decrease going on as well, of course. 
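[Editors’ Note: A minimal Python sketch of the restriction-of-range effect described above; the simulated data and noise levels are illustrative assumptions, not statistics from these tests.]

    # Sketch: how restricting the range of scores attenuates an observed
    # correlation, and thereby an estimated g loading. Numbers are made up.
    import random
    import statistics

    def pearson(xs, ys):
        mx, my = statistics.mean(xs), statistics.mean(ys)
        sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

    random.seed(1)
    # A latent ability plus independent noise for two measures of it.
    ability = [random.gauss(0, 1) for _ in range(20000)]
    test_a = [a + random.gauss(0, 0.5) for a in ability]
    test_b = [a + random.gauss(0, 0.5) for a in ability]

    full = pearson(test_a, test_b)

    # Keep only the upper half of scores on test A (a restricted range).
    median_a = statistics.median(test_a)
    upper = [(x, y) for x, y in zip(test_a, test_b) if x >= median_a]
    restricted = pearson([x for x, _ in upper], [y for _, y in upper])

    print(f"full-range correlation:  {full:.2f}")
    print(f"upper-half correlation:  {restricted:.2f}")  # noticeably lower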

Jacobsen: What are common mistakes in trying to make high-range tests valid, reliable, and robust? 

Cooijmans: I am not so certain that many other test creators are even trying to make their tests valid, reliable, and robust, but if they are, the mistakes are the following: 

(1) Making the test too short. This is bad for reliability, which increases with test length, and therefore also bad for validity, because reliability (correlation of a test with itself) is the upper limit of a test’s validity (correlation of a test with what it was intended to measure, or with anything else outside the test). Something can not correlate higher with something else than it correlates with itself. 

(2) Making a test one-sided, homogeneous, only containing one item type. This reduces validity with regard to general intelligence, and makes the test more vulnerable to fraud and to score inflation through increasing familiarity with the item type, so less robust. 

(3) Making it likely that test answers will leak out in ways as follows: Publishing the test itself online, revealing answers to candidates after test-taking, publishing item analysis so that everyone can see how difficult each item is, allowing retests (which allows people to figure out what the intended answers to some problems are), giving feedback as to which problems a candidate had wrong, answering questions about the test to candidates who are taking the test, and possibly more. 

(4) Subjective scoring of problems that do not have a single correct answer. This reduces the reliability and validity of the test; scores are not comparable between candidates. 

(5) Relying on face validity regarding what a problem measures or how hard it is. This tends to be far off. 

(6) Omitting verbal problems, thinking they are biased or unfair. This greatly limits a test’s validity with regard to general intelligence. Verbal problems span by far the widest range of abilities and hardness, and one should not throw that away. Of course it should never be about idioms or pronunciation, as those are localized and transient. Verbal problems should transcend language barriers and fashions or trends. 

(7) Omitting knowledge-requiring problems, thinking they are biased or unfair. It is only trivial, transient knowledge that one should avoid. Fundamental, general knowledge that transcends barriers greatly adds to a test’s validity. 

(8) Finally I have to include a mistake that I made myself on several occasions: helping or cooperating with the wrong persons, who later proved unreliable, deceitful, or otherwise misbehaving. Promoting tests by someone who later turned against me or denied my role, co-authoring a test with someone who then leaked out the answers, things like that. 

So, not being selective enough when deciding whether to cooperate with someone.

Jacobsen: What are the counterintuitive findings in the study of the high-range? 

Cooijmans: The first counterintuitive finding is that test problems are much harder for the candidate than for the test creator, and that a fair number of (in the eyes of the latter) ridiculously easy problems need to be included to obtain a score distribution with a discernible left tail. Going by one’s intuitive notion of item hardness, one gets a distribution with a mode at or near zero items correct, and a steeply tapering right tail from there. 

The second counterintuitive finding is the huge sex difference in participation. I would never have guessed there would be 4.5 times as many males as females taking high-range tests, and at the level of test submissions the ratio is even 10.5 because males take more tests per person. Because of this sex difference, I have recently started reporting the “proportion of high-range candidates outscored” within-sex. After all, sports like boxing have separate competitions per sex too, have they not? And closer to home, even mental sports like chess have women’s competitions, although the naive observer will have difficulty understanding the necessity for that. The sex difference in participation should be seen in the light of the general phenomenon that, on almost all types of psychological tests, the highest and lowest scores tend to come from males. This greater male spread may explain why a test focused on the high range receives more male participation. 

The third counterintuitive finding concerns a small but significant negative correlation of high-range I.Q. with various indicators of psychiatric disorders and deviance, such as actual reported disorders, disorders in relatives, and personality test scores. I had not expected this, based on the popular notion of “giftedness” as a problem that requires “help”, and based on remarks of highly intelligent people who told me things like, “I am certain that those of very high intelligence are more inclined to depression”. I do not know why this correlation is negative; maybe a high I.Q. suppresses the expression of a disorder, or maybe the disorder depresses one’s I.Q.? My observation in communication with people of known I.Q. test scores over many years is consistent with the negative correlation: the higher the I.Q. of people, the more normal they behave in the psychosocial sense (even the ones who believe that their high I.Q. makes them more inclined to depression). 

[Editors’ Note: https://prometheussociety.org/wp/articles/the-outsiders/] 

Jacobsen: What are the core abilities measured at the higher ranges of intelligence or as one attempts to measure in the high-range of ability? 

Cooijmans: Since high-range tests are typically unsupervised and untimed, certain types of tasks can not be included: working memory, concentration, working under time pressure, dexterity, motor coordination, clerical accuracy and such all require supervision. To our good fortune, most of those abilities are known to have relatively low g loadings compared to what can be included in unsupervised untimed tests: verbal, numerical, and spatial or visual-spatial problems. So a good indication of g is still possible via unsupervised testing. The factors known to have the highest g loadings are present. 

The absence of tasks as meant in the first sentence of the previous paragraph might lead one to think that high-range tests have some bias in favour of theoretical, abstract-logical, clumsy, wooden bookworm types, but this should not be taken for granted, and is perhaps even contradicted by the negative correlation of high-range I.Q. with indicators of psychiatric disorders. Also, spatial and visual-spatial tasks, which are present, are known to correlate positively with practical, performance, hands-on tasks involving motor coordination and dexterity, so that part of the missing task types are more or less covered still. And visual reasoning or visual-spatial problems have no bias against persons of low verbal ability. 

On a more general level, high-range tests can be said to demand strict reasoning, as well as the ability to recognize patterns of any kind. Pattern recognition may be related to what I have called “associative horizon”, and may include what others call “thinking outside the box” or “stepping out of the system”. The higher levels of pattern recognition, I think, require awareness, and that would imply that scores above a certain level are only possible for aware entities. Given the rise of artificial intelligence, this may become important. As long as artificial intelligence is not aware, constructors of high-range tests will need to try to limit new tests to types of problems that can not yet be solved by artificial intelligence, to avoid fraud by people consulting artificial intelligence for problem-solving. Once artificial intelligence acquires self-awareness, it should be able to solve any test problems that humans can solve. 

Jacobsen: In an overview, what skills and considerations seem important for both the construction of test questions and making an effective schema for them? 

Cooijmans: I would say that if one is highly intelligent with a reasonably balanced profile as well as conscientious, almost any skill can be learnt. The primary skill is being an autodidact. I know some have a disdain for autodidacts and consider them crackpots. But if you are doing something original, anything that has not been done before, you had better be an autodidact because no one can tell you how to do it. There exists no path to where no one has gone before. A further handicap of autodidact originality is that often you can not refer to “sources” as is customary in mainstream science. If you are the first to think of something, you are yourself the source and there is nothing already extant to refer to. 

Skills that may need to be learnt for constructing test items include expressing oneself properly through language so as to truly communicate, making positive use of comments from others, drawing, image editing, statistics, programming, organizing one’s time (days, weeks, months) in a disciplined way, getting out of bed daily, and more such obvious things. 

Examples of habits to be urgently unlearnt are the use of idiomatic expressions and abbreviations, anonymity and pseudonymity, inappropriate communication while under the influence of substance abuse, and not responding punctually to bona fide work-related communication (as in regularly letting people wait for months). This paragraph may yield some angry “Do you mean me?” responses, but it has to be said. 

There are also requirements that, unfortunately, can not be learnt, such as sincerity and sense of righteousness. 

Jacobsen: Any thoughts on proposals for dynamic or adaptive tests rather than – let’s call them – “static” tests consisting of a single item or set of items presented as a whole test, unchanging, instead of a collection of algorithmically variant or shifting items adapting to prior testee answers in a computer interface? 

Cooijmans: Firstly it occurs to me that if one is going to use a computer interface and software to assess an individual’s intelligence, analysis of observed behaviour (including communication) and of the candidate’s responses to a computer-conducted interview should already provide a quite accurate estimation. Observation and interview are the primary means of gathering information in psychology. The interview could be made adaptive, with subsequent questions depending on prior answers, but a standard interview might work just as well. In the age of artificial intelligence, this is the way to go first. 

Secondly, if one is going to use a computer interface and software anyway, the testing of elementary cognitive tasks like reaction time, decision time, perceptual threshold, and working memory capacity should probably be the next thing to do. After observation and interview, testing is the third method of collecting information. A practical problem is that all candidates may need to use hardware of the same quality to get reliable, comparable results. When letting people use their own computer, the results may be affected by the quality and speed of one’s graphical processing unit, and whether or not one has a dedicated one, for instance. 

Finally, adaptive psychometric testing might be tried. But there are problems; it is not for nothing that static psychometric tests are so much more common in practice. Adaptive testing relies on item-response theory, wherein statistical properties like difficulty and discrimination are first determined for each item by letting a group of people try to solve it. These values are later used to compute the score of the candidate being tested adaptively, the set of items used being different for different candidates. 
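[Editors’ Note: A compact sketch of the two-parameter logistic model commonly used in item-response theory, with the difficulty and discrimination parameters mentioned above; the item bank and selection rule are simplified illustrations, not Cooijmans’s procedure.]

    # Two-parameter logistic (2PL) item-response model and a naive adaptive
    # item-selection step. Item parameters are illustrative assumptions.
    import math

    def p_correct(theta, a, b):
        """Probability that a person of ability theta solves an item with
        discrimination a and difficulty b."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # Tiny illustrative item bank: (discrimination, difficulty) per item.
    item_bank = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0), (1.0, 2.0)]

    def pick_next_item(theta, bank):
        """Adaptive step: choose the item whose difficulty lies closest to
        the current ability estimate (a stand-in for maximum information)."""
        return min(bank, key=lambda item: abs(item[1] - theta))

    theta_estimate = 0.5
    a, b = pick_next_item(theta_estimate, item_bank)
    print(f"next item: a={a}, b={b}, P(correct)={p_correct(theta_estimate, a, b):.2f}")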

One problem is that statistical properties of single items are not constant in my experience, but change depending on the context in which the item is presented, and depending on the group of people attempting the item. For instance, if an item is presented among other items that are somewhat similar to it, it will likely behave as an easier item than when it is presented among items that are more different from it. And if an item is attempted by a group of conscientious people, it will have higher discriminating power than when it is attempted by unconscientious people. So the values of these item properties used in adaptive testing may be off, or as already said, single items do not have constant statistical properties, and that undermines the idea of adaptive psychometric testing. 

Also, adaptive psychometric testing as it is normally thought of requires timing and supervision in my opinion. But the worldwide high-range testing population is used to unsupervised, untimed tests, and only a tiny fraction of them may be willing to travel to the hypothetical location where one has set up one’s million-euro adaptive testing system. 

Jacobsen: How do you remove or minimize test constructor bias from tests? 

Cooijmans: It is best to prevent such bias by creating a wide variety of item types and subject matter, and by trying to think of new such types and matter with every new test. Studying comments from candidates may also help to avoid item types and subject matter that have become familiar among test-takers and that they appear to expect from you. Statistical item analysis may also indicate that there are problems with particular items, and by looking into that one may in some cases discover that the problem lies in the item’s being too similar to other items one used before. 

A few concrete methods to avoid bias are as follows: When creating knowledge-dependent items, consult a high-level thematic index of all the branches of human knowledge. One may find such in the Propaedia of the Encyclopaedia Britannica, or in old-school web directories from before search engines dominated web search. Strive to make each knowledge-dependent item come from a different branch of knowledge. This prevents the inclusion of only fields of knowledge that the test creator happens to be acquainted with. 

Vocabulary-dependent items may be constructed with the aid of dictionaries and use of a random element when choosing words to include. 
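[Editors’ Note: A small sketch of the random element mentioned above; the file name “wordlist.txt” and the sample size are hypothetical.]

    # Draw candidate words for vocabulary items at random from a word list,
    # so the selection is not limited to words the test author thinks of.
    import random

    with open("wordlist.txt", encoding="utf-8") as f:   # hypothetical file
        words = [line.strip() for line in f if line.strip()]

    candidates = random.sample(words, k=20)   # 20 words to consider as material
    for word in candidates:
        print(word)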

One may look over one’s earlier tests when creating a new one to avoid repeating item types or patterns that were used before. Not that such repetition must be avoided totally, but it should remain limited, and a significant part of the new test should be novel. 

Finally, to provide oneself with a broad pool of inspiration for possible test problems, one should expose oneself to a grand diversity of subject matter in the form of books and documentaries. This should also include materials that provide a basic understanding of fundamental sciences like mathematics, physics, chemistry, biology, astronomy, and so on. One should aim to understand nature, reality, the universe, and awareness at the deepest level. The desire to understand existence is behind all great works of art and science. 

Jacobsen: How do we know with confidence that the listed norms are, in fact, reasonably accurate on many of these tests? What is the range of sample sizes on the tests, even approximately, now? Practically speaking, for good statistics, what is your ideal number of test-takers? You can’t say, “8,128,000,000.” 

Cooijmans: For the norms that I have made, the norming method is explained in the statistical report of the test in question, and some further explanation is referred to from the report. The reports contain about all the statistics that can be revealed without violating candidates’ privacy and without damaging the security of the test. So if one understands the report, one knows how much confidence to have in the norms. In fact, I have devised a measure of quality of norms, based on the number of score pairs used and on their correlation with the object test. 

Since the norms are anchored to other tests and not based directly on the general population (as opposed to the high-range population, for which I do have direct norms) it remains a question how close the high-range norms would be to the general population norms in that range, if tests existed that were normed directly on the general population and extended into the high range. The best indication thereto that I know of is the Mega Test by Ronald K. Hoeflin, which was normed mainly on the old Scholastic Aptitude Test and Graduate Record Examination, which did seem to give meaningful scores into the high range, and thus form an anchor point between the general United States population and the high-range population, albeit that the G.R.E. was administered to a clearly above-average sample of the population so that the S.A.T. is ultimately the true anchor point. 

Hoeflin’s Titan and Ultra Tests were normed to be consistent with the Mega Test norms, I think. The same goes for my early tests, and over the years I have tried to keep the norms in accordance with that anchor point over many generations of norms. To facilitate this, I have invented protonorms, which form an extra layer between raw scores and I.Q.’s, so that adjustments can be made in the relation of protonorms to I.Q. without having to change the norms of every single test. So, the question as to how we know that the norms are reasonably accurate, in one sense, goes back to Hoeflin’s interpretation of reported Scholastic Aptitude Test and Graduate Record Examination scores, and scores on possible other tests used in norming the Mega Test, such as Cattell Verbal (also called Cattell B). Someone once sent me the data from the “Omni sample” of Mega Test scores, with known scores on other tests and correlations, which is how I know that the two mentioned educational tests provided the bulk of the norming data. I assume that Hoeflin had the population percentiles of the S.A.T. scores and used those as the main source of the Mega Test norms. 
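[Editors’ Note: A schematic illustration of the protonorm layer described above; all numeric values are made-up placeholders, not actual norms.]

    # Two-layer norming: each test maps raw scores to protonorms, and one
    # shared table maps protonorms to I.Q. Adjusting that single shared table
    # re-norms every test at once. All values below are placeholders.
    raw_to_protonorm = {
        "Test A": {10: 300, 11: 305, 12: 311},
        "Test B": {25: 298, 26: 304, 27: 309},
    }

    protonorm_to_iq = {298: 138, 300: 140, 304: 143, 305: 144, 309: 147, 311: 149}

    def iq(test_name, raw_score):
        proto = raw_to_protonorm[test_name][raw_score]
        return protonorm_to_iq[proto]

    print(iq("Test A", 11))   # 144 under these placeholder tables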

But there is more. Over time I have come to understand that the high-range score distribution itself contains information that is likely of an absolute nature and may help to anchor the norms or keep them consistent over time: The mode or modal range of high-range scores (when many scores are aggregated, for instance by combining the scores from many tests) occurs in the I.Q. 130s by current norms; below it, scores taper off steeply, above it, shallowly. This mode seems to be the point below which people feel less or not attracted to take high-range tests, and as such it should represent an absolute intelligence level; the level from where people are interested in intellectual endeavours, one might say. 

Also, the level reached by the very highest scorers seems about constant over time, and falls between I.Q. 180 and 195 with the current norms. I am even carefully evolving to the viewpoint that this may be the highest intelligence level possible for any brain. So one could say that the norms in the high range are also defined by these two absolute (though coarse-grained) indicators (mode and maximum), not just by equation to scores from other tests. And, the number of scores that occur at these respective ranges are such that the current norms appear to be correct, that is, roughly in accordance with what one would expect given the predicted rarity in the general population of those I.Q. levels in a normal distribution. In fact one could theoretically norm the high range using these two indicators as anchor points, not needing scores from mainstream tests at all. And one could extend those norms linearly downward to include the normal range of intelligence, and the resulting scale might be better than that of actual mainstream tests normed directly on the general population. This is so because the general population and its average intelligence are changing, and therefore the norms of mainstream tests adapt to this change and are merely relative to the current population, not absolute. The high-range norms are the real, absolute indicator of intelligence. 

The sample sizes of high-range tests vary from 0 to about 400, but for those with good norms mostly from 36 to 225 or so. The ideal number of test-takers to norm a test is about 64. More is not necessarily better, because as the submissions keep coming in and go into the triple-digit range, the later scores may not be fully comparable to the earlier scores anymore due to things like answer leakage and increased familiarity with item types, and the norms may be affected by that and become unfair to the earlier test-takers. This can be countered by replacing problems that have become too easy (have leaked answers) but that changes the test, which also makes later scores less comparable to earlier ones, and if you change more than a little bit, you have to call it a revision and start over at zero collecting statistics for that new version. 

High-range tests that appear to have very large samples, like around 300 or more, have generally achieved this through undesirable manipulations like retesting under false names, or combining retests with first attempts in the same sample, and so on. 

Jacobsen: What are the most appropriate means by which to norm and re-norm a test when, in the high-range environment so far, the sample sizes tend to be low and self-selected, so attracting a limited supply and, potentially, a tendency in a restricted set of personality types? Dr. Ronald Hoeflin was claimed to have the largest sample size of the high-range test constructors. Do you have the largest legitimate sample size of any high-range test constructor at this time, now, based on over a quarter century conscientiously gathering data? You were the most recommended person to interview for this series. 

Cooijmans: In my experience, the best way to norm a high-range test is to rank-equate its raw scores to normed scores of the candidates on other tests. The other tests to be used should be selected based on their correlations with the object test; one sets the correlation threshold such that one obtains enough pairs. I have recently begun to set the threshold so that it maximizes the quality of norms, as given by a mathematical expression that uses the number of pairs and the weighted mean correlation. Thus it is objective, avoiding human decision. The expression that represents quality of norms is operational and may be improved as insights advance; I mention this because I know some are inclined to take these statistics as final and absolute, but they are parameters or controls that one sets to tune the system. 
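[Editors’ Note: A stripped-down sketch of rank equating as described above; the score pairs are invented for illustration, and the actual procedure involves more, such as selecting tests by correlation.]

    # Rank-equate raw scores on the object test to the candidates' normed
    # scores on other tests: the k-th lowest raw score gets the k-th lowest
    # norm. The pairs below are illustrative placeholders.
    pairs = [  # (raw score on object test, candidate's norm from other tests)
        (25, 152), (12, 141), (36, 171), (18, 134), (31, 160),
    ]

    raw_sorted = sorted(raw for raw, _ in pairs)
    norm_sorted = sorted(norm for _, norm in pairs)

    norm_table = dict(zip(raw_sorted, norm_sorted))
    print(norm_table)   # {12: 134, 18: 141, 25: 152, 31: 160, 36: 171}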

I deny that high-range sample sizes are low. They are in the dozens to hundreds as I said above, and that is well into the range of mainstream test samples and more than enough for good statistics. Considering that the high range consists of only a fraction of the population, it is to be expected that the samples are smaller, and in fact they are not much smaller at all. The notion that mainstream I.Q. tests have enormous samples is mistaken. Typically they have several hundred per norm group. Norm groups exist for age ranges, but sometimes also for educational levels. In the Netherlands there are different levels of secondary schools, and mainstream I.Q. tests may have separate norms per level, sometimes even based on only a few dozen per level (like in a Netherlandic version of the WAIS some years ago). A test often used by Mensa was normed on 3,000 people, but divided over five age groups from 13 to 16 years, so the actual norms were based on 600 per age group. In other words, they used high school students. And such norms have often been used for decades, ignoring the inflation of scores called “Flynn Effect”. But in the minds of some people, the illusion is persistent that these “standard tests” are normed on hundreds of thousands or even millions, and form a kind of gold standard of I.Q. testing. 

The largest samples are found in educational tests, but not as large as some think. In the Netherlands, a test called Cito-toets has long been used in the last year of primary school, yearly taken by about 100,000 children. But not normed on that number! The norms were established by administering an anchor test to a sample of about 4,500 shortly before the actual test, and then equating the anchor test scores to the actual test scores. This helped to keep the standard scores stable throughout the years (the contents of the anchor test would remain the same for a number of years, while that of the actual test changed per year). 

My own Cito report from 1977 shows a percentile of 100, which is uncommon but probably means the actual value was above 99.95, as a later statistical report by Cito I got to study contained a table where percentiles were rounded to 100 if the actual value was above 99.95. I have asked Cito in the mid-1990s what the precise value was, but they could not tell me, they only had kept percentiles as whole numbers. Similarly, I inquired about my scores on a comprehensive test given to us in secondary school around 1980, something like the Differential Aptitude Test, but was told those scores had not been saved. We never got a score report for that test at the time, but I understood from teachers I had done extremely well, and on a parent’s evening (which my parents never attended) a teacher told the public that I was a one-off (“unicum” was the Netherlandic word used). This teacher died in 2013, incidentally. 

On the whole, I believe that high-range psychometrics is much more careful than mainstream psychometrics when it comes to the quality of norms and handling of score inflation by causes like answer leakage or people becoming more familiar with particular item types. 

I might have the largest sample size of current high-range test constructors. It includes over 3,000 individuals, over 6,500 scores on I.Q. tests scored by me, over 2,900 reported scores on other tests, and over 22,000 data points on personal details, including personality tests. But more importantly, I have organized that data in an accessible way and automated the processing of it. I did all the programming myself, including the statistical functions. 

Regarding a potentially restricted set of personality types and self-selection, it is inevitable that persons in the high range of intelligence differ in personality from those in the normal range and from those in the low range. This does not invalidate the norms in the high range. In fact, intelligence itself is a major aspect of personality. Self-selection is less of a problem than it seems because in general, people like doing what they are good at, so those attracted to taking high-range tests will mostly be intelligent. This is also illustrated by the rareness of low scores; only 3.5 % of scores fall under I.Q. 120 and 15 % under 130 (and no, this is not because the norms are too high, as self-doubting candidates sometimes suggest). Precisely what is going on with intelligence, non-cognitive personality traits, and brain-related disorders in the high range, and how this leads to creativity and genius in some, is an interesting question and I hope to understand more of it later on. 

Jacobsen: What is the structure of the data in high-range test results? Do homogeneous and heterogeneous tests change this? 

Cooijmans: Data structure is so important that someone who starts out collecting data for some purpose should ideally think out the database design beforehand. Once you have collected a lot of data, it becomes hard to make big changes to the design. The data structure of high-range tests looks as follows: 

At the top level there are five sections: 

(1) Descriptive information records for each test or type of personal datum. Each test or datum has a record here, and each record contains fields that hold information such as the test name, its maximum score, its contents types, and whatever further descriptive information there is. Conceptually, one may even imagine the tests themselves residing here in their respective records, but in practice one will probably not store actual tests in a database but think of the database as referring to tests that exist in a reality outside the database. 

(2) Candidate records. Each candidate has a record here, and each record has fields that hold the personal data of the candidate, and the candidate’s scores on the respective tests. Notice that a record here has hundreds of fields, but most or all candidates have only part of those hundreds of fields filled, depending on how many tests they have taken (each test has a field). 

Conceptually, one may even imagine the candidates themselves residing here in their respective records, but in practice one will probably not store actual candidates in a database but think of the database as referring to candidates that exist in a reality outside the database. 

Technically speaking, the test scores stored here are redundant insofar as they are also available from section (3), but for reasons like faster processing and reducing load on the processor, redundant fields are sometimes included in databases. 

(3) Test submission records. In this complex section, each test has a table, and each table has one record for each submission to that test, and each record has fields that hold information like some personal details of the candidate (corresponding to a record in (2)), score and possibly subscores, and the item scores, so for each item typically 0 or 1 for wrong or right, but any range of item scores is possible. Conceptually, one may even imagine the submitted answers themselves residing here in their respective records, but in practice one will probably not store actual submitted answers in a database but think of the database as referring to submitted answers that exist in a reality outside the database. 

Do make certain to understand the difference between “test” and “test submission”. Some say the first when they mean the latter, but the above paragraph illuminates the necessity to distinguish the one from the other. 

In this section in particular there is some appropriate redundancy in the form of for instance sex and age of the candidate (also available from section (2)) and scores and subscores (can also be computed dynamically from the item scores). But for reasons like faster processing and reducing load on the processor, redundant fields are sometimes included in databases. 

(4) Test norm records. This complex section has a table for each test, and each table has one record for each possible score on that test, and each record has fields that contain the raw score and the corresponding norm (in my case this is a protonorm). 

(5) Norming scale records. This section has one record for each norm as may be contained in (4), and each record has fields that hold the norm and corresponding values on other scales for that norm, for instance percentiles, proportions outscored per sex, and I.Q. if the actual norm is not an I.Q. (such as in my case, where protonorms are the norms contained in (4)). 

This structure has emerged over time as a natural reflection of the data itself. Someone who starts from scratch may well find that a completely different approach works too. Perhaps one would rather avoid any redundancy? As long as one has thought it over carefully. 
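[Editors’ Note: A compressed sketch of the five-section structure described above, using Python data structures as stand-ins for database tables; the field names and values are illustrative assumptions, not the actual database design.]

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class TestInfo:                  # (1) one descriptive record per test
        name: str
        max_score: int
        content_types: List[str]

    @dataclass
    class Candidate:                 # (2) one record per candidate
        name: str
        sex: str
        birth_year: int
        scores: Dict[str, int] = field(default_factory=dict)  # test name -> score

    @dataclass
    class Submission:                # (3) one record per test submission
        test_name: str
        candidate_name: str
        score: int
        item_scores: List[int]       # typically 0 or 1 per item

    # (4) per-test norm tables: raw score -> protonorm
    norms: Dict[str, Dict[int, int]] = {"Example Test": {30: 311}}

    # (5) one shared scale: protonorm -> values on other scales
    scale: Dict[int, Dict[str, float]] = {311: {"iq": 149, "percentile": 99.95}}

    print(scale[norms["Example Test"][30]])   # {'iq': 149, 'percentile': 99.95}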

Jacobsen: What should be done with homogeneous and heterogeneous tests? 

Cooijmans: I consider only heterogeneous tests able to give a good enough indication of general intelligence, and use the term I.Q. only for heterogeneous tests, not for homogeneous tests. Also I refuse to administer homogeneous tests because I do not want to confront people with a score that is a less good indicator of their intelligence, and do not want to facilitate people who want to show such a less good indicator to others and thus give a misleading impression of themselves. 

Heterogeneous tests are tests that contain at least two different items types out of verbal, numerical, and spatial (sometimes I use “logical” as a type too). If one wants to study the intercorrelations of different homogeneous tests, the best way to do so is to use a heterogeneous test that has different homogeneous sections or subtests. One can then do correlation analysis or even factor analysis within such a sectioned heterogeneous test. That is also how factor analysis is traditionally done. A great advantage of this approach is that the sections or subtests will always have been taken by exactly the same group of candidates, and that is required for proper factor analysis. 
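[Editors’ Note: A brief sketch of correlation analysis within one sectioned heterogeneous test, where every candidate has taken every section; the section scores are invented for illustration.]

    # Intercorrelations of the verbal, numerical, and spatial sections of a
    # single heterogeneous test, all taken by the same candidates.
    import statistics

    def pearson(xs, ys):
        mx, my = statistics.mean(xs), statistics.mean(ys)
        sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

    sections = {  # one score per candidate per section (illustrative)
        "verbal":    [14, 18, 11, 20, 16, 9, 17],
        "numerical": [12, 17, 10, 19, 15, 8, 16],
        "spatial":   [13, 16, 12, 18, 14, 10, 15],
    }

    names = list(sections)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(f"r({a}, {b}) = {pearson(sections[a], sections[b]):.2f}")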

Some of my heterogeneous tests have homogeneous subtests that are normed in their own right to “standard scores” (on the same scale as I.Q.), and in that case one can also compute the correlations of such a subtest with homogeneous subtests that reside in other such compound heterogeneous tests. But I dislike this complication and am striving to move to having only non-compound heterogeneous tests; that is, with sections not normed in their own right, or without sections, just with different item types mingled throughout the test. Another disadvantage of correlations between the subtests from different heterogeneous tests is that those subtests have been taken by different groups of candidates, so that proper factor analysis will not be possible, if one was thinking of that. 

Jacobsen: People take multiple tests. They crunch those numbers. There is an implied claim of a real I.Q. from this crunching of numbers across multiple tests. Is there such a “real I.Q.” computable from multiple tests? 

Cooijmans: In theory there is, but in practice there are problems that hinder the computation of a real I.Q. across tests. In the high-range community of candidates, many have taken enormous numbers of tests, dozens at least, and sometimes more than a hundred. It is problematic to compute a real I.Q. in the usual way from all taken tests for reasons like the following: The intercorrelations of the tests are mostly unknown, and there are too many intercorrelations for them to ever be known in the first place. Some tests may have bad norms. Some scores may be fraudulent. If a selection is made from the taken tests to narrow it down, this may be a non-representative selection. For example, a candidate having taken thirty tests may like to have a real I.Q. computed from one’s top several scores, which are already way above the real level of the candidate, and then the computed I.Q. will be even higher than the average of those top several scores due to the formula used. 

The formulas for computing a real I.Q., such as “Ferguson’s formula”, take the average of the input scores and add something based on the correlations between the tests. With a perfect correlation, the outcome is simply the average. The lower the correlation(s), the higher the outcome. With zero correlation, you get something like a full unit of spread on top of the average. This may be correct in theory, but in practice leads to inflated outcomes. Apart from using a non-representative selection from one’s scores, another cause of inflation with these formulas is the fact that the known correlations between the tests are often underestimations of the true correlations due to incompleteness of the data and restriction of range. The groups who have taken the respective tests have only limited overlap for any pair of tests, and this overlap may suffer from selective reporting, and all in all this depresses the correlations. And lower correlations mean that the formula will yield a higher outcome. Underestimated correlations inflate computed “real I.Q.’s”. 
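[Editors’ Note: A textbook composite-score formula with the behaviour described above (equal to the average under perfect correlation, higher as correlations drop); this is offered as an illustration and is not necessarily the exact “Ferguson’s formula” referred to.]

    # Standardized composite of k standardized scores with mean
    # intercorrelation r: z_c = sum(z_i) / sqrt(k + k*(k-1)*r).
    import math

    def composite_z(z_scores, mean_r):
        k = len(z_scores)
        return sum(z_scores) / math.sqrt(k + k * (k - 1) * mean_r)

    scores = [2.0, 2.0, 2.0]         # three tests, each 2 SD above the mean
    for r in (1.0, 0.5, 0.0):
        iq = 100 + 15 * composite_z(scores, r)
        print(f"mean intercorrelation {r:.1f} -> composite I.Q. {iq:.0f}")
    # 1.0 -> 130, 0.5 -> 137, 0.0 -> 152: lower correlations inflate the outcome.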

Also, when a person takes multiple tests, a learning effect may take place as a result of which the scores become somewhat higher. This increase then comes in addition to the compensation for imperfect correlation that is built into these formulas for “real I.Q.” 

For tests scored by me, I have devised a “qualified average I.Q.”, which tries to avoid the problems with these “real I.Q.’s”. Since I always have the complete data, no selection bias can inflate the average. The problem of underestimated correlations inflating the outcome is avoided by not using the computed correlations but assuming perfect correlations. If it seems unfair not to compensate for imperfect correlations, one may imagine that the learning effect from taking multiple tests replaces this compensation, so to speak. Finally, the computation is resistant to outliers. This is not claimed to be someone’s real I.Q., but I believe it is better than something like “Ferguson’s formula”. The exact formula of the qualified average I.Q. is operational and may be perfected over time as needed. 

Jacobsen: Is English-based bias a prominent problem throughout tests? Could this be limiting the global spread of possible test-takers of these tests, confining them to particular language spheres? Although, these tests are taken, to a limited degree, in many countries in all or most regions of the world. 

Cooijmans: Such bias is a problem, but how prominent it is depends on what one’s native language is and on whether one knows English. For other Germanic languages it is a smaller problem than for non-Germanic languages, and it is worst for East Asian languages. The fact that reference aids are allowed solves a big part of it, but for a nonnative English speaker there remains a disadvantage, which I have estimated at up to 5 I.Q. points. Without reference aids (on a verbal test that disallows reference aids) this would be more like 30 I.Q. points for this non-native English speaker, and for someone who does not know English altogether it is in my opinion better not to attempt the tests at all. 

It certainly limits the global spread of test-takers, especially in the areas where few people know English and the local language is very different from Germanic languages. I have always thought that the best solution for this is that people in such areas create their own tests in their own languages. 

In recent years it has become somewhat common for people to try tests in a language they do not know. Of course one has an unpredictable disadvantage then. 

Jacobsen: When trying to develop questions capable of tapping a deeper reservoir of general cognitive ability, what is important for verbal, numerical, spatial, logical (and other) types of questions? 

Cooijmans: That reservoir will likely be tapped almost regardless of the questions, as general intelligence expresses itself through virtually everything a person does or says. Important are things like having a wide diversity of questions and types of questions, and avoiding localized, transient subject matter like idioms, abbreviations, pronunciation matters, and local or fashionable knowledge, as such does not transcend barriers of language, culture, and age. Fundamental knowledge that is the same for everyone in the world is good; knowledge that is bound to a geographic area, in-group, or period is bad. For these reasons, and contrary to what some think, high-brow vocabulary and subject matter are more culture-fair than low-brow vocabulary and subject matter. 

One should also be aware that learnt skills have no g loading; it is novel tasks that have g loading. Candidates sometimes complain that they have no idea what is expected from them when taking a test, or how to tackle it; but that is exactly the intention, that is how intelligence testing works! And candidates may be happy when they see a type of problem they have solved before because they know what to do then; but that is where their intelligence is NOT being used. Those problems have lost their g loading for them. So one should try to create problems that are different from what has been seen before, to enforce the use of intelligence. 

To illustrate that even esteemed test constructors do not always understand the loss of g loading of learnt skills, here is an anecdote: Some years ago on a social medium, I saw a test author proudly mention that his young child had scored over I.Q. 160 or so on one of the father’s own tests; after extensive coaching by the father/test creator, of course! 

Another observation regarding tapping into general cognitive ability: Good test problems are such that solving them is similar to making discoveries in the real world, unravelling the laws of nature and the universe. 

Jacobsen: What roadblocks do test-takers tend to create for themselves in terms of thought processes and assumptions around time commitments on these tests, so that they get artificially low scores on high-range tests? Also, what is the confusion among smart (and, potentially, not-so-smart) people about the time taken on a test to get a score versus the intrinsic intelligence needed to get said score? You noted the latter point in one of the recent videos answering questions on your YouTube channel. 

Cooijmans: The idiomatic use of “roadblocks” is an example of what should not be in an intelligence test. Such an idiom is only understood within a narrow linguistic region and a restricted time period. It can not be understood without already knowing what it means. It can not be understood from the word itself or its context. The avoidance of idioms requires high intelligence and an abstract-literal mind. 

The test instructions state that there is no time limit. Yet some think that their score will be unrealistically high and invalid if they spend “too much” time. It has happened that someone said, “I have now been looking at this test for so long that I can not submit it any more, I found all the answers, it would not be fair”. But that is exactly the intention with untimed tests; that one continues until one finds no further solutions. 

The confusion meant in the question is probably the notion that someone who uses less time is smarter than someone who uses more time to arrive at the same score. But the principle of untimed testing is that this is not so, and that “until one finds no further solutions” is the right amount of time, irrespective of how long that is. This principle is based on the finding that when the allowed time is increased on a timed test, the test’s g loading rises. With supervised tests one needs to have a time limit for practical reasons, but with unsupervised tests one can leave out the limit entirely. 

I must add that I have nothing against supervised tests, provided they have a very broad time limit, something like three hours for a comprehensive test. But this is not feasible in the high-range testing practice. I can not get people from all over the world to travel to a place here where I can test them, and I can not set up testing locations worldwide in all countries. I tried, but the number of candidates willing to make use of such was negligible compared to regular unsupervised tests. So I stopped. And then there is always someone who says, “I would be willing to travel to you if you started with that again”. But one or two people is not enough to justify the significant effort and time put into such a project. If others wish to try it, go ahead. 

Jacobsen: What is the intended age-range for high-range tests? How do these account for individuals younger and older than this range? 

Cooijmans: From about 16 upward with no upper limit I would say. Older people do decline, but it is important that they participate in order to enable the study of this decline. Younger people are allowed to take the tests, and in practice, 12 years is about the lower limit. But they should be aware that they have not reached their adult level and will score lower than they will later be capable of. The steep increase of intelligence in childhood tapers off at about 16 and becomes shallow then, hence the idea that one enters one’s adult intelligence range at 16. 

Another way to answer this would be “after puberty”. Individuals, sexes, and ethnic groups differ in their childhood development, then puberty messes everything up, and after puberty things have mostly settled. That is why childhood studies of mental ability are so misleading; they misrepresent possible sex and ethnic differences. Puberty has normally completed by or before age 16-17. Age of onset of puberty varies greatly per individual, sex, and population, and tends to be one to two years earlier for girls than for boys. 

There are no separate norms per age group as that would hide the development of intelligence with age. And one wants to reveal that development, not hide it. Also, all candidates are treated and addressed as mentally mature adults, regardless of age. The development of intelligence with age plausibly differs per sex, which is why it should be studied within-sex; the most recent tabulation I made is at http://iq-tests-for-the-high-range.com/statistics/age.html 

Jacobsen: It is somewhat common knowledge that there are sex differences in the measurement of intelligence: men do better at visuo-spatial subcomponents and women do better at verbal-emotional processing. What is important in constructing and norming a test if these and other differences exist? What similarities exist that would leave this process unchanged? 

Cooijmans: There are indeed sex differences in aspects of mental ability. In constructing unsupervised high-range tests, it is not possible or meaningful to take these into account. One should just include the widest variety of item types usable in unsupervised tests and focus on high mental ability regardless of sex. 

Women have the bad fortune that the aspects on which they are known to outscore men mostly require supervision and timing, and can therefore mostly not be included in unsupervised tests. According to Arthur Jensen in Chapter 13 of his book The g Factor, these aspects are simple arithmetic, short-term memory, fluency (for instance, naming as many as possible words starting with a given letter within a limited time), reading, writing, grammar, spelling, perceptual speed (for instance, matching figures), clerical checking (both speed and accuracy, things like underlining certain letters in a text, or digit/symbol coding), motor coordination, and finger and manual dexterity. This problem is less serious than it seems because these are mostly lowly g-loaded tasks (not by anyone’s decision but because it happens to be so) so that the overall score will not be affected much by their absence, but it may be affected somewhat. This is related to what was observed in my answer to “What are the core abilities…” 

In norming, the proportion of high-range candidates outscored should be provided within-sex for reasons of transparency. I.Q. norms should be sex-combined as is usual. 

Jacobsen: Cheaters exist. Frauds exist. How do you a) deal with frauds and cheaters on tests and b) prevent fraud and cheating on those tests? Have reference texts been a problem in this? Does artificial intelligence complicate matters more? (If so, how?) 

Cooijmans: When I discover that someone committed fraud, I will discard the fraudulent score in the database and make a note so that I can exclude that person from further testing and from society admission. This is sometimes complicated by the use of multiple false identities by such a person. If the person is a member of a society I am an administrator of, I will expel the person. In communication with other test creators or societies I may reveal what I know about the person if that seems appropriate. I do not believe there exists an organized system for sharing information about frauds between test creators; perhaps there should be. 

Attempting to prevent fraud is done, for instance, by not publishing the test itself, letting people prepay, not sending tests to known frauds and so on. And if I find out that answers to particular test items are published or spread somehow, I will do something about it; mostly it comes down to replacing the items, sometimes leading to a revised version of the test. Sometimes a test is withdrawn entirely. 

I am not aware of reference texts that were involved in fraud. Artificial intelligence complicates things because frauds might consult it to solve test problems, which is not allowed, as the test instructions state not to obtain answers from external sources but only to use answers that one has thought of oneself. To reduce this complication I try to create problems for new tests so that current artificial intelligence, insofar as I apprehend it, can not solve them. I try to make the problems so that, once artificial intelligence becomes able to solve them, it will also be able to take tests and join societies of its own accord. I believe that will happen one day, but fear this day lies quite far into the future. If I had to guess I would say half a century. 

Jacobsen: It helps to have other data from other tests and personal data for identity verification. What information from other tests is helpful/necessary for research purposes of high-range tests? What is an efficient and appropriate format to provide this score information? What personal data is necessary from candidates, if any? What information would be helpful for research purposes from candidates? 

Cooijmans: Scores on other tests should best be reported in a format as follows, insofar as known: 

[Test author or issuing organization] [Test name] [Raw score] [I.Q.] [Standard deviation of I.Q. scale used] [Percentile] 

Scores should best be grouped by the first field (Test author), of course starting a new line for each score. Nowadays there exist hundreds of tests, and I can not know off the top of my head which test is from which author or organization, so if that first field is left out when reporting scores, which is common, this causes many minutes of extra work in processing that information. 
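
For illustration only, a single reported score in that format might look like this (all values invented): [Doe, J.] [Example Test] [25] [140] [15] [99.6].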

Concerning personal information, I need at least name, sex, year of birth, country of origin, and highest achieved educational level. Some further information I collect is the educational levels of the biological parents, the presence of a psychiatric disorder, and the presence of such disorders among parents or siblings. 

Notice that I find the exact date of birth not strictly needed. It is about studying the development of raw intelligence with age, and with adults, year of birth suffices. In childhood testing, one would want it to the month. 

Regarding psychiatric disorders, I do not ask for the particular disorder as that would require too much detail, too many options, too much complication in the statistical processing of it. 

And country of origin is a pragmatic imperfect proxy of origin. One might consider asking for race or ethnicity, but such categorization is logistically problematic when one looks into it, has many complications, may be considered unethical by some, and some may refuse to reveal their status. 

Other data that might be useful to collect include religiousness and femininity/masculinity (independent of sex). The possible correlation between religiousness and high-range I.Q. could then be established, which many are wondering about. And one could verify the anecdotal observation that intelligent men are more feminine than average men, and intelligent women more masculine than average women. In other words, there is more gender diffusion in the high range, which would point to an optimum for intelligence somewhere between the average male and female positions on the femininity/masculinity dimension. Notice that the term “gender” is for once used correctly here. I am uncertain whether people would be able to simply report their own position on femininity/masculinity, or whether this would require a test or questionnaire. 

Jacobsen: What is the level of the least intelligent high-range test-taker now? What is the level of the most intelligent candidate now? What is the mean, median, and mode, of the scores of test-takers’ data gathered so far? Within a range of I.Q. 10 to 190 on an S.D. of 15, when should a candidate consider taking, or in fact take, high-range tests? 

Cooijmans: The least intelligent seems to be in the I.Q. 80s. The frequency of such is one in thousands of high-range candidates. The most intelligent is plausibly between 185 and 195. One can not be certain yet about the accuracy of the norms there. And with candidates apparently far below the general population average, a problem is that they tend not to report usable information, so one has to resort to observation, life history facts they happen to mention, and aids like an online writing-to-I.Q. estimator. 

The median is protonorm 401 (I.Q. 139) according to the latest computation I did of highrange quantiles. I never compute the mean, but that should be several I.Q. points higher because the distribution is skewed to the high side. The mode is protonorm 387 (I.Q. 137), but one could also say there is a modal range in the 130s. A mode always depends on how wide or narrow one chooses the classes of the frequency table. 

People should consider taking high-range tests if they score above the 98th centile on some mainstream test, which is I.Q. 131 (or 130 on some tests that round differently). Below I.Q. 120 there is no reason to try high-range tests, but there is no objection to doing so anyway. There is a grey area from 120 to 130 because one does not score the same on different tests. 

Jacobsen: What is an efficient means by which to ballpark the general factor loading of a high-range test? 

Cooijmans: I have always used the square root of the weighted mean correlation of the test with other I.Q. tests as an estimation of the g loading. This works well for comparing different high-range tests. It is not a true g loading because the tests have not all been taken by the same group of candidates, but by different groups with limited overlaps, correlations obviously being computed for those overlaps. Also, when reported scores from other tests are involved, those may suffer from selective reporting, which depresses the correlations. If all the involved tests had been taken by exactly the same group of candidates, one would be close to a true g loading. 
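
As a rough illustrative sketch of that estimate in Python, with invented correlations and overlap sizes, and assuming for illustration that each correlation is weighted by the number of candidates in the overlap (one plausible choice):

import math

# Invented example data: (correlation with another test, size of the overlap).
overlaps = [(0.72, 85), (0.65, 40), (0.58, 23)]

weighted_mean_r = sum(r * n for r, n in overlaps) / sum(n for _, n in overlaps)
estimated_g_loading = math.sqrt(weighted_mean_r)

print(round(weighted_mean_r, 2), round(estimated_g_loading, 2))  # 0.68 0.82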

Another thing to consider is that high-range tests as I use them are almost all heterogeneous tests, that is, combining different item types within the test. But in classical factor analysis, one uses a set of different homogeneous tests that have each been taken by each individual from a group. Typically, these are school exams for the various subjects administered to a school class, or subtests of a comprehensive psychological test like the Wechsler Adult Intelligence Scales. Via factor analysis, one then computes g and other factor loadings per subtest or exam, and these g loadings vary greatly and may be very low for some subtests. So this kind of analysis is not so much done with a set of heterogeneous tests; computing correlations between heterogeneous tests is more something seen in high-range psychometrics. In classical factor analysis, wide-range heterogeneous tests are considered pure indicators of g, and it is the loadings of the various subtests or exams that one is interested in. 

Jacobsen: What is the most precise or comprehensive method to measure the general factor loading of a high-range test, a superset of tests, or a subset of such a superset? 

Cooijmans: The first is answered in the previous question. For a superset of tests it is not needed to compute such a loading; the superset can safely be considered a near-perfect indicator of g. 

Jacobsen: What seem like the most appropriate places for people to start when taking your tests – taking into account their own skill sets – or others’ tests, for that matter? 

Cooijmans: I would recommend the privacy of one’s home. If “places” is meant nonliterally – as one sees, I am not one of those pedants who take everything literally – then a real computer to view the test is best, or at the very least a decent laptop (although I am really against the unneeded use of battery-powered devices). A smart telephone is no place to start. 

In case the non-literalness is even more remote, always start with the easiest tests. It is bizarre how it can occur to people to start out with the hardest tests, and how they subsequently can not understand what a score of zero means and keep asking for years thereafter what their I.Q. was on that test and if it means they are “gifted”. By looking at the test norms one can know how hard a test is. Nevertheless, I have recently ordered the list of available tests by difficulty to accommodate this. 

Jacobsen: What tests and test constructors have you considered good? 

Cooijmans: Constructors: Kevin Langdon, Ronald K. Hoeflin, a Netherlandic person who withdrew from the I.Q. societies so I had better not name him, Edward Vanhove, Hans Eysenck, Bill Bultas, Laurent Dubois. 

Tests: Mega Test, Magma Test, tests from the self-test books by Eysenck, Chimera Test, 916 Test. 

Regarding Langdon, studying his tests and statistical reports was instructive, if only because it told me which approaches were not so successful in measuring high-range intelligence. This includes attempting to make it more or less culture-fair, using multiple choice with a small number of options, item weighting based on item analysis (this gives too much weight to a small number of items, and the fact that single items do not have constant statistical properties undermines the idea of weighting based on those properties), selecting items for a shorter test based on their statistical behaviour in an earlier longer test (the items’ behaviour is different in different tests), and norming that shorter test based on statistics from the earlier longer test (the norms then become mostly too high). It also showed that statistics from classical psychometrics, such as the reliability coefficient, are woefully inadequate to assess the quality of a high-range test. 

Jacobsen: What have you learned from making these tests and their variants? 

Cooijmans: I assume this is about my tests, not the tests by others from the previous question. The main points have already been mentioned in the questions about counterintuitive findings and about g not diminishing much in the high range. I could add the observation that intelligence expresses itself in almost everything a person does or says. I did not know that when I started. 

In case the question is about the tests from the previous question, I already answered it there for Langdon’s tests. With Hoeflin’s tests, I learnt the norming method of rank equation, and the destructive effects of fraud through retesting, false names, and cooperation. In correspondence with others in the 1990s, I was appalled when people proudly told me they were collaborating in a group to “crack” the Mega Test. When they told me they had retested under their own name or another name (the instructions said the test could be taken only once, but Hoeflin allowed retests in practice). When someone told me he had first taken all of Hoeflin’s tests under his own name, then his sister’s name, then the name of his sister’s son, with ever-increasing scores. When someone told me he had first taken all of Hoeflin’s tests with rather low scores, then had some friendly correspondence with the person from the previous sentence, then took the tests again with the same scores as the highest scores of that person. When someone told me he had missed the Mega Society pass level by half a standard deviation, then retested and qualified. By “retest” I always mean “to take the same test again”. 

Several of those meant in the previous paragraph showed me their answers (unasked) and suggested I use them to get into the Mega Society. I had rarely been so shocked and insulted as by the suggestion that I would be capable of such fraud. I do not understand how those types can live with themselves. Having said that, two of them committed suicide in that period. 

And of course, such people publicly display or mention their highest (fraudulent) scores, not their honest scores. I remember a phone call with a Netherlandic Mensa member as if it was yesterday; “Yeah, the ‘Mega Test’, I am working on it with some people in Spain and east Asia. Yeah, we have it mostly figured out now, that ‘Mega Test’, ha ha ha…” This person killed himself not long thereafter. Not the “Beheaded Man”, incidentally; the history of I.Q. societies is riddled with suicides, and some of them appear to have made the right decision for once in doing so. That is one thing that gives hope then; that some can indeed not live with themselves in the end. 

Jacobsen: I received some decent points about high-range tests from Mahir Wu. Credit to him for the raw materials and permission to reframe those points as questions here. He raises foundational points. First point: item answers should be rigorously unique. Why? 

Cooijmans: If multiple answers to a problem are correct, this has disadvantages: The one answer may be easier to find than the other, so that candidates with the same credit may not really be of the same level because they found different answers. And candidates who see more than one possible answer may be confused and not know which is the “right” one. Also, there may be subjectivity in scoring those answers. With only one correct answer, these problems are mostly avoided. 

Of course, no matter how hard the test creator tries to make items with unique answers, once people start taking the test, it may sometimes occur that alternative valid answers are still found, and then one has to solve this, for instance by revising, replacing, or removing the item. Sometimes this can be done “in place”, especially early on when there have not been many submissions yet, and otherwise this may be done later in a revised version of the test. 

And, no matter how hard the test creator tries to make items with unique answers, there will always be people who “see” alternative answers through apophenia when they can not find the real answer. The apophenic delusions stick rigidly in their minds and they become convinced they have solved the problem, although the logical flaws are obvious to an objective observer. In popular artificial-intelligence speak, people “hallucinate” when unable to find the real solution. But that is inherent to intelligence testing; escaping this delusional rigidity is part of high intelligence, or rather, is an aspect of having a wide associative horizon. You will need that mental flexibility too when solving real-world problems. Sometimes, you have to take a step back and make a fresh start to eventually find the real solution. 

To illustrate that apophenic delusions are very real and persistent, I want to give some examples: I have a Test for Extrasensory Perception, which is exactly what it says. It is not an intelligence test, I did not hide any clues or patterns in it. Still, some years ago an otherwise normal person sent me a document of many pages, describing his decoding and solutions for it in long association chains. He was convinced he had found patterns that I had deliberately put there. Since this was explicitly not the case, this example proves that candidates may suffer from apophenic delusions all by themselves, and that this is not caused by ambiguity of the test items. 

And long ago someone published articles in I.Q. society journals, explaining how he had found references to the appearances of particular comets hidden in poems of certain literary authors. 

And another one produced a long series of essays, analysing the dates of events related to the Roman Catholic church by counting the days separating the events, finding numerical patterns therein, and concluding that the Vatican was conducting a dirty scheme that would culminate in some horrific project (I am not allowed to disclose it I think) of which he predicted the exact date in the near future. 

I am not giving these examples to ridicule people, but to show that such delusions can be extremely strong in apparently sane people. It occurs often when people take high-range tests. If it happens to you, rest assured, for only in a small minority of cases does it lead to full-blown psychosis. 

Jacobsen: Following from the previous question, the test item answers with ambiguity should be disallowed. Why should these not be allowed, if agreeing with Wu? If disagreeing with Wu, why? 

Cooijmans: I agree for the reasons given in my previous answer. But as said, sometimes you only discover ambiguity as test submissions are coming in, when studying comments by candidates. 

Jacobsen: Why should test items give sufficient clues for discovery and solution by a test-taker? 

Cooijmans: Because otherwise it is impossible to solve the items, obviously. 

Jacobsen: Following the last question, why would permission of a mere guessing logic spoil a test? 

Cooijmans: Because correct answers that result from guessing do not stem from the candidate’s mental ability being used. Such answers are random variance and thus reduce the test’s reliability, and therefore also its validity. Test items should be made so that the probability of getting them right by accident is so small that, on average, candidates will gain less than one raw score point in the total score by guessing. This does permit multiple-choice items, but they should be cleverly constructed so that the likelihood of a correct guess is very small. For instance, by letting the candidate choose several options from a list instead of just one. 
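
A small illustration of that arithmetic in Python, comparing an ordinary five-option item with a “choose several from a list” construction (the numbers are only examples):

from math import comb

items = 40

# Ordinary multiple choice: 1 correct option out of 5.
p_single = 1 / 5

# "Choose 3 of 10" item: only one specific combination of options is correct.
p_multi = 1 / comb(10, 3)   # 1/120

print(items * p_single)  # 8.0 raw points expected from blind guessing
print(items * p_multi)   # about 0.33 raw points expected, i.e. less than one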

Incidentally, I have heard people suggest that multiple-choice items that can be answered correctly by guessing reveal intuition and/or psychic ability, but even if that is true, I believe that intelligence tests should not measure intuition or psychic ability. I am also not a fan of penalties for wrong answers with multiple-choice; supposedly, this corrects for guessing, but of course, a candidate who chooses a wrong answer, thinking it is right and not guessing, is then penalized for the wrong reason. The penalty does not distinguish between guessing and being simply wrong, and in the latter case, no penalty should apply. For clarity, a penalty constitutes a negative item score, typically a subtraction of a fraction of a point, depending on the number of answering options for that item. 

An anecdotal experience regarding multiple-choice tests: Once, the instructions to one of my multiple-choice tests said, “There are no penalties for wrong answers”. After a while I removed that instruction because some candidates demanded a perfect score based on it. They took it as, “You will always get a perfect score no matter what you answer”. Of course they were wrong, because one starts out with zero points at the beginning of the test, not with the maximum, so “no penalties for wrong answers” in no way implies a perfect score. But if people are so willingly and stubbornly taking it the wrong way, I am not going to pain my brain trying to formulate it even better than it already was. 

Jacobsen: How can the sufficiency of each test item’s uniqueness become integrated into the overall test (even test schema) to prevent the identical pattern from emerging too much in a single test (or test schema)? 

Cooijmans: If I understand the question correctly, I would say that a test should consist of a broad diversity of items with mostly different patterns. They need not be all completely different; maybe two or three of a similar looking pattern are acceptable, provided the implementation of the pattern is different every time, so that the candidate is forced to recognize what is going on in each case. 

In some of my tests, like “Problems In Gentle Slopes of the first degree”, I had a series of about ten problems of the same kind in ascending difficulty, and that did not work well. Many candidates were able to solve all the problems in such a series. The items work as examples for each other and become too easy. So I concluded that it is better to have no more than one to three items of a similar kind in a test, and even those should differ sufficiently in implementation. 

Jacobsen: How can the inspiration from, even addition of, other authors’ test items degrade a test’s quality by giving test-takers clues to test items they otherwise could not have solved? 

Cooijmans: If a test contains an item that is similar to an item in another test by another author, the one item may function as an example to the other and thus make it easier. I have experienced a few times that a difficult problem in one of my tests appeared to have become much easier. Eventually, a candidate told me that a test by another test creator had a very similar but easier problem, and that made my difficult problem suddenly solvable. I then replaced that item. 

And, if a candidate is familiar with a particular item variant from other tests and is thus better able to solve such items, those items also lose their g loading for that candidate. It becomes a learnt skill, and learnt skills have no g loading. 

Jacobsen: Following from the previous question, what about the reduction of the references to specific test items used by other test authors? 

Cooijmans: If the question is about references inside a test to specific test items by other test authors, I am not aware of such references, possibly because I never look at tests by others. If such references exist, they probably help the candidate, which one may not want. But it is better not to have test items that resemble items by other authors altogether. 

Jacobsen: In some sense, is it truly difficult to avoid this issue of close-but-imperfect replication of test logic and design schema from one author by another author inspired by the former, especially as more high-range tests are constructed? Wu references his latest test, “[Mystery],” as an example of close adherence to this principle, where the evidentiary effects of others’ tests become hard to apply to it. Consequently, results for “[Mystery]” are submitted much less often. 

Cooijmans: With so many high-range tests in existence, it will be getting harder to avoid similarities between tests by different authors indeed. I myself never look at tests by others and create problems independently. In an earlier question about avoiding or minimizing test constructor bias I name some independent sources of inspiration. These do not include tests by others. One should never look at tests by others for inspiration for new test items! 

Jacobsen: Why should scale and norm not be overly subjective? Wu references T. Prousalis–link– and you–link, link. Also, why does a median score for many tests with a corresponding IQ of 145 (SD15), or higher, make little sense? 

Cooijmans: Norms should be objective and correct, otherwise they are not comparable between tests. A few possible causes of incorrect norms are the following: When a beginning test scorer starts out administering tests, initially one will only have reported scores by candidates to base the norms on. Unfortunately, many candidates are dishonest in reporting scores, leaving out lower scores and reporting the higher ones, or even reporting retests or fraudulent scores. This gives an upward bias, and the norms based thereon may be ridiculously too high, even 10 to 20 I.Q. points too high on average. In the longer term, this may sort itself out as one acquires more, and more true, data about the candidates’ scores. Theoretically, this could also be solved by different test constructors sharing their candidate data to thus make the candidates’ true scores on other tests available, but I believe this might be unethical and a violation of privacy. I know some test designers are currently publishing candidate scores online, but that too seems unethical, and also I do not know if that published data is trustworthy and am hesitant to use it. 

For information, a few test creators have sent me their complete data for a particular test of theirs, including candidate names, and I have scored a test by another author (Bill Bultas) myself in the past, so for those tests I have unbiased data. 

Another cause of incorrect norms is megalomania by the test creator. There exist authors who delusionally reckon themselves to be profoundly intelligent, but really have much lower I.Q.’s, typically in the 130s to 140s at most. So when they receive test submissions by people whom they perceive as being at roughly their level of understanding, they feel compelled to give out much too high I.Q. scores, otherwise they would have to admit to themselves they are not really as intelligent as they believe. 

A median of I.Q. 145 or higher is unrealistic. The high-range population is roughly the upper segment of the general population, cut off at about I.Q. 130. This is not a perfectly clean cut, but if it were, and for the sake of illustration, the following would necessarily be true: With a clean cut at 131 (98th centile) the median would be 135 (99th centile, so halfway between the cut and the top). With a clean cut at 135 (99th centile) the median would be 139 (99.5th centile). A median of 145 (99.87th centile) would imply a clean cut at 142 (99.74th centile). This is not consistent with the known population of high-range candidates; most of them are below 142, or at least I believe the evidence for that is more than sufficient. 
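
That arithmetic can be verified against a normal distribution with mean 100 and standard deviation 15, assuming a perfectly clean cut; a minimal sketch in Python:

from scipy.stats import norm

def median_above_cut(cut_percentile):
    # The median of the group above the cut lies at the percentile halfway
    # between the cut and 100 per cent of the full distribution.
    return norm.ppf((cut_percentile + 1.0) / 2, loc=100, scale=15)

print(round(median_above_cut(0.98)))    # about 135 for a clean cut at I.Q. 131
print(round(median_above_cut(0.99)))    # about 139 for a clean cut at I.Q. 135
print(round(median_above_cut(0.9974)))  # about 145 for a clean cut at I.Q. 142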

My experience is that the median of many high-range scores is almost always between 136 and 141. The fundamental cause of this, I think, is that only from the low to mid-130s onward are people interested in intellectual endeavours like taking difficult tests. Below that, it tapers off steeply. Above that, it tapers shallowly, and that shallow curve reflects the actual distribution of those high I.Q.’s in the general population. And this distribution is apparently such that the median of people wanting to take high-range tests ends up around 136-141. The mode is several points lower than the median, the mean several points higher. The mode probably represents the point from which onward the high-range distribution follows the general population distribution (upward). The mode is, more or less, the cut-off point meant in the previous paragraph. 

Jacobsen: The following are questions formulated based on input questions provided by Matthew Scillitani. What is the process of making preliminary norms before submissions have been given for a test? 

Cooijmans: If it is a fully new test and no data exists for its contents at all, I estimate the minimum raw scores that a Glia Society member and a Giga Society member, respectively, should obtain. So for each problem, I look at it and ask myself, “Should a Glia/Giga Society member be able to solve this?” Then I interpolate between those two scores, and extrapolate outward until I reach the edges of the test, where I taper with 5 protonorm points per raw score point. The edges are each sized half the square root of the total possible raw score range. 
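
A rough sketch of that procedure in Python; the anchor raw scores and protonorm values in the example call are invented placeholders, and only the shape of the interpolation and tapering is illustrated:

import math

def preliminary_protonorms(max_raw, glia_raw, glia_proto, giga_raw, giga_proto):
    # Slope between the two anchor points, in protonorm points per raw point.
    slope = (giga_proto - glia_proto) / (giga_raw - glia_raw)
    # Each tapered edge is half the square root of the raw score range wide.
    edge = 0.5 * math.sqrt(max_raw)
    lower, upper = edge, max_raw - edge

    def proto(raw):
        if raw < lower:   # lower edge: taper with 5 protonorm points per raw point
            return proto(lower) - 5 * (lower - raw)
        if raw > upper:   # upper edge: taper with 5 protonorm points per raw point
            return proto(upper) + 5 * (raw - upper)
        return glia_proto + slope * (raw - glia_raw)  # interpolate / extrapolate

    return {raw: round(proto(raw)) for raw in range(max_raw + 1)}

# Example call with invented figures only:
# preliminary_protonorms(max_raw=40, glia_raw=18, glia_proto=450, giga_raw=38, giga_proto=650)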

Jacobsen: There seems to be a stigma around high-range tests. Is there a process to normalize taking them or having them exist in the first place? 

Cooijmans: There are indeed many who do not take high-range tests seriously, and this includes prominent figures like the late Hans Eysenck. In one of his “test yourself” books, I remember, he was skeptical about the possibility of measuring intelligence in the high range, and even ridiculed it. He provided a number of absurdly complex problems “for the super-intelligent”, which appeared to be a parody of high-range testing. 

Much of the distrust and denial regarding high-range testing stems from the fear that one might not oneself belong to the most intelligent; it is comfortably reassuring to say to oneself, “Those tests are just puzzles by amateurs and their scores are meaningless, we can not measure intelligence beyond the 99th centile”. It is a way to protect one’s delusion that no one is verifiably smarter than oneself. 

Another cause of the stigma is the inescapable fact that there are fewer women than men in the high range. This is such a taboo that denying the validity of high-range testing is imperative to the politically correct academic, if only for that reason. 

A possible process to normalize high-range testing would be to establish it as a recognized branch of psychology at universities. I suspect this would require that we first reverse the decades-long neo-Marxist occupation of academia and make universities into places of genuine science practised by the most intelligent again. A concrete application of high-range psychometrics would be to devise proper admission procedures for universities to undo the dumbing-down that has taken place there over the past half century. The fact that the old Scholastic Aptitude Test and Graduate Record Examination were about the only mainstream tests with validity in the high range illustrates how appropriate high-range testing is in the context of college and university. 

For completeness, it should be mentioned that psychologist Lewis Terman (1877-1955) tried to measure intelligence in the high range with two forms of his “Concept Mastery Test”. These were applied to subjects selected as children based on childhood scores of 140 and higher, and followed up in adulthood with the two Concept Mastery Tests. These were verbal tests highly loaded on vocabulary, not permitting reference aids. In an unsupervised situation (which was and is how they are typically administered) it is exceedingly easy to cheat on such a test by using dictionaries and thus score absurdly far above one’s real level. Also, non-natives of the English language have a large disadvantage, on the order of 30 I.Q. points. So while these tests were non-robust against cheating and strongly culture-dependent, at least he tried. Since Terman has also been criticized for his belief in eugenics, heredity of intelligence, and racial differences therein, he forms an intersection between high-range psychometricians and hereditarians, so to speak. 

Having mentioned the Concept Mastery Tests, I should warn that the scores mostly quoted for them are raw scores, not I.Q.’s. Ronald K. Hoeflin has administered those tests for a while too, also unsupervised, so one should not rely too much on possible reported Concept Mastery scores from test candidates as they may be hugely inflated through fraud. 

Jacobsen: Have test construction and norming processes evolved in the aggregate for you? 

Cooijmans: Of course, when one has been doing something for decades, one has implemented improvements. If I have to give examples, I have become more concerned with locking in a unique answer and avoiding ambiguity and subjectivity in scoring, and I am also inclining more to having tests contain a surplus of difficult problems and a minority of easier ones. Regarding norming, one of the first things I learnt was that z-score equation – equating means and standard deviations – results in incorrect norms because raw test scores tend not to behave linearly, which is required for z-score equation to make sense. So I went with rank equation. Over the years I automated ever more of the process, so that now I can norm a test in 10 to 30 minutes mostly, while originally this took several whole days. 
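
A minimal sketch of the basic idea of rank equation in Python, with invented figures: the raw scores on the new test are paired, in rank order, with the same group of candidates’ known scores from other tests.

# Raw scores of five candidates on the new test, and the best estimates of
# those same candidates' I.Q.'s from other tests (all figures invented).
raw_scores = [12, 25, 18, 30, 22]
known_iqs  = [128, 152, 140, 160, 146]

# Pair the ranked raw scores with the ranked I.Q.'s; each pair is then a
# provisional norm point for the new test.
norm_points = list(zip(sorted(raw_scores), sorted(known_iqs)))
print(norm_points)  # [(12, 128), (18, 140), (22, 146), (25, 152), (30, 160)]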

I also learnt to formulate problems better to avoid misunderstanding. For instance, people skilled at mathematics may have a bizarre deformation that makes them interpret numbers differently from normal humans. If I say, “There are three apples on the table”, any sane person will understand that there are three apples on the table. But not mathematicians! The mathematician will understand that there are three OR MORE apples on the table. Because the mathematician thinks, “If there are four or five or six… apples on the table, there are three apples on the table too”. So to the mathematician you have to say, “There are exactly three apples on the table”. 

Jacobsen: What are the easiest and hardest parts of norming and constructing a test? 

Cooijmans: Easiest: Finishing off the eventual test once the problems have been conceived, and creating the database fields that will receive the incoming submissions. Also, norming is easy on the whole. Hardest: Creating the problems. This has become ever harder the more tests I have made. I try not to repeat myself too much, and try to take into account that the Internet as a search tool has become ever more powerful. The various types of fraud are hard to deal with. I have no sympathy or tolerance for the individuals behind them. The hardest nowadays is to create test problems that are robust against the developments that enable dishonest people to cheat. Those who have spread test answers should reveal the names of the recipients of the answers, so that we can clean up the statistics. And if they sold answers for money, they have to refund, and possible profit they made by investing the fraudulently acquired money should be donated to a good cause. 

Jacobsen: Of your tests – 51 in use and 57 retired – which ones are special to you? 

Cooijmans: To name a few, Test of the Beheaded Man, Cooijmans Intelligence Test (any form), Daedalus Test, The Nemesis Test, Test For Genius (any form), Only Idiots, The Gate, The Piper’s Test, Dicing with death, The Smell Test. Each in its own way demands that the candidate operate at the summit of cognition in ways that are not trivial but tie in to the essence of existence itself. That is what I have generally striven for. 

Jacobsen: In pre-2000, you wrote some articles in Netherlandic on test design. Are there any insights from those articles not replicated here or elsewhere worth replicating, or reiterating, here? 

Cooijmans: I looked through the articles, and the following points may be worth mentioning: 

Marilyn vos Savant occurs briefly in one article; she is known for having “the world’s highest I.Q.” according to the Guinness book of world records. I would like to add here that someone once showed me a copy of a page from Megarian No. 6 (October 1982) where her actual scores on the Stanford-Binet and preliminary Mega tests are reported. “Megarian” was the journal of the Mega Society then. 

Also nice is the early history of Mensa, as related by founder Victor Serebriakoff in one of his books, which was reviewed by David Gamon in the Mensa International Journal of January/February 1995. The founders at the time believed they were selecting members at the level of 1 in 3,000 (some sources say 1 in 6,000) but later discovered a mistake in the procedure, as a result of which they had been selecting at 1 in 50. Not wanting to send the bulk of members away again, they left it as it was. 

Also mentioned somewhere is Kevin Langdon, creator of the Langdon Adult Intelligence Test (1977, I think) and founder of the Four Sigma society. If one is interested in high-range psychometrics, the statistical reports published by Langdon in the 1970s and later are worth looking at. Langdon’s approach differed from Hoeflin’s in that Langdon first expressed the candidate’s performance as “scaled score” (some conversion of the raw performance) and then equated means and mean deviations of scaled scores and scores on other tests, resulting in a linear relation between I.Q. and scaled score. Hoeflin, on the other hand, normed raw scores directly via rank equation, resulting in a non-linear relation that reveals the non-linear nature of simple raw scores. 

This is a good time to explain there are different ways to arrive at a scaled score: The simplest way is to scale raw scores linearly from 0 to 100 or 0 to 1,000, for instance. Some test constructors have done that (Alan Aax and Rijk Griffioen, I remember) but it brings no advantage compared to raw scores; the non-linearity of raw scores remains, obviously, when the relation between raw and scaled scores is linear. 

If the goal is to obtain a more linear (intervallic) scale, there has to be some weighting or balancing, and a crude but solid method is to give a certain class of problems that appear harder or more important extra credit a priori, regardless of item statistics. This was done by Hoeflin with the Ultra Test, where non-verbal problems get two points. This is effective and without problems, but the resulting weighted scores are still far from linear, if one had any concerns about that. 

A more refined way is to give items individual weights based on item analysis. In theory this should yield an intervallic scale, but there are serious disadvantages: (1) A small number of problems tend to carry most of the weight after weighting thus, which is always dangerous; (2) It adds an extra layer of sampling error because one relies on the correctness of the item statistics, and my experience is that item statistics are not constant but differ from sample to sample, so that one is building on quicksand as it were; (3) The intuitive simplicity of a raw score is lost; the candidate can not know the number of correct answers from the weighted score. 

My preference is to use a simple raw score, or in cases where it seems appropriate a crude weighting that does not rely on item statistics, such as in the example of the Ultra Test. If these methods do not result in a meaningful ranking of candidates, that test is bad to begin with and no advanced item weighting will fix it. I accept that raw scores are non-linear, and the conversion to linearity takes place in the norming of raw score to I.Q. 

That last sentence leads to the question, “How do we know that I.Q. is a linear scale?” The answer is that I.Q.’s are deviation scores; they denote a distance to the mean in a hypothetical normal distribution. Note the word “hypothetical”; it is not claimed that intelligence follows a normal distribution in the physical reality. But the tacit assumption in statistics is that when a distribution is normal (Gaussian), its underlying scale is linear (intervallic). So when you force test scores into a normal distribution, you create a linear scale, or that is the unspoken idea. This is expressed in the way we identify points on the scale in terms such as “2 standard deviations above the mean”. This implies an underlying linear scale; after all, if the scale were not linear, the one standard deviation would not be the same as the other, so it would make no sense to say “2 standard deviations above the mean”! In fact, the mere computation of an arithmetic mean assumes an underlying intervallic scale, as it involves summation. 
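
As a small illustration of reading deviation I.Q.’s off that hypothetical normal scale (Python, mean 100 and S.D. 15):

from scipy.stats import norm

def rarity_to_iq(one_in_n, mean=100, sd=15):
    # Convert a rarity such as "one in a thousand" to a deviation I.Q.
    return mean + sd * norm.ppf(1 - 1 / one_in_n)

print(round(rarity_to_iq(50)))            # about 131, the 98th centile
print(round(rarity_to_iq(1000)))          # about 146
print(round(rarity_to_iq(1000000000)))    # about 190, roughly one in a billion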

So the bottom line is, if we take care that the frequencies of I.Q.’s beyond various points of the scale do not differ too much from their theoretical rarities in the normal distribution, we may assume that I.Q. is linear. I say this without claiming that deviation I.Q.’s are the best way to express intelligence; but I do not have a better way at the moment. 

Jacobsen: Some submitted questions anonymously. These are the adaptations of those questions: Personally, do you know any geniuses? If you do not know any personally, where are all of the geniuses? 

Cooijmans: I have to say that when it gets anonymous, the quality goes down. Imagine that I answered “no” to the first question! How insulting that would be to everyone I know! Since a genius is someone who exercises a lasting influence in any field, it can inherently only be known in hindsight who was one, often long after the genius’s death. It is quite possible that several people I know will turn out to be geniuses, but we do not know yet who they are. 

In history books you will find a lot of identified geniuses. 

Jacobsen: Why refer to these individuals in this way, i.e., as geniuses? What traits characterise them? 

Cooijmans: The word “genius” comes from the Latin “gignere”, meaning to conceive, to bring forth, to cause. Francis Galton used the word “eminence” for what is now mostly called genius. 

The traits of genius, according to me, are intelligence, conscientiousness, and a wide associative horizon. Genius is not talent. It requires talent, but talent alone does not suffice. One will need to apply that talent in order to make a lasting contribution. 

Jacobsen: Do you see yourself as a genius? If so, why? If not, why not? 

Cooijmans: Naturally, someone of my enormous modesty and humility would never call oneself a genius. I leave that to the scores of future generations who will devote their lives to the study of my work. 

Jacobsen: What do you think has been the contribution of your I.Q. Tests for the High Range? Is it a work for study by others or a hobby? 

Cooijmans: The contribution lies in studying the measurability of intelligence in the high range, and some other questions related to that as stated at https://iq-tests-for-the-high-range.com/mission.html. It is certainly worthy of being studied by others, and others should also undertake such study independently. It is not just a hobby, except in the sense that one can make one’s hobby into one’s work. 

Jacobsen: Who are others who you see like yourself in studying high ranges of intelligence? 

Cooijmans: This can only be answered properly for people who were (already) working longer ago, before the current generation of high-range testers. That would be Lewis Terman, Kevin Langdon, Ronald K. Hoeflin, Xavier Jouve, and Laurent Dubois. For the ones who came after these, it is too soon to judge their merit. 

In addition, there have been some people who created tests that looked truly good to me, but who only kept scoring their tests briefly and then withdrew from testing, so that little or no usable data resulted. These people exemplify what I said a few questions ago: that talent is not genius, but merely a requirement for genius. They had talent, but did not use it to make a lasting contribution to high-range testing. 

Jacobsen: What is the most common mistake people make when submitting feedback about your tests? 

Cooijmans: Assuming that they have understood a test item correctly, and then commenting on it from that assumption. 

Jacobsen: What aspects of people’s test feedback seem confusing? 

Cooijmans: It can be confusing if people send feedback before sending answers. I have to be careful not to help them by responding. Nevertheless, in case the feedback concerns a mistake in a test problem, it can be useful, especially when a test is very new. 

Jacobsen: The most common Marathon Test Numeric Section score is a perfect 44 out of 44. What lessons have you learned from this high-end score saturation? 

Cooijmans: That the problems are not hard enough. Also that a series of similar problems of increasing difficulty tends to be too easy on the whole. And that, to make a numerical test hard enough, either very difficult mathematics-biased problems are needed, or problems that implement a pattern that needs to be recognized. The latter seem the most fair, the former seem to give an advantage to people skilled at mathematics. 

Jacobsen: When creating high-range questions, is there a consideration of steering test takers toward wrong answers? Are extant questions ever modified in this way? 

Cooijmans: Obviously, steering test-takers toward wrong answers is the whole point of creating good test items, not only in high-range testing but also in mainstream intelligence tests. There is even a word for it: distractors. Multiple-choice tests, omnipresent in mainstream psychological testing, contain answering options that are wrong but appear more plausible than the intended correct answer. 

Thus, a candidate who really can not solve any problem at all will score below the chance guessing level, and this lower level is called the “pseudo-chance level”. For instance, if a test has 40 problems and 5 answering options per item, the chance guessing level is 8 correct, but the pseudo-chance level may be only 5 correct due to distractors. 
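
In Python, the chance level of that example is simply:

items, options = 40, 5

chance_level = items / options
print(chance_level)  # 8.0 correct expected from blind guessing

# With effective distractors, a candidate who cannot solve anything is drawn
# toward the wrong options and may end up around 5 correct; that lower figure
# is the pseudo-chance level.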

Extant questions are not generally modified in this way. 

Jacobsen: Which books or literature, even individual articles or academic papers, on psychometrics have provided helpful accurate understandings of psychological measurement, psychometric concepts, etc., for you? Others may find some fruitful plumbing there. 

Cooijmans: The most specific sources regarding high-range testing are the various statistical or norming reports by Kevin Langdon and Ronald K. Hoeflin, as issued by them in the 1970s through 1990s (Hoeflin only started in the 1980s, I think). These helped to see how high-range tests are normed, and also aided in the interpretation of scores on a lot of those old American tests like the Scholastic Aptitude Test, Miller Analogies Test, Army General Classification Test and more. The “Omni sample” of the Mega Test contains many scores of those versus Mega Test scores, and as such is an important anchor of high-range tests to the general population, especially so since those old tests in some cases did discriminate into the high range. This can not be said about the newer dumbed-down versions of the educational and military tests, whose validity tends to end at the 99th centile, and for whose interpretation one should consult the information provided by the relevant issuing organizations. 

What one can see for instance in the Omni sample is that the old G.R.E., S.A.T., and Army General Classification Test correlated quite well with the Mega Test, while on the other hand the Wechsler Adult Intelligence Scales and Stanford-Binet, by many regarded as the gold standard of I.Q. testing, appeared to lack any validity in the high range. This observation holds true until today in data collected by myself, except for the old Army General Classification Test on which I have almost no data. 

Then, an actual textbook on psychometrics I have studied is the Netherlandic “Testtheorie” by P.J.D. Drenth and K. Sijtsma from 1990. This covers both classical psychometrics and the newer item-response theory. 

Another useful book, a bit more general, is “Applied statistics for the behavioral sciences”, second edition, by Hinkle, Wiersma, and Jurs, from 1988. 

An important book on intelligence testing is The g Factor by Arthur Jensen, from 1998. While not intended as a psychometrics textbook, it does contain a lot of advanced information on psychometrics, including some factor analysis, often in the footnotes. 

As it happens, there is also an e-book called The g Factor by Christopher Brand, from 1996, also containing information on psychometrics and some factor analysis. 

A book on statistics in general (not psychometrics) I have studied is the Netherlandic Statistiek in de praktijk by David S. Moore and George P. McCabe. I see there is an English version too, Introduction to the Practice of Statistics, 2nd edition (1993). 

A book dealing specifically with multivariate statistics such as correlation, regression, and factor analysis is Using Multivariate Statistics (third edition) by Barbara G. Tabachnick and Linda S. Fidell (1996). 

I also still have my mathematics books from secondary school, one of which contains chapters on statistics and probability calculation. Occasionally I look through those to refresh these basics of my knowledge in this field. 

Finally I want to add that the history of statistics and of mathematics is informative regarding psychometrics. Reading about such will teach you that statistics has been closely related to psychological testing since the 19th century, and that probability calculation was developed for the purposes of gambling and insurance. 

The history of mathematics in general, found for instance in A Concise History of Mathematics by Dirk Jan Struik, tells us that mathematics originates in the early days of agriculture, cities, and large-scale administration. That is, within the past ten thousand years or so, the Holocene, after the last glacial period. Computing the area of parcels of land required mathematics. 

I suspect that the intelligence of the people coming out of the last glacial period was primarily of a visual-spatial nature, and as they became settled and practised agriculture, built cities, and administrated societies, they needed higher numerical ability as well as written language. I imagine that spoken language existed long before that, originally in the form of words without grammar some two million years ago to coordinate hunting in groups in early Homo, and later on with grammar, perhaps in the days of Homo sapiens. 

Language is not unique to humans incidentally, but exists in other beings too, such as birds, primates, and whales. Animals like crows are likely at the intelligence level of early Homo, but I am uncertain if their physicality will allow a further development such as has taken place in Homo. Key points like the manufacturing of tools and mastery of fire may require arms, hands, fingers, and thumbs such as humans have. 

Visual-spatial ability is also not restricted to humans, but found in many animal species, in particular to enable predation. As such, visual-spatial ability should be a few hundred million years old, as that is when the first predators appeared. 

The importance of this history of abilities is that when we test abilities now, the results we get, such as the intercorrelations of various abilities, are as it were a fossil record of this evolution. A development that I believe takes place in civilized societies is the erosion of the original visual-spatial ability in favour of verbal ability. A high level of verbal ability in the absence of the foundation of visual-spatial ability, I think, leads to dishonesty, deceit, evil, decadence, and societal collapse. 

Jacobsen: Thank you for the opportunity and your time, Paul. 

Cooijmans: I never know what to respond to here.

License & Copyright

In-Sight Publishing by Scott Douglas Jacobsen is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. ©Scott Douglas Jacobsen and In-Sight Publishing 2012-Present. Unauthorized use or duplication of material without express permission from Scott Douglas Jacobsen strictly prohibited, excerpts and links must use full credit to Scott Douglas Jacobsen and In-Sight Publishing with direction to the original content.
