“Let’s take the average.” “Let’s analyze the keyword density of the page.” I hear these words often enough. The important thing is that in most cases (and by that I do not mean 51%, but something close to 95%) these techniques are of absolutely no use.
Let us first talk about the average, or mean. When we take the mean and expect it to be useful, we usually assume (or should assume) that the data is roughly symmetric, without heavy skew or extreme values. Estimation with the average as the measure of central tendency breaks down when we encounter skewed data. In fact, most real-world data is skewed, apart from those textbook examples.
To bring home my point, let me tell you a joke about a statistician. It is said that there was a very tall statistician who had to cross a river with his family: a very short wife and 3 very small kids. He had to decide whether to cross the river or not. Being a mean guy (pun intended), he decided to take the mean, or average, of the heights of the whole family and compare it with the depth of the river. He found that the average height of his family just managed to top the depth of the river, so he decided to cross with his whole family. When he reached the other side, not surprisingly, he found that he was the only one who had made it across, and the rest of the family had drowned.
The same happens in real-world estimation. For skewed data, the average or mean is almost always a bad measure of central tendency. In fact, nature works more on what is called the Pareto distribution, or the 80-20 rule in layman’s parlance.
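The river-crossing story is easy to check with numbers. The sketch below uses made-up heights and a made-up river depth (all values in centimetres are assumptions chosen for illustration, not data from anywhere) and compares the mean with the median and with what actually happens to each family member.

```python
from statistics import mean, median

# Hypothetical heights in centimetres: a very tall statistician,
# a very short wife, and three very small kids.
heights_cm = [200, 100, 90, 80, 70]

# Hypothetical river depth in centimetres.
river_depth_cm = 105

avg = mean(heights_cm)    # pulled upward by the one tall person
mid = median(heights_cm)  # the middle value of the sorted heights

# Who can actually stand in the river with their head above water?
survivors = [h for h in heights_cm if h > river_depth_cm]

print("mean height:  ", avg)
print("median height:", mid)
print("mean tops the river depth:", avg > river_depth_cm)
print("people tall enough to cross:", len(survivors), "of", len(heights_cm))
```

With these numbers the mean clears the river depth while the median, and four of the five family members, do not. One large value is enough to make the mean say something that is true of almost nobody in the data.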
Now let us talk about word density. Let us ignore for a moment the fact that the term does not satisfy the rigor of a mathematical definition, and is more of a buzzword than something statistically useful. The general idea is to match as many keywords as possible pertaining to the supposed subject. Let us say you are manually looking for the page most relevant to the subject ‘apple computers’, and you have at your side a list of words pertaining to ‘apple computers’. You find one document that contains the words apple, steve, steve woz, steve jobs, mac, leopard and so on, and it matches 90% of your word list. What is your conclusion? I would definitely say that this document is NOT related to apple computers, but is actually spam. So a simplistic keyword density measure spews out spam after spam, and you are left wondering what is wrong. It is not just that the word density technique is very easy to game, but also that it is inherently a mismatch for the real-world situation. You don’t come across relevant documents with neatly placed word density. And to top it all, your list of relevant terms may not be complete, and is likely to give lots of false positives.
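To make the problem concrete, here is a minimal sketch of one naive reading of the idea: score a document by the fraction of the keyword list it contains. The function name `keyword_match_ratio`, the single-word keyword list, and both sample documents are hypothetical and invented for illustration; real keyword density tools may compute something different.

```python
import re

def keyword_match_ratio(text, keywords):
    """Fraction of the keyword list that appears at least once in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = [k for k in keywords if k in words]
    return len(hits) / len(keywords)

# Hypothetical word list for the subject 'apple computers'.
keywords = ["apple", "steve", "woz", "jobs", "mac", "leopard"]

# Hypothetical documents: one stuffed with keywords, one that reads
# like an ordinary sentence about the subject.
spam_doc = "apple steve woz jobs mac leopard apple mac cheap apple mac buy now"
genuine_doc = (
    "The new Mac ships with Leopard preinstalled, and Apple says "
    "the upgrade improves battery life."
)

spam_score = keyword_match_ratio(spam_doc, keywords)
genuine_score = keyword_match_ratio(genuine_doc, keywords)

print("spam document score:   ", spam_score)
print("genuine document score:", genuine_score)
```

The keyword-stuffed text matches the whole list, while the ordinary sentence matches only half of it, so ranking by this score puts the spam first. That is the gaming problem in miniature: the score rewards exactly the property that spam has and that natural writing lacks.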
I vote for banning these two terms, average/mean and ‘word density’, from technical exchanges so that we don’t fall into woolly thinking.