A New Approach to Statistics and Cluster Analysis

G.A.S.M - a Generic Adaptive Statistical Model

Based on Hierarchical Binary Partitioning of Sample Space

S.Roof, Jan 2006

This paper presents a simple generic methodology for doing computational statistical modeling that is free from many of the arbitrary assumptions often found in statistical analysis. This model does not make assumptions about the data conforming to a particular distribution or not. Instead it blends concepts from statistics and cluster analysis to create a dynamic adaptive cluster tree that shapes itself in real time as new data comes in. Likewise, it is also free of many of the limitations of cluster analysis because it is independent of arbitrary distance metrics and can also handle extremely large data sets with ease. Using N-dimensional datasets, minimal resources, and quick on its feet, this method has potential for use in artificial intelligence probability engines.

The main thesis of this paper is that every distribution is a distribution of distributions. Using this as our guiding principle, we shall develop a simple technique, which is essentially recursive in nature. To explain this technique will require a new approach to some old problems though, so we begin with an introduction to some of the perennial problems of conventional theory.

OUTLINE

1. Introduction
2. 3-Step Paradigm Shift
3. Every Datum is a Distribution
4. The Cluster Tree
5. Partitioning Sample Space
6. Virtual Data Window in Time
7. Getting Probability Output
8. Test Results
9. Views Into The Model
10. Summary

11. Extending and Using the model
12. Caveats, Flaws, and Technical Notes
13. Appendix1- Derivation of variance
14. Appendix 2 - Footnotes
15. Appendix 3 - Weighted Pseudo-Random Number Generator

 

INTRODUCTION

The basic problem with statistics is that there are too many numbers, too many assumptions, and too many experts who disagree with each other. When designing an artificial intelligence engine we need to work like naturally intelligent systems do. Animals and people, for example, do not wait for all the data to come in before processing it. They have the business of survival to attend to with no time to spare. Such intelligent systems must shape their expectations moment by moment, datum by datum and learn as they go. The model presented here can breathe new life into what many people regard as a dull and forbidding subject (myself included).

This paper explores a paradigm based on the idea of virtual internal representations that make few assumptions about the content or character of the data. It also acknowledges that data is fuzzy and subject to error and interpretation even at its rawest level. Recognizing that we cannot be free of this interpretive dilemma, we instead embrace it and reduce it to its lowest common denominator i.e. we will view raw data itself as a statistic or parameter. Once we make this leap, many doors open up and we become free to adjust the knobs on our virtual machinery.

We shall design a probability engine that is model-free at its core but universally open to arbitrary models at its access rim. It will be independent of proximity metrics so long as the semantics of nearness vs. distance are adhered to and its resolution is data driven along whatever natural bias prevails. Adapting to this bias the engine will restructure itself to minimize variance yet still be all filling across sample space. Built for speed and economy, the engine will only use resources it needs and will be able to assimilate new information without limit. This gives it a robust behavior that is always optimal with respect to itself and allows for graceful degradation.

Before jumping into designing such machinery, though, we need to explore the roots of this paradigm shift with a look at conventional theory.

The study of probability and statistics falls into two basic categories of approach. One is empirical and the other is inductive or logical. Empirical or statistical probability uses the concept of relative frequency of occurrence of events and is essentially a fact based or objective approach. Inductive probability in contrast deals more with logical propositions and degrees of rational belief in said propositions. Inductive probability is thus more of a subjective concept. Much confusion can occur when applying these ideas though because there is a great deal of overlap between them.

The author believes that the philosophy of relativism can be helpful here. This philosophy argues that everything one can imagine has both a subjective and an objective component. We can never separate the two completely. The extreme objectives of logical positivism and Cartesian reductionism eventually lead to pretzel logic and the uncertainty principle. Likewise, subjective inductive interpretations can lead to flights of fancy and dangling semantic referents. Somewhere between swatting a flies with a sledgehammer and missing the forest for the trees, there must lie an acceptable balance. The question is how to find this balance when doing statistics. A few examples are in order here.

The prime concept in statistics and empirical probability is the concept of relative frequency of events and the distribution thereof. The nominal method is to randomly sample some variable (make observations - hopefully unbiased) and construct what is called a frequency histogram. To do this though, requires assuming an artificial grid or number of intervals into which we will dump our frequency counts. The number of buckets we choose will affect the result we get. If we lump everything into one bucket all we get is just a total count of the number of observations made. We thus get minimal information out of our sampling. At the other extreme we might set up a million buckets and take say one hundred samples. Well, then it is unlikely that any bucket will have more than one sample in it, and the vast majority of buckets will be completely empty. Again, we are at a complete loss to make any interpretation.

When making statistical distinctions, the question then becomes how do we decide how fine our mesh should be and how many samples do we need to take? After one ponders these questions long enough, he may come to the realization that sample size and resolution are arbitrary. Real humans doing real statistics use their instincts, brains, and eyes to shape things within human scales. How can a computer program find a way to do this though? Does the reader really believe that sample size is anything more than an arbitrary window into the data? Throw in the resolution problem and one can begin to see that what you see depends on what you expect to see.

Here's another example. Suppose we are taking samples of some time series. Maybe we are even taking an election poll. How do we know when to stop taking samples and do our analysis. After the election is over? Traditional theory often says the more the better and then produces error estimates for arbitrary smaller sizes. How do we know that the thing we are measuring isn't drifting or exhibitng a trend though. Increasing the sample set will only smear our estimate of what is going on. Statisticians are aware of this problem and have ways of analyzing this sort of thing, but the point I'm making here is that only the data itself can drive the window of expectation or at least it should. Arbitrary assumptions are moot and can only be verified after the fact. And what do we compare our results to? How far back in time do we store data? How many resources do we have to do this?

A huge elaborate statistical engine that stores billions of observations and crunches them down into an astounding array of knowledge factoids is useless if it takes a thousand hours to recompile itself whenever a new snippet of data comes in the window. We need something better.

Here is one final example. Suppose we are measuring a bearing at a factory. Is this a reliable statistic? When we make an observation or take a measurement (I call this a datum) we acknowledge that this is our most objective piece of the business of doing statistics. This is 'just the facts ma'am', so to speak, but is it? Every measurement is subject to error and hard to repeat exactly, even under the best of circumstances. What is our tolerance for error? Every measurement is to some extent fuzzy and uncertain. The problem is that we can't really know the full extent of what we did without making more measurements and these too are error prone. After taking a number of measurements on something that we think should be constant, we finally give in and accept the awful truth. The truth is that every single raw data number we crunch is itself just a piece of a larger ensemble however virtual that ensemble may be.

Ok, now we are ready to take our first leap into a new paradigm. We will break this down into three steps:

3 STEP PARADIGM SHIFT

1) Every number, fact, or observation should be treated not as a thing, but as a representation of many things. I'll repeat this because it takes a bit of getting used to. Every statistic is itself a distribution. Still not there? Ok, every datum is virtual from the get go. It is already processed to some extent either by one's self or by external machination. Every raw datum can be considered to have a mean and variance all by itself. It's a statistician's nightmare. Every distribution is a distribution of distributions.

2) Sample set size and resolution of distinction making are both arbitrary. There is no magic window. There might not be a best or optimal fit to the data. You've already made some arbitrary choices when you picked your data set to begin with. How big is a population? The one you plan to observe, the one you did observe, or the one that went on doing it's thing while you sit here trying to decide which is which?

3) Ascend, ascend, ascend. Sounds good doesn't it? Top down or bottom up, what's your best method? Top down methodology (executive privilege without feedback?) can lead to blind alleys of distinction and no way to get from here to there. Bottom up methodology ( working without supervision? ) is laborious and complex. How much data do you need and how long are you going to keep it? Have any answers to those questions? How about a compromise. Suppose we start top down and do bottom up as we go. How about just stopping when we think we have enough data, and how about trusting the model we built long enough to actually throw out all the old data?

If you are experiencing a bit of impatience now, don't worry, just keep processing. We'll sort it out as we go. No special math skills beyond algebra are needed (although a little vector math is appreciated) and the whole model is pretty darn simple once it's grasped in its entirety. In fact, it's so ridiculously simple in a way that I'm hesitant to introduce it as being new.

Every Datum is a Distribution

( Every Distribution is a Distribution of Distributions )

What is a datum? We shall be using the term to mean any vector of numbers that the modeling engine can interface or access. The numbers have to come from somewhere though. Either a human creates those numbers or some sort of device or machine does. Manmade devices for capturing data have to digitize the information. Humans have to perform some sort of operation also to produce machine-readable numbers. Measurements have to be taken, things counted, or impressions of value recorded. The point is that numbers are what computers deal with and numbers represent the result of some interpretive process at their point of origin. This may range from digitized microphone voltages to human prepared 0-10 ratings of some emotional or perceptual value. Numbers therefore are descriptions of what they represent. Numbers appear to be absolutes but they never are. The meaning and value depends on context, scale, and reference origin.

Given that we do indeed have a source of numbers, it is convenient to encode them as n-dimensional vectors. We want to use vectors because we might want to say let each data point be one person's height, weight, education level etc. Each separate number is a component of the whole vector person point so to speak. The model we shall develop is not limited to any particular dimensionality.

When we use the word mean in this paper, it will be understood that it refers to an n-dim vector. Variance and frequency, however, should be understood as simple singular scalar numbers (i.e. 1-dim vectors)

The very heart of statistics is the concept of mean and variance. The world around us clumps into groupings of things. Things repeat and happen more often than other things. The mean or average conveys the idea of where a group is located i.e. where it's general center is. The variance or standard deviation ( square root of variance) is a measure of how spread out this group is. These two metrics are the prime core of statistics theory. Associated with these two metrics is the number of members in a group or sample and the idea of relative frequency when this number is compared to a larger population of samples. Frequency in statistics just means the number of items counted. If you want to think about probability at this point, it is ok, but please don't make the mistake of thinking that probability is something that resides in things or the outside world like it were a motivating force for what happens. If one persists in pursuing an attempt to objectify probability, it is my opinion that one should consider external probabilities as resulting from physical constraints on systems rather than as causal factors. The above three metrics will be the only quantities our model will track other than a few housekeeping details. Like I said, the model is going to be simple.

When I say metric, I mean it in the sense that we measured something and then computed a descriptive number that represents some concept we have about our measurement. In this sense, a metric is an abstraction and the result of some computation. Now we can take this whole process further and come up with the idea of a parameter. A parameter is not something we measure. It is something we adjust. Now suppose we calculate a metric and along the way we decide to adjust part of the calculation with a parameter. Consider it like knobs on a radio. We can vary the volume by turning the volume parameter. This parameter adjusts an internal voltage bias to the amplifier and thus the computation that it makes on the raw input signal. The metric established at the output terminal thus becomes adjustable, even while it remains a more or less faithful replica of the input signal to the radio. This in a nutshell is what we want to do with our statistics data - parameterize our metrics.

We have 3 metrics now we need to define mathematically: mean, variance and frequency. The arithmetic mean of a set of values is the quantity commonly called "the" mean or the average.

Given a set of samples {xi}, the arithmetic mean is

If the underlying distribution is not known, then the sample variance is

where x^_is the sample mean. Frequency as discussed above is then just the variable 'N' in the above equations.

The above definitions can be studied in more depth at: http://mathworld.wolfram.com/ArithmeticMean.html http://mathworld.wolfram.com/Variance.html

Now we are ready to start building the adaptive part of our model. From now on I will be using 'm' to stand for since its easier to type. Also note that m will be in bold type to indicate it is a vector. Scalars will be in regular type. I occasionally stray from this protocol for simplicity of description, but understand that the computer model when coded actually uses vectors.

Given a set of samples { X1, X2, X3...Xn } we can easily calculate the mean and variance but what if there are thousands of data points. The problem with statistics as stated at the beginning of this paper is that there are too many numbers. What if we have done this laborious number crunching and a new datum comes in? Do we have to start the whole process all over again to accommodate the new information? It turns out that there is a better way.

Note that using the definition for mean m = sum of Xi/n. Multiplying by n we have the sum of Xi = nm. Now suppose we add a new value of X to the sample set. The new sum of Xi is then just nm + X. Divide the whole thing by the new frequency n+1 and we have the new value of m which we'll call m':

new mean m' = (nm + X) / n+1

Bingo! We have a way to update our mean without doing the whole summation all over again. All we have to do is keep the numbers m and n and we can keep a running tally for the whole population as new data comes in. There is an even more general formulation that our adaptive algorithm is going to use:

new mean m' = (n1m1 + n2m2)/(n1+n2)

This form covers the case where we wish to add two sample sets together and compute the mean. The derivation is easy. n1m1 is just the sum of all the Xi for sample set one and n2m2 is the sum of Xi for set two. The total N count is n1 + n2 therefore we simply divide n1m1 + n2m2 by n1+2 to get the mean for both sets considered as one group.

Now the above was so easy that we are encouraged to see if we can do the same for our variance metric. It turns out that we can. The derivation is a bit more complicated but the final form is still pretty easy to digest:

Naming s' as the variance of two sample sets combined, we have

s' = (1/(n1+n2))[n1(s1+m1^2) + n2(s2+m2^2)] - m'^2

see appendix 1 for the derivation.

That's it! We have found a way to update m and s given any new data coming in and as a bonus we can combine any two sample sets to get a value of m and s and n for the total. Now one may wonder what all the hullabaloo is about. The point is that the algorithm can track n,m,and s for millions of points just by keeping updated as each new point is added. But there is more. Recall our initial paradigm statement that every datum is a distribution.

What? Yes. Consider that if we have a new datum x, we might just as well say that it represents a mean m because measurements and observations are error prone and difficult to repeat. We might just as well go ahead and bite the bullet. Call it a distribution. But what is its variance. Well, n obviously equals 1 so therefore s = sum of (x-m)^2/1 = 0 because x = m.

What we have done is to conceptually describe our sample measurement x as a distribution with three metrics i.e. n=1,m=x, and s=0. I call this a virtualization of the data. We have moved from raw data to meta-data or description of data. To fully grasp this is to realize that we have just conceptually enabled for ourselves a recursive description of distributions without having to actually cook up any models of said distributions. All we are going to track is n,m, and s. The distributions our model is going to deal with are unknown, and the model itself doesn't care. The idea is to create a hierarchy of distributions of distributions to represent our final overall distribution.

All we have to do is input new datum one at a time and our algorithm will adaptively reshape n,m, and s to keep up with the score. But isn't the object of this model to come up with a simple way to actually describe distributions with more than three numbers? Well, yes. That's where the recursive part comes in. In other words we can conceptually break any distribution of frequency samples down into smaller subsets or groups with each having its own set of 3 metrics - n, m, and s. But remember we are talking about an adaptive algorithm. It is actually going to throw away all those initial Xs, which we used to add up etc. The problem then becomes, just how do we keep subsets organized when we don't even have the total set anymore.

The real beauty of the model now comes into play. We can start a hierarchical data tree and put our sample-subset-numbers in each branch or node of the tree. The problem remains of how to manage these subsets, but it is easy once grasped in its simplicity. This brings us to the study of cluster analysis but don't worry, we are actually going to sidestep the difficulties of that subject with some simple methods.

One more note. The frequency count n as mentioned above can be parameterized. Step 2 of our paradigm was that n or sample set size is arbitrary. We haven't developed the model sufficiently yet to explain why, so first we shall go ahead and plunge into what a cluster tree is and how our model is going to use it. Hang in there and we will let n go virtual eventually.

The Cluster Tree

If one wants a good overview of cluster theory and different approaches, he can check out the following sites. They will give you an idea of how my method is a bit different.

http://www.statsoft.com/textbook/stcluan.html
http://en.wikipedia.org/wiki/Cluster_analysis

The basic idea of our model is to start a cluster tree with the very first piece of data we receive and then add new datums to the tree one by one and let the tree grow new nodes to handle the new data. Hopefully, the final tree will model whatever distribution the data set suggests and it will do it structurally so that clusters of data show up as higher level nodes in the tree. We can then simply walk down the branches and enumerate the clusters with each having its own set of statistics describing the overall center, spread, and weight or frequency count.

The rules we will follow are:

1) Every datum has to be put on the tree next to the node that it is closest to in value. This preserves the semantics of nearness and keeps the tree ordered. The left child of any node has to be less than the parent node and right child node. The right child node has to be greater than the parent node and left child.

2) Raw datums have to end up at the end or leaf points on a tree. This equates to optimizing the discrimination power of the tree and also keeps the updating process ordered.

3) When combining two pieces of data we join them together with a higher node that merges their separate statistics into one. The higher a node is on the tree the more generalized the sample group its statistics represent.

4) The root node represents the overall totality of whatever distribution the model is capturing.

5) All the statistics have to add up correctly.

Ok, let's build a data tree. Assume we have some data set or stream of numbers that we take one by one and insert into the tree

imagine the following stream of data numbers ... 3 5 1 9 2 4 1 ... etc. We are going to start a tree from scratch, one number at a time starting with the first number 3.

To add a new piece of data to our tree, we walk down the tree comparing the data with both branches at each node going down and choose the closest branch. When we get to an end or leaf point, we drop our data, make a new higher node, join the new datum with its closest neighbor, and then work our way back up the tree to the root updating our statistics as we go.

FIG.1 below shows how our tree grows with each new piece of data added. Black arrows show the next number to be added at each step. The numbers in red show the sample or frequency count for the virtual nodes. Virtual nodes are shown as circles. The raw datum leaf nodes are squares. Each node, which joins the two below it, is their mean or average. The mean, variances, and frequency numbers are stored in each node as its description or internal state.

The tree algorithm is pretty simple to implement and allows fast access times because only one path through the tree needs be accessed to find a node or add a new one. With each addition the stats for higher nodes on the current path all have to be updated or recalculated but this is efficiently done and doesn't require visiting every node in the whole tree.

This partitioning of the updating problem into a least path means tremendous savings in computation time. Suppose one has a probability table with 1024 slots in it. Ordinarily, to update it will require 1024 adjustments. A full balanced tree with only 11 levels, though, has 1024 leaf nodes and only requires 11 steps down and 11 back up to update itself.

Now one can argue that a straight frequency table would actually only need to increment one slot, and the probability can then be calculated only for that slot, but the point is that the tree structure carries a lot of information about how the data is distributed, so we are way ahead of the game efficiency wise. Each higher node in the tree represents a grouping or cluster of data points but only in an abstract sense. The raw data points themselves are only stored at the bottom leaf nodes and this presents an artful solution to a difficult problem. Setting a limit of how many raw data points we wish to keep easily solves the question of resolution and resources.

The 5th addition to the tree ended up with a 1 and a 2 at the bottom tied into a virtual node with mean = 1.5. But what if we wanted to limit ourselves to only 4 nodes. Easily done. Just calculate the new 1.5 virtual node and drop the 1 and 2 nodes. Compare FIG.2 below with FIG.1. The virtual node now replaces the original raw data node and our numbers up the tree haven't changed. The tree has now become totally virtual and we can add to it add infinitum without abusing our resources. The tree will constantly update itself correctly within the limits of precision of our floating point math numbers and will always be a true representation i.e. the numbers will always add up to what we would have gotten if we did it all as one flat calculation on a set of numbers all at the same time.

So now our model has a way to partition sample space into optimal varying resolution that is data driven. It also dynamically updates its internal statistics using a least path approach. The tree stays ordered at all times and is robust even if we trim nodes from the bottom. It just looses a little resolution in the process. The variance and mean of each node will shift as new data comes so the tree keeps an optimal footprint on sample space. Looking at the lowest level of our tree pictured above, we see that the leaf nodes across the bottom level break the sample space down into unequal spacing between points. The 1.5 and 3 node are closer to each other than the 5 and 9 node. The closer spacing tends to occur under groups with higher sample counts. This is not an absolute given initially and depends early on with how the specific data lies, but as more and more points are added the means draw together and the variances diminish. This means that the tree in effect increases its resolution where needed at the bottom while maintaining its more abstract cluster structure at the top.

The data-driven partitioning of sample space is a crucial enhancement because we overcome the limitations of how fine to make our grid or mesh resolution for grouping data into groups. The data itself determines how close the distinctions are. As mentioned in the introduction, this was a problem with the standard histogram frequency tables, which arbitrarily assign a fixed set of intervals or buckets to put the data into.

Since the access down the tree depends on a distance metric for doing comparisons, we might be concerned about the method used. My model simply chooses the standard Euclidean distance measure which is the square root of the sum of the squares of the difference between dimensional components of the datum and the node mean vectors. We are only interested in which branch is closer though, so we more simply use the square of the distance. In addition, the tree really doesn't depend on which type of metric we use so long as it can establish the closer of two comparisons.

Since our method of forming clusters does not require iterating across every member of a cluster or with respect to some other cluster etc., we have a much faster methodology than some conventional techniques. No centroid or linkage measures are needed to identify membership in a cluster. We just walk down the tree and drop the data in its proper bucket. Thus we have also partitioned the computations optimally as well as the sample space.

The methodology is also something of a mixture of top-down and bottom-up approaches. Conventional techniques tend to use either a divisive top down approach or an additive bottom-up approach. Sticking to their strategy they iterate across all the data to determine membership. My method uses top-down comparisons to get location and then does bottom-up gluing together to ascend one step. The tree sort of jacks itself up on top of the data as it grows.

Partitioning Sample Space

Sample space is the domain across which the data samples or random variable lies. As mentioned in our introduction, a conventional histogram arbitrarily divides this domain into equal intervals that may or may not be most appropriate. What we want is a method that allows the data itself to partition the intervals it lies within. The method described above is based upon a distance metric between the input datum vector x and the vector mean m at any node where it is being compared. If |x - m1| for the left child node is less than |x - m2| for the right child node, then we go left. Otherwise we go right. The algorithm thus seeks the branch of the tree that the datum is closest to. It continues this process until it is all the way to the closest leaf at the very end of a string of branches.

Now it might be sensible to think that since our adaptive update algorithm uses a weighted mean, we might want to weight our distinctions going down the tree, but this doesn't work as well and can be shown to lead to a conundrum where equipartitioning picks one path but the weighted method picks the other. The weighted method tends to equalize the clusters and spread them out, while equipartitioning sharpens the clusters.

The net effect of our model's strategy is that clusters tend to minimize their variance and remain true to the underlying distribution. The model equipartitions the sample space at the lowest level. It is only when going back up the tree, that the means embrace and shape the final cluster. The algorithm's nearness distinction equipartitions the sample space locally, but under the umbrella of the higher weighted means, it squeezes and moves the means around as the data changes. The footprint on sample space has varying global resolution, but locally at any insertion point, the modus operandi is to pick the midpoint and make a choice as to which side the datum belongs on. Once inserted, though, the data point may be closer or farther from its neighbors as determined by its value.

This equipartitioning is local and relative to the resolution of the nearest neighbors. Across the whole space it will vary as the data points vary. We thus let the data itself bias the partitioning or resolution from point to point. Hopefully the drawing below will help make this clearer.

What we see above is two bottom leaf nodes lying in their positions on sample space but their frequency counts are different so the net mean above is weighted toward the left node making the true mean 2.0 instead of 2.5 which is the simple midpoint between m1=1.0 and m2=4.0. Now if we add a new datum x=2.3 it goes to the left of 2.5 and will make n1 = 3 and m1 = ((2)(1) + 2.3)/3 = 1.4333. The new mean at m1 thus shifts right a bit and the m' above will have to be updated also.

If we had chosen the right node because 2.3 > 2 which is the weighted mean, then the m2 node would have shifted left instead and the two clusters would be more equalized in frequency counts losing the essential distinction between clusters as the tree builds. I have actually tried using that method and it works but the final distribution of the whole tree is smoothed and doesn't fit the data as well.

Below is a partial data dump of leaf nodes from the 5th layer of a distribution shown later in the test results section.

L00003) D05-001:R n:124.00000 s:0.00123 M[0]= -1.167018
L00004) D05-002:L n:177.00000 s:0.00071 M[0]= -1.080486
L00005) D05-004:L n:353.00000 s:0.00055 M[0]= -0.940918
L00006) D05-005:R n:564.00000 s:0.00040 M[0]= -0.866599
L00007) D05-009:R n:220.00000 s:0.00030 M[0]= -0.311818
L00008) D05-010:L n:295.00000 s:0.00041 M[0]= -0.248781
L00009) D05-013:R n:558.00000 s:0.00049 M[0]= -0.047623
L00010) D05-016:L n:616.00000 s:0.00033 M[0]= 0.129478
L00011) D05-019:R n:420.00000 s:0.00032 M[0]= 0.316722

Notice that the means ( labeled as M[0] ) stay ordered but the spacing varies. Also look at the frequency numbers ( labeled n: ). This data represents the tree having adapted to 10,000 data points and the final resolutions on sample space are evident. We could have achieved higher resolution with more nodes but the data itself would determine where the intervals are smallest. Notice the peak in frequency at node L00010) D05-016:L where n=616. It also has a low variance s=0.00033 and thus represents a relatively significant cluster at the leaf level. Higher up in the tree of course this will have an impact on the more general cluster shape.

One last note before proceeding. The above illustrations have somewhat obscured the fact that, in the algorithm, the means are vectors and not just simple numbers. The updating of means and variances is linear vector math though and unaffected by this higher dimensionality. The partitioning distinction, though, requires computing distances in vector space, so the simple sample space diagram above is somewhat misleading. See the partitioning problem section for a discussion of the partitioning problem. Also you can look at actual partitioned outputs in the test results section.

Virtual Data Window in Time

Now we are in a position to elucidate our earlier comments about the sample count or frequency being virtual also. So far we have simply added counts and stored those numbers in each cluster node. We tend to think of those numbers as whole integers and thus inviolate. This is not necessary though. Notice that our equation for updating the mean is linear with each sample n so we can divide numerator and denominator of our equation by any number D without changing the mean.

mean m' = (n1m1 + n2m2)/(n1+n2) = [n1m1/D + n2m2/D]/[(n1+n2)/D]

We have to be careful at this point though. The entire tree will have to have all its sample counts thus divided (or multiplied) to stay correct. There is an efficiency issue with having to update every node on a tree, but the point of this exercise is to demonstrate that sample size is arbitrary in one sense. It has a very real effect, though, on how a tree responds to new data. Lowering the sample size equates to narrowing the temporal window on which you wish to focus your statistical attention. The tree becomes more responsive and each new datum added has a bigger impact on how the means and variances shift.

Similarly, increasing the sample size slows the tree's responsiveness to new data and gives a more long-term analysis. Hopefully one can now see that we can virtualize our frequency numbers and gain another control knob on our model's behavioral characteristics. Philosophically, this involves a shift in the normal paradigm (at least for me it did) and forces one to concede that any statistical analysis is relative to the data set it chooses to consider. This appears obvious perhaps but it also is easily forgotten when the final results are displayed after lengthy analysis. If our model is developed and extended to real life AI problems, we would like to keep the above technique in mind as a way for a model to shift its focus dynamically in time as well as spatially across sample space.

The problem of exactly how to actually virtualize the sample size still remains, though, so we need to analyze this a bit further. A digression into how we would update a conventional flat frequency table will help us understand how to proceed.

Suppose we set up a frequency table with an arbitrary number of slots and begin adding sample data. Given a set of samples {xi} we can establish after some period of sampling the relative frequencies for each slot centered on a particular xi. The relative frequency is given by f(xi) = n(xi)/N where n(xi) is the number in the i'th slot and N is the total count for all slots.

Now suppose we add a new datum at the i'th slot. The new relative frequency will be f '(xi) = (n(xi) + 1)/(N+1) but since n(xi) = f(xi)*N we thus have after a little rearranging of terms:

f '(xi) = (N/N+1)f(xi) + 1/(N+1)

The above form shows how to update relative frequencies with each addition and it is all based on the total count N. This method is actually impractical though. Each and every slot would have to be updated for each addition because relative frequencies all have to add up to unity. It is much more efficient to simply keep the actual counts n(xi) and divide by N when a relative frequency needs to be accessed. The above form does have a use for this discussion, however, by providing a conceptual tool for us to explore. Notice that the update to the new value f '(xi) involves a (N/N+1) correction factor to the old f(xi) plus an additional 1/N+1 which represents the relative frequency introduced by the new sample itself.

Ok, here's the problem - If one keeps adding to the table, N just keeps getting bigger and bigger. Since the impact of each new sample is given by 1/(N+1), we can easily see that the relative influence on the whole from each new input is progressively less and less. This is natural and obvious in one sense but profoundly disturbing in another. Wouldn't we like all our samples to be treated equally without bias? Isn't that why statisticians are so fond of the idea of a randomly sampled variable?

Well, the situation just gets worse from there. What happens when N gets really big? Imagine going through life without ever sleeping. Would not each moment of each day become progressively less and less significant until your whole life seems to blur into one static monolith of piled up experience? Maybe we need to sleep and cleanse the system so we can wake up refreshed and experience each moment as significant within the context of that new time period.

Imagine entering a business and finding to your joy that you are the only customer. You get excellent service and are soon on your way. Now think about what happens when you walk in and find that you are just one person in a crowd of people being randomly served. Yours odds of satisfaction suddenly take a nosedive. You have just entered the 1/(N+1) zone! But wouldn't it be nice if large organizations could somehow keep the same responsiveness they had when they were small? The law of large numbers seems insurmountable and the only quick fix seems to be throw more resources at the problem. More resources creates more managerial headaches, though, and eventually things still grind to a halt. Ever wonder why libraries have book sales?

Living systems seem to be able to regenerate themselves and remain adaptive. Their responsiveness appears to be geared around small temporal windows enabling them to quickly adapt to changing situations. They don't call it the 'quick and the dead' for nothing! This observation was the impetus for my idea of building a statistical modeling engine, that has a fluid and shifting sample window through which it makes its estimates of what the data is doing.

All was well and good with this idea. I began constructing a binary tree model keeping tabs on statistics like frequency n(xi), mean, and variance. The mean and variance after some work turned out to be easily virtualized into the hierarchy without trouble and it was possible to do a least path single trip up and down to access data and perform updates without having to recalculate the whole tree. The frequency counts and the total N remained problematic though. It turns out that since N is a global variable distributed across the whole tree, there didn't seem to be any way to arbitrarily fix N to some specific sample size (call it S) in order to establish a given temporal window on sample space without recalculation every node in the tree every time a new datum was added.

A simple multiplication or division factor applied across every node in the tree was an obvious but undesirable solution. I wanted my model to be real-time adaptive and quick on its feet. There had to be a better way. Perhaps one could do this sort of dividing N downward only periodically as N got bigger than S. Dividing the tree frequencies down to some value less than S would keep all the relative frequencies intact but we would periodically bounce around our collective sample window S. It is either that or do it each and every input, or so I thought. After a periodic shakedown, initial samples would have a larger effect than the ones at the trailing period just before the next shakedown. We thus end up with a situation better than cleansing after each new input, but we introduce a bias back into our calculations.

This problem had me stymied for a while. Living systems do indeed seem to react more responsively immediately after each downtime period and then gradually become less responsive, but since I was building an idealized system, I wanted something better. The idea still seemed to have some merit though. Finally I found a solution of sorts. What if we just fool the computational model into thinking it was adding data at the fixed level of significance desired? In other words, how could we add our datum so that its relative importance was 1/(S+1) instead of 1/(N+1) but still keep the real count N distributed across the tree intact? Here's the solution:

Going back to f '(xi) = (n(xi) + 1)/(N+1) we see that the plus one is where the frequency count for one additional sample gets added in. What if we use N/S instead? We then have:

f '(xi) = (n(xi) + N/S)/(N+N/S) = ( f(xi)N + N/S )/(N+N/S) Multiplying by S and

dividing by N we then end up with:

f '(xi) = (S/S+1)f(xi) + 1/(S+1)

This has the same exact form as our original relative frequency update equation. This is exactly what we were looking for. Now we have a way to arbitrarily keep the impact of each new datum fixed at a constant 1/(S+1) relative degree of significance. No bias! Each sample gets treated exactly the same now, and we can adjust S as desired to broaden or narrow our focus in time.

There is a hidden catch to this though. We are still tracking an absolute value for each n(x) in our internal representations within the model. All we are doing with S is pretending that we are adding N/S samples of that xi instead of just one sample. This has a profound consequence to the internal representations. As N keeps getting bigger and bigger, N/S has to get bigger also just to keep the same level of significance going. At some point, the binary representations of our numbers will overflow. It appears we are still trapped within the same problem. Suppose we do have to do those periodic cleansings. Maybe they don't have to be done so often. At least we will have solved the input bias problem.

Well we aren't home yet. It turns out that the numbers using this scheme will expand exponentially. Let's look at this analytically.

if we add a frequency count of N/S to our total with each new sample, we have a new total count N' = N + N/S. We can rearrange this to N' = N(S+1)/S. Now imagine we start out with N = 1 and starting adding samples. Each new total N will then be a factor of ((S+1)/S) times the previous N. This is a simple product which we can express exponentially as:

N = ((S+1)/S)^I where I is the number of inputs.

Solving for I we can determine how many inputs we can accept before N grows to some number which exceeds our resources.

Calling N as Nmax and taking the log2 of both sides we have:

log2Nmax = I log2((S+1)/S) and

I = log2Nmax / log2((S+1)/S)

Lets look at some numbers. Curiously if we divide I by S and get a ratio of inputs before refractory cleansing compared to sample window size, we end up with an interesting result:

I/S = log2Nmax / S log2((S+1)/S)

The denominator S log2((S+1)/S) appears to have a limit value of around 0.434 it converges very rapidly from 0.301 for S=1 to .434 for S = 100,000. Our ratio thus seems to hinge mainly on log2Nmax, which is just the number of information bits required to represent our total frequency counts. This means that we now have a way to calculate our refractory period based on resources available and relative to our sample window. By the way, when we have a fixed Nmax, I/S is approximately a fixed ratio ( about 22 with the value I use ). This means that every I/S number of window periods we need a refresh.

It would be nice to break this down into time intervals as well so:

if R is our sample rate and t is the time to take I input samples we have I = Rt

so Rt = log2Nmax / log2((S+1)/S) and

t = log2Nmax / R*log2((S+1)/S)

This is the length of time that our system can accept inputs before having to be refreshed, and it will accept those inputs without bias i.e. with a constant significance level of 1/(S+1).

Now we have a way to virtualize our sample set size. We just add each new datum as N/S datums. The relative impact to the system total will then always be 1/(S+1) and our samples are unbiased. Our total count will eventually add up to Nmax, though, and then we need to do a refractory division of all the node frequencies so that the total count boils back to 1.0. The tree remains with all its relative frequencies intact and proportioned exactly the same (within limits of floating point precision) and the model is ready to continue as before.

Our model now has another knob we can adjust at will to suit our needs. If we aren't exactly sure what those needs are, we can just set our sample set size very large (S = Nmax) and we will, at worst, be back where we were before virtualizing S. As a bonus, we have uncovered a mechanism for ensuring our tree doesn't exceed its resource bounds.

 

Getting Probability Output

Our model is now almost complete except we have not described how we use it or get information out of it. If one is merely interested in the major clusters and where they are located, it is easy to just walk down the tree and look at them to with increasing level of resolution as desired. A dendrogram graphic can be plotted if a visual result is desired. Of main interest here, though, is creating a probability engine that can specifically generate a probability value for any arbitrary datum we give it. The problem then becomes how to do this.

It is tempting to just take the closest neighbor at the bottom of the tree and output its relative frequency. That's the way a conventional frequency histogram type of distribution would work. This model doesn't have fixed intervals on sample space though, so if the datum falls in a high resolution (small interval) area it will produce a low or almost zero relative frequency (statistical probability) number. And this contradicts the fact that it may lie within a larger group or cluster with perhaps a very high relative frequency. The solution I came up with was to acknowledge that the actual probability is distributed across contributions from each level in the tree. Therefore the model starts at the bottom and walks back up to the root while adding the contribution from each cluster node in its path.

But what is that contribution from each node? The datum may be closer or farther from each node's mean so how does that affect its probability. Well, with conventional analysis, we access our distribution model such as normal gaussian, binomial, Poisson or whatever and calculate the drop off in probability density for a given distance from the mean. We have assumed no distributions in our model though. That was supposed to be its generic bonus. Ok, here's the answer. We just use any model we choose. The whole idea is similar to wavelet analysis and Fourier theory. We're using it in a different kind of domain though. In fact the model actually does work with no model at all. Simply returning the value of 1.0 for the distribution density function gives a passable total distribution even if a bit blocky.

The model has been tested using a simple step function and it works well. This function simply outputs a 1.0 if the datum is less than one standard deviation away from the mean and outputs zero otherwise. I have also tested the normal bell curve gaussian distribution as well as a simple tent function. The normal gaussian gives a smoother fit while the step function has abrupt cutoffs and the output looks more like a bar chart. The tent function I use appears to gives results somewhere in the middle. The best choice of all turned out to be what I'll call a t-distribution. I developed it ad hoc but I've noticed that the form I use is very similar to the so-named Students Distribution or t-distribution, which is a best guess estimate for small sample sizes and unknown distribution. See test results below for a demonstration of how different external models affect the GASM model's output.

Now we have a way to get probabilities from our leaf nodes. Basically it is the relative frequency times the dropoff factor per distance from the mean. We also have a scaling issue though. To address the scaling problem at the individual node, I multiply by 1.0/sqrt(variance*twopi) which is the form used in the 'normal distribution'. This doesn't mean we are using the normal gaussian curve. We are just using a similar scaling value. This value normalizes the gaussian curve to give an area of 1.0 so the probabilities sum up to 1.0. Our case is a bit different though.

I have not been able to mathematically establish theoretically how my method of adding up contributions from each node is or could be normalized (probabilities should add up to 1.0 in their totality), but for the purposes of using it as an adaptive engine, I don't think it matters so long as one is aware that the output generated is not a conventional true statistical probability. Actually it is closer to what is called a probability density. My model tested very well indeed against a straight histogram distribution when a scaling factor was applied to the histogram. To scale the histogram, I multiplied the histogram relative frequencies by the number of histogram intervals divided by the range. This equates to dividing by the distance covered per interval. The model always fits optimally no matter how many samples are taken or how many nodes are used, so the scaling solution turns out to be correct. To see a technical resolution of the above scaling issue read tech note 2.

The arbitrary use or insertion of an arbitrary distribution model occurs only at the external or interfacial access methods to the model. Internally the model remains free of those assumptions.

The final form of our summation of probability densities looks something like this:

sum of ProbabilityDensity_subi = 1.0/sqrt(s1*twopi)* n1/totaln * tent(x-m1,s1) + 1.0/sqrt(s2*twopi)*n2/totaln * tent(x-m2,s2) + etc. + ....

on up to the top of the tree. Each contribution thus gives an output adjusted by distance from the mean multiplied by its relatively frequency at that node and a variance scaling value. The total sum is then divided by the number of layers or steps to give an average.

Since the above methodology gives excellent empirical matches to various arbitrary distributions independently of sample size, node counts, and output function used, I believe there must be a theoretical justification also.

There is just one caveat to how the output functions need to be structured. They all need to return a maximum value of unity at X=M and their form uses the common standard z variate defined as (x-m)/stddev.

 

Test Results

To test the model I needed lots of data that had some distribution, which I could chart with a conventional histogram and compare the model to. My solution was to simply use a weighted random generator of my own design (see Appendix 3) which outputs points at random which tend to cluster around some target value which I use as a parameter. To complicate things a bit, I use 2 or 3 different target values chosen at random to establish various distributions.

As the generator spits out data points one by one, I add them to the model's cluster tree. The model thus gets the conventional treatment of having a so-called random variable to sample. After adding 10,000 points in the runs below, I plot the conventional histogram values in white. Then I plot the model's estimate of the distribution in green. Each pic below is a separate run with the model initialized to progressively fewer leaf nodes.

As one can see below, the model performs well even with just a few nodes. It degrades gracefully all the way down to one node, which basically gives the typical bell curve approximation to the data. The model thus does not need extensive resources to track complicated distributions well.

It should be emphasized at this point that when the model reaches its arbitrary limit of nodes, it stops adding new ones and just readjusts itself to reflect the statistical changes with each new datum thereafter. The data numbers themselves disappear while the descriptive informational change they induce is distributed throughout a single pathway in the tree each time.

The tree algorithm is a least path computationally and the statistics of any node are exactly (within the limits of double precision floating point math on a computer) determined by the true statistical values inherited from the nodes below. For all runs of 10,000 data points, the top node ends up with 10,000 as its frequency count. Any number of nodes can be chopped off the bottom of the tree at any time without affecting the statistics! The tree always computes its stats and ends up with the same numbers as if the entire data set were present and computed at the same time. Of course, we add the caveat that the resulting resolution across sample space is diminished.

To test the above assertions, I also programmed in the ability to arbitrarily set the output function to only penetrate a given depth into the tree. The result of only computing 2 layers deep into the tree turns out to be equivalent to using 2 leaf nodes as depicted above. The two-layer of nodes plot below was made after exposing a 64-node tree to the same numbers as above. The plots are essentially identical although the number of nodes used are radically different. The model thus has the ability to output at full maximum resolution as well as varying stages of additive summing from any point in the hierarchy.

 

***

The next test was to just plot a purely random distribution. As expected, we get a flat distribution. Notice that the model curve in red (64 nodes) appears to have a similar degree of overall noise or variance as compared to the histogram.

Another note of interest is that the model appears to give a more balanced view of the data given small sample sets. The plot below was made using 500 sample points and then only 20. Note that the histogram in white becomes very noisy while the model in green tends to average out the fluctuations. At twenty data points, the histogram is so far below its number of fixed intervals (which was set at 200 equal intervals), it looses meaning as mentioned in our introduction. Our generic model though, still manages to approximate the basic shape. The model also always presents itself as a continuous function.

Next we show below the original distribution computed using different assumed external output functions i.e. t-dist, gaussian, tent, step, and unit shapes.

Notice that even the unit (just output a 1.0 value) works and would look even better if we added a final offset and scale factor to it. The tent function (i.e. '/\' shape ) worked very well and appear smoother but doesn't track the tail data on each end as well. The step function in contrast, adds too much noise but pulls the tails in tight.

The above results suggest that the model is indeed as free of arbitrary distribution models as one can have without taking the model to infinity. With enough nodes, of course, we could just represent each data point exactly. But that would be missing the point of our design to begin with.

To assure the reader that the model has not been fixed to work only with the specific distribution above, here are a few randomly selected distributions and the model's probability density plots.

 

Now what about the effect of changing the sample size as in our discussion of the virtual data window? The plots below demonstrate various sample set sizes for the same distribution with the same number of actual samples input. Recall that I was using 10000 random input samples in our first demos above. We will use this data set again but vary our virtual data window S.

Notice that the 30-sample window above is less than the number of nodes in our tree (64) and was super sensitive to any additions under the sharp peak on the left end of the distribution. Even at 250 samples, this sensitivity is still evident. The 100 sample window doesn't show the same peak, but remember we are using that virtual sliding window across the data and the data will be shifting, so the sensitivity may be there, but the data points weren't. By 1000 samples though, the model more or less has settled in on the final distribution. We should stress that the underlying distribution created by my weighted random generator is fixed and not really changing, so this demo does not really track a changing distribution as one might want under certain situations. The problem still remains as to how to determine the best sample size for changing distributions and I haven't developed any methodology at this point, but at least the model now has the ability to shift its scope in time.

Moving To Higher Dimensions

Now for a test of the model in higher dimensions. For this demonstration, we will use a 2dim image as our data set and encode the red, green, and blue color components for each pixel at various (x,y) positions. This gives a five dimensional input vector X = { x,y,r,g,b } which we sample and input into our model.

The picture of the small child below has approximately 300x300 = 90,000 pixels. We will sample 10000 points and add them to the model. Obviously we cant do the simple linear distribution plots as above without throwing out a lot of information. We can however scan the image and just use the output from the model as an intensity scale factor. This is what we get with 64 leaf nodes and 10000 samples.

The lighter areas represent higher probabilities and the darker lower. Notice that the T-shirt area seems to be cut up into pieces. This occurs as a result of the partitioning of sample space. We are using a 5dim space with a large range of component values and attempting to cover it with just 64 nodes. The net result is that out model shapes itself to the data and represents variable areas with more detail. Uniform areas get partitioned across fewer nodes.

We can illustrate this more starkly by pointing our model at a plain whitespace image! Below is the result. Since the data is random sampling of a uniform field, the nodes adjust themselves more or less equally across the domain.

Notice that model automatically creates a patchwork quilt of random sample domains. The right picture actually shows the location and variance of nodes over the domains they represent. The domains on the right are different from those on the left because it was a different experiment and we are taking random samples but the pattern is similar. The nodes plotted are from the top half of the tree so they represent higher level averaging of the leaf nodes. Notice also that the largest (i.e. root node) and top two child nodes below it partition the 2dim space pretty equally. One can also see the lower nodes doing the same with their children. Eventually one gets to the lower output level and nodes which lie in the centers of their domain while the averaging nodes tend to lie on the dividing lines. The shading correlates with distance from the node means and also with the relative frequency of data points actually input at that node (remember we are using random samples).

We can overcome the discontinuities by using total output from all the nodes in the tree as in the pic below, but it is moot as to whether we want the processing overhead in our normal use of the model. Still it is available when needed. We also show a node map for the same experiment. The colors are the actual RGB color means and the diameters are the standard deviations.

We can even do a partial reconstruction of the original image by just plotting a node map using a really large number of nodes (10000) as below. This is not the purpose of the model, but it does illustrate the validity of its behavior.

Views Into The Model

There are two main aspects of the model, which we can visualize. One is the actual numerical data contained within the nodes of the tree. The other is the structure of the tree itself. When we presented the histogram view in our original test result, we were only looking at the probability density data on the y-axis versus the means on the x-axis. When we switched to the image of the small child we were using x and y as our mean data and pixel intensity as our probability density. Both views were looking at the data, not the tree structure.

A text dump of the numerical data is another view into the model. This is the most complete and accurate view but does little for visualization purposes.

The node map view maps two of the components of the mean data as (x,y) positional coordinates and the remaining mean components as color values. The variance of the nodes is represented by the diameter of the colored circles. The node map sort of gives a view looking at the tree from the top down.

We can also graph a dendrogram showing the structure of the hierarchical tree more explicitly. This view gives quick visual reference to clusters and how they are structured. Below are two variations on this theme. We are using the same data as presented in our first test results section. The left side shows the original histogram probability density plot. Underneath it is a simple layout of the binary tree linkages. On the right side is a bar graph of the nodes. The height of each rectangle represents the relative frequency of the node, the width represents the variance, and x-position represents the mean. The color is depth down in the tree. Red is thus the top node and has two green nodes under it. The height (relative frequency count) of the child nodes added together always equals the total of their parent. Each green node has two blue children; each blue child has two yellow children and so on down the tree.

Notice how the resolution increases (variance gets smaller) with depth down the tree. The two green rectangles are the basic two peaks seen in the histogram with the taller one on the right having more counts. Notice also that the left green cluster has a right child (in blue) which does its best to span the sparsely partitioned area between the two main clusters. This creates the small hiccup in the histogram view, which could be solved by adding more nodes to the tree.

Notice also the dendrite view on the left shows pretty well how fewer nodes span the sparsely partitioned areas in order to cover the entire sample space.

***

Summary

Well, that about finishes the model. We have a data-driven binary tree that partitions sample space with changing resolution as more data comes in. In addition we have stored in the nodes of the tree, the mean-variance-frequency stats which establish each sub-distribution's relationship to the whole distribution. The core premise that every distribution is recursively defined as a distribution of distributions has been justified.

Since the model adapts its resolution to match what it is given, no prior assumptions about the extent of the sample domain and number of intervals are needed. This means it can work with very small sample sets and at the same time grow to encompass massive amounts of data. When it reaches its own resource limits it simply goes virtual and records only changes in the shape of the distribution. This makes the model robust and versatile.

A simple probability density output function makes accessing the distribution simple and straightforward.

The model works with arbitrary depth of resources and arbitrary temporal focus (virtual sample size).

The model is essentially independent of choice of output model and is thus generic, and makes no assumptions about the final distribution either.

The model adaptively shapes itself to the data in real time and shows graceful degradation as resources are lowered.

The model has the ability to refresh itself as needed so overflow is not a problem.

The model can be viewed in various ways for visualization purposes.

 

 

Extending and using the model

We can take the probability density function of our model and easily compute entropy values across the sample domain. This is given by the sum of -p(x)dx*log2(p(x)dx) across the domain X which we divide into suitable intervals of dx. Other measures such as 'mathematical surprise' are made possible also by our probability density function.

As a tool for a future AI engine, I think the model has good potential. Surprise and temporal focus might be coupled to enable the model to shift its focus toward data that it finds interesting.

The model can be easily merged with other instantiations of the model built on other data sets. This could have use in a system that compiles new information stores into a comprehensive system.

 

Caveats, Flaws, and Technical Notes

1) Ordering Sample Space
While developing and describing the cluster tree an inaccuracy was allowed for the purposes of simplicity. We used a one-dimensional demonstration and obscured the fact that the model really uses vectors. Now with just one-dimensional numbers it makes sense to put a new node to the left if it is less in value compared to its closest neighbor. Vectors have a direction though, and we cant simply add a node left or right based on magnitude alone and still maintain the simple description we used when discussing adding leaf nodes. The partitioning distinction is independent of how the nodes are actually added, though, so this isn't perhaps a real problem for that aspect of the model.

I noticed this problem when writing and testing the actual model. The solution I finally ended up using was to just take the first dimensional components of the vectors and compare them. This keeps the bottom leafs sorted by their first dimension. I actually tried using vector magnitudes and even tried randomly putting the new node left or right of its nearest neighbor. All methods work. The partitioning search algorithm still finds its way to the correct location. It's just the ordering of the leaf nodes that changes. This situation points out a difficult and interesting problem of partitioning higher dimensional sample space.

If one imagines a real 3dim living tree, he can get a sense of this problem by thinking about how the branches have room to twist and intermingle in the extra dimensions. The binary branching still applies though.

2) The Scaling Problem
Scaling the histogram values to match the probability density output of the model turned out to be easily done. Considering that the probability from a small interval on a probability density curve is given by p(x)dx, we can conceptualize this in our histogram as the actual relative frequency in one particular slot of our distribution. We thus have relative frequency = F(x) = p(x)dx. Dividing by dx gives p(x) = F(x)/dx. All that remains is to determine what dx is in our histogram. dx is simply the interval in which x is put. This equates to the total range across sample space which we arbitrarily established, divided by the number of slots or intervals in our histogram. Therefore the actual scaling value applied to our histogram to get p(x) is scale = (intervals/range).

p(x) = scale * F(x) This scaling gave very good matches to the output from my model and tends to suggest that the output issues associated with the model are indeed resolved at least empirically if not theoretically.

3) Output functions
Below is listed the actual source code I used for various arbitrary output functions. The common thread is that they all must return the value 1.0 for X=M and they all use the standard z-squared variate form which is ((x-m)/stddev)^2. Changing this format from that as listed below causes the shapes to widen or narrow and this destroys the proper scaling relations needed to make the model correspond to a conventional flat histogram.

//The following distribution funcs use 0.5 *Z^2 where Z = (x-m)/stddev
// Z = the standardized normal variate of statistics
// Z^2 = delta*delta/var = deltasqr/var
// where var = stddev squared

double unit_t( double deltasqr, double var )
{
//this is same as 1.0/(1.0 + 0.5*deltasqr/var)
return var/(var + 0.5*deltasqr);
//this is the real Cauchy t-distribution for 1 deg of freedom
//return var/(PI*(var + deltasqr));
}

//normal bell
double unit_gaussian( double deltasqr, double var )
{
if( var > 0.0 )
return exp( -0.5*deltasqr/var );
else return 0.0;
}

double unit_tent( double deltasqr, double var )
{
if( var < SMALLVAR )var = SMALLVAR;
double y = 1.0 - 0.25*sqrt(deltasqr/var);
if(y < 0.0)y = 0.0;
return y;
}

double unit_step( double deltasqr, double var )
{
// same as |x-m| < sigma
if( var < SMALLVAR )var = SMALLVAR;
if( deltasqr/var < 1.0 )return 1.0;
else return 0.0;
}

double unit_unit( double deltasqr, double var )
{
return 1.0;
}

4) Is Random Sampling of Data Assumed?

The model doesn't specifically depend on random sampling, but there is a side effect. Suppose we received datums in sequential order that had a progressive increase in value. As we add nodes to the tree they will get used up one by one till we hit our preset arbitrary maximum or resource limit. Any new data outside those bounds in sample space will then tend to stretch the variances at the extreme end nodes and move their means outward. We might thus end up with a strongly biased model, which has high resolution in one area only. If one thinks about how we humans react to input though, this still seems in line with how we might want our model to actually work. If this does indeed become a problem, we might modify our model's methodology by initializing it ahead of time with a set of nodes equally distributed across sample space. I haven't actually tried this, but I believe it would work.

5) Input Standardization.

This hasn't been discussed but it is probably best to standardize the individual components of input vectors so that they all lie within the same or similar range. Otherwise the partitioning distinction might be more sensitive to differences in one component over another. In my own tests, I have used ranges of -1.0 to 1.0 and 0.0 to 1.0 with equal success. I don't think this matters theoretically, but as a practical issue when it come to looking at various views into the model, it can become an interpretive issue.

6) Vector Math

I coded the model to deviate a bit from traditional vector math by allowing vectors with different dimensionality to co-mingle with each other. I did this with the intent to allow for missing data fields in certain situations. The idea is to simply return zero when accessing a component field past what the vector holds. Sums, differences, dot-products, magnitudes etc. seem to work fine using this protocol but I haven't actually tested situations where it would occur in my model. Sequence order would have to be rigidly adhered to, though, so applying different dimensional input vectors into the same model would have to be done with care.

7) Divide by Zero Problem

There are numerous occasions in the program code where division by zero is possible. Mostly it has to do with handling of variance values. I resolve this problem by using a small additive variance which I call SMALLVAR = 0.00000001. This has worked well and no system crashes or complaints have cropped up. The variance of all input datums thus get initialized to a very small SMALLVAR value. Subsequent inputs at that location then tend to fatten the variance of the leaf nodes and all is well. Interestingly, the relative frequency counts at the leaf nodes tends to balance out the small variances so the output scaling manages to normalize itself accordingly.

8) Partitioning Sample Space Problem

When using higher dimensional vectors, the partitioning is difficult to visualize. The best we can do graphically in this paper is a 2dim output on the xy plane. The projection though clearly illustrates the sort of partitioning that is going on. When working with random samples and a model that shapes itself dynamically to the data, areas with higher detail variance will get more nodes assigned to them. Uniform areas will get fewer nodes, and the partitioning becomes apparent. Refer to the 'small child' example in the test results. Now one may view this as a problem or a virtue. The partitioning shows up because our output function goes straight to the nearest leaf node to begin its ascent collecting the probability densities. The partitions represent the critical boundaries between these leaf nodes and the shading occurs relative to distance from the node. The boundaries themselves evidence discontinuities of shading because of the discrete differences between the relative frequencies of neighboring nodes. We can overcome these discontinuities by using output from every node in the tree as above, but it is moot whether we want to use this mode as it is not an accurate probability density. On the other hand, maybe it really is a best guess estimate of what the true underlying distribution is.

 

 

Appendix 1

Derivation of variance for two sample sets combined.

NOTE: this derivation is done using scalar numbers instead of vectors. I have performed the same analysis using vector math and the result happily turns out to be the same with the condition that where one sees a square term it is really the square of the length of a vector. The math involved dots products and since they are sums of products of components, I was able to separate and rearrange summations similar to what we do below.

 

Recall, variance is defined as:

First we need to reformulate the above into something we can use with limited text editing so:

We shall use the term s as the variance but leave it understood that we are talking about the term as defined above with a square in its designation. The sum involves squares of deviations from the mean, so we shall write this as summation of (x-m)*(x-m). Carrying out the multiplication of (x-m)*(x-m) gives each term of the summation as x^2 - 2xm + m^2.

therefore s = 1/n times ( sum of x^2 - 2 times sum of xm + sum of m^2)

Multiplying by n and rearranging terms gives:

sum of x^2 = ns + 2 times sum of xm - sum of m^2

Now since m^2 is a constant added n times in the summation, we have sum of m^2 = nm^2.

sum of x^2 = ns + 2 times sum of xm - nm^2.

Similarly, 2 times sum of xm = 2m times sum of x

but the sum of xi = nm so we have 2 times sum of xm = 2nm times m = 2nm^2.

Going back up to sum of x^2 and substituting we have:

sum of x^2 = ns + 2nm^2 - nm^2 = ns + nm^2 = n(s+m^2)

Now we are ready to use the above manipulation to add two sample-sets and get their variance. First we reformulate using s' and m' as the new variance and mean and n1 and n2 as the sample sizes or frequency counts.

This gives s' = 1/(n1+n2) times ( sum of x^2 - 2 times sum of xm' + sum of m'^2)

Note that sum of x^2 can be broken down into two groups, the sum of x's from sample set 1 and the sum from sample set 2. Also recall that we derived for a given sample set that

sum of x^2 = n(s+m^2). We thus have the

total sum of x^2 = n1(s1+m1^2) + n2(s2+m2^2). Substituting in the equation for the new s' we have:

s' = 1/(n1+n2) times [n1(s1+m1^2) + n2(s2+m2^2) - 2 times sum of xm' + sum of m'^2] using the same substitutions from above,

- 2 times sum of xm' = - 2(n1+n2)m'^2 and sum of m'^2 = (n1+n2)m'^2 so:

s' = 1/(n1+n2) times [n1(s1+m1^2) + n2(s2+m2^2) -(n1+n2)m'^2 ]

simplifying we have:

s' = (1/(n1+n2)) [ n1(s1+m1^2) + n2(s2+m2^2) ] - m'^2

 

Appendix 2 - Footnotes

Definitions
http://mathworld.wolfram.com/ArithmeticMean.html
http://mathworld.wolfram.com/Variance.html

Cluster Theory
http://www.statsoft.com/textbook/stcluan.html
http://en.wikipedia.org/wiki/Cluster_analysis

Student's Distribution - to read a description of this distribution go to http://mathworld.wolfram.com/Studentst-Distribution.html

The student t-distribution for 1 deg of freedom is given by:

f(t) = 1.0/(PI*(1.0 + t^2))

using t^2 = deltasqr/var we have after rearranging terms:

f(t) = var/(PI*(var + delatasqr))

This is very similar to the ad hoc function I found to work best with my model:

f(x) = var/(var + 0.5*deltasqr)

This affirms my belief that the GASM model does indeed function as intended since Student's t-distribution was designed on theoretical grounds to cover small sample sizes of unknown distributions. This is exactly the case with our model since we are treating individual datums as distributions with n=1 and adding them to other distribution nodes with small sample sizes. I regard the whole exercise as theoretical justification for the construction methodology of our model as well as empirical reaffirmation of the t-distribution as being what it says it is.

 

Appendix 3 - Weighted Pseudo-Random Number Generator

Below is the actual source code for my weighted random distribution. The // slashes are comments

//weighted random generator return 0.0 - 1.0 but weighted toward
//target value; -1.0 = error
//this gets a rand and compares it with target to get the
//spread for getting the next rand
//The further away it is from target the more chances are likely
// and the more likely that the target will turn up in the final
//result. The degree of steepness of the distribution is controlled
//by degreewt 0.0 = flat random distribution
// degreewt = 1.0 = steep peak at target value
// degreewt < 0.0 gives a step function heavy on the side less that target!!
// degreewt > 1.0 just makes it even steeper
//cool function!!!
// degreewt 0.5 gives what looks like the normal bell curve!
// target must be between 0.0 and 1.0
double wtdrand( double target, double degreewt )
{
static double r = 0.0;
static double s = 0.0;
if( target < 0.0 || target > 1.0 )return -1.0;
r= randdoub(0.0,1.0); //uniform distribution
s = degreewt * fabs(r - target); //spread
if( r == target )return target;
else if( r < target ) return randdoub(r, r + s);
else return randdoub(r - s, r);
}