Back
SYNOPTIC LINKS:
[ weighted measure between intersection vertices counting analysis Cluster connection inverted weights general original process structure information relations overlap defined - Synoptic Hyperlinks generated by Shlep - A Paisley Product from TinkerSoft; Click on (*) links to return to this point]

Toward A Theory of S.P.A.N.s

(Sorted Pair Associative Nets)

by Swinton Roof

creator of

Paisley Products by TinkerSoft

July 2001

 

INTRODUCTION

This paper is an attempt to formalize an approach to dealing with a class of problem conventionally represented by what is called an undirected weighted graph. A graph in this context is a set of points or vertices some or all of which are connected by lines or edges. Real world problems associated with this class of graph can be extremely complex and enormously difficult to solve.

An example of a weighted graph would be a road map with each path between cities given a weight or value of distance or perhaps gas needed to make the trip from any city A to city B. Most approaches to these sorts of problem center around brute force algorithms for traversing the network of edges or finding out if a path with certain criteria exists. The theory I am proposing, however, is more interested in seeing what kinds of general information can be gathered about the structures, patterns, or groups of behavior that might characterize the network as a whole. Throughout this paper I will be using the words 'net' and 'network' in a loose sense to refer to a graph or portion of a graph. This model is not a statistical model, yet it can easily be modified to include statistical methods. This model is not is a technique for getting at specific problem solutions. It is a tool for general feature analysis only.

While this will not be rigorous mathematical treatment, nonetheless, an attempt will be made to formalize some of the concepts which I have been developing in the course of a programming project I have been working on. I have been working on a way to analyze text documents abstractly in terms of basic frequency and proximity relations between words as they occur. It is my contention that useful meta-information can be gleaned from any data stream without actually knowing anything at all about the meaningful content of the data. This idea is at the heart of most file compression schemes and has real world value. I wish to extend it's application to a level where it might have a descriptive use in many different areas.

As I made progress with my programming effort, a persistent problem kept creeping into the picture. I found myself creating new and more complex data structures at each level of analysis. In addition I found myself dealing with complex recursive functions which were difficult to implement, understand, and verify. Curiously, though, there was a recurrent pattern to my algorithms. This situation inspired me to attempt a formalization of my ideas in the most general sense possible in the hopes of arriving at a general methodology. An abstract generalized algorithm that was not tied to the particular datas involved was what was needed. That is the main goal of this theory.

This paper in no way pretends to be rigorous or complete. It has been written mainly for the author's benefit, but the methods and ideas may be of use to others also. In addition, the process of writing so that someone else can understand, helps to clarify my own thinking.

 

BASIC PREMIS, ASSUMPTIONS, CAVEATS etc.

This theory, in it's initial offering, will concern itself with static graph problems. The data will be assumed to have a fixed state open and accessible to analysis to any extent required. Real world problems might not be so simple. One might be dealing with data points that vary over time or that have a certain degree of uncertainty or lack of precision or with changing connectivity. It is my belief that a more generalized version can be fairly easily incorporated by introducing windowing techniques and probability measures or fuzzy set 'weighting' into the picture, but for present I will stick with this simplification.

My basic model will be expressed in some of the jargon of finite set analysis or maybe it's called finite model analysis. I have neither the mathematical skills nor intent to do real theorems etc. I will instead be using whatever symbolic tools I feel appropriate to clarify my ideas. Since what I am doing is aimed at achieving a framework for writing computer code in C++, some of my text will use programming language or concepts. It is appropriate at this point to say that I will in the main part be using names for variables, entities, or terms in a referential sense. By this I mean that the symbol or name refers to some entity, class or data structure, but not usually to a real number value. This is different from say, algebra, where a variable x stands for a real number or integer. It is my intent to get at real numbers or integers where needed, by providing functions which return said values. This is common in practice when getting at an actual number might entail processing a complex data set.

The basic premis of this theoretical approach will be that certain basic concepts of set theory such as the 'intersection' and 'union' of sets can provide useful techniques for extracting linkage information and general features from a finite connected graph with unknown topological connectivity. Along the way, an assumption will be made that one can simplify things enormously by sorting relationships by degree and separating the wheat from the chaff. This assumption keys into the fact that all the edges (relations) of the graph under consideration are weighted and thus have a measure which can be used for sorting purposes. This approaches differs radically, I believe, from sophisticated theoretical treatments that grapple with the actual topological issues headon. I believe, in this way, I can side step many of the real nightmares one encounters when working with graphs or networks.

An objection to sorting and selectively skimming the results is that some information is thrown out. I believe it can be shown, however, that one can control this aspect with heuristics. One can even do a statistical analysis on the items being sorted and thus confine the items to be kept to within a certain range based on the standard deviation of values. I believe one must agree, however, that if one is not able to perform such a heuristic, it means either that the net is totally connected to begin with or that the weigted edges do not have sufficient deviation or resolution to do a reliable sort to begin with. In such cases, I maintain that the graph itself is not amenable to my type of analysis, except to perhaps to state that very fact. This could in itself be useful in certain cases, and should be detectable by sufficiently aware coding traps.

Speaking of weighted edge relations, it should be pointed out that those data points are assumed to be provided by some source outside the main scope of this treatment. We will not concern ourself with extracting those values, only accessing them as givens - typically in an array or as a set. In my own programming project, I parse words into a dynamic tree structure of my own design which provides in addition a flat access array for indexing and sorting. All access to these datas is via references or pointers so that a minimal of copying and memory space is used. Thus we assume that our object of study - some undirected weighted graph - is a given. Actually I have a caveat here. Some of my discussion may actually involve reconstructing a graph or linkages from primitive or lower level datas. The chief assumption, however, is that the weights along edges is fixed for our purposes, and will not be monkeyed with, except as the raw data for analysis. We can view them as constants associated with a static graph as said before.

 

INITIAL ENTITY ASSIGNMENT - VERTICES & PAIRS

In general there are relations between n points. Since we will only consider undirected graphs, the edge or path from A to B is equivalent to that from B to A. If we ignore a vertex's relation with itself, the number of possible edges reduces to (n/2)*(n-1). The second condition is that the edges will in general be weighted to have some value or length. In general, the number of actual edge connections or links between vertices will be some intermediate number and may even be unknown or perhaps a function of the weighting values. It is our hope to somehow sniff out some of the structure lurking within such an unknown mess of connectivity.

Suppose we have a graph G = { V0, V1, V2, ... Vn } which is simply a set of n vertices.

note: As stated above we will be using references to data only. In programmer speak, each of the V's above will be a void *pointer. Such a pointer can reference any datatype we desire and gives great generality and freedom from worrying about data structure (at least for now).

We can define an unordered pair P = { Vi, Vj } for any two vertices Vi and Vj that are within the set G. For our purposes we will assume P{Vi,Vj} = P{Vj,Vi}.

Any unordered pair thus defined represents an edge within the graph and connects vertex Vi to vertex Vj. We will interchangeably refer to such pairs as edges, lines, relations*, associations or links. It is very important to not be misled by my using these terms in different contexts. The important thing is that the raw data structure remains the same. The differing terminology, however, helps to highlight the semantics of a particular context. Vertices may be similarly referred to as points or nodes and later even as stars. The protocol will be to stick to vertex and pair as the nominal entities however. Speaking of protocol, the intent is to highlight the symbolics, functions, and data sets in bold text. Failure to do so will be evidence only of typographical error on my part.

We assume that a method exists or will be provided externally for generating these pairs and evaluating their weight or strength. Such a method can be declared as say, r = pairwt( pair& P) where pairwt() is a function which returns a real number value r and the argument to the function is a reference to a pair called P. The & symbol stands for 'reference' in C++ language lingo.

In my own program I use a simple alogrithm for getting and accessing the pairs and their weights. Starting with an array of vertices (words extracted from a document in my case) I iterate over a double loop and get the pair combinations. I evaluate the strength of a word pair association by using data previously garnered when parsing the words from a document. I use a dynamic tree to store the words along with a list of all their positions as they occur in the document. My premis for association (remember to think 'pair' here) is that words which occur frequently in close proximity to other words have some sort of semantic ties or other contextual unity which bind them together. The associative potential or edge strength or weighting factor (see what I mean about alternative terminology?) will thus be a function of both frequency of occurence and proximity or closeness spatially. Thus we have a measure of the spatio-temporal relatedness between two words in a pair. This idea, while not to be seen as the core concept of my model, nonetheless was the initial concept that led me down this path. Striving for a generalized model, we want to only assume that such pair or edge weights are somehow arrived at. They could be physical values like road distances, airplane fuel costs, etc. or they could be abstract linkages like resonance, associative potential, personal charm, monetary liquidity or what have you. The goal, as stated initially is to somehow see what kinds of information we can gather about the structure of the whole network or how it's connectivity leads one to visualize groups or subsets of the whole with their own features.

FIRST ORDER GRAPH ANALYSIS

Let us define an inverted graph IG = { P0, P1, P2, ... Pn } where we simply have replaced all the vertices Vi with all the edges. The set now consists of all the unordered pairs between all the vertices.The number n in this case = (N/2)*(N-1) where N is the original number of vertices in G. The inverted graph IG is entirely congruent to the original graph G. This is merely a homologous reformulation of the same information with a lot of redundancy thrown in. We shall soon see how we can put this redundancy to good use.

One might have a graph or net in some particular case where some connecting paths simply don't exist as in say a road map where there aren't roads connecting every single city with every other city. This is ok. The pair evaluation function can generally be set to return zero or some error indicator to indicate a null path. This useful device will enable us to sort the set and throw away the stuff we dont want. In fact, we are counting on it.

This inverted graph definition is simply the set of paired vertices with a weighted value for each pair element within the set. This is to be our prime data structure for starting our analysis. It is important to notice that all the vertex information is embedded in the pairs and can be used to reconstruct a new graph at will. After sorting, skiming the cream off the top, and recombining, however, the new graph will be a simplified version of it's previous uninverted self. It is hoped that successive stages of this winnowing technique will reveal a meta structure of information about the full blown net and it's connectivity.

So there you have it. The reason for the extra redundancy is to enable us to dissect the graph into pieces. We can literally explode the graph into a random pile of edges now and still reconstitute it by snapping all the edges to each other where they have matching ends. Below is a visualization of an exploded graph and it's partial re-assembly. Notice the structure with vertex a at the center. This vertex has five links to other vertices. We have taken each pair with an 'a' and snapped it onto the a vertex. When viewed thus on its own as a sub-graph, it appears as a star. I will be using this term for such stelliums of linkage. Notice the redundancy in the exploded or inverted view. Also notice that we have thrown away some of the possible pairs and achieved a simplified restructuring of the whole. This simplification is what I call 'first order' analysis.

 

When we first create the inverted graph set with all paths initialized, we might want to throw out any extraneous linkage or null links. In my own word graph, I found it useful to simply ignore words of a certain minimum length because they weren't really significant to the content of the text, but merely serve as syntactic modifiers etc. Similarly, when we create any graph there will in general be ways to heuristicly simplify and clean up the starting data set before analysis begins. This is no problem with the inverted graph form of the data.

Once the data is thus paired and put in some type of array (set) we are ready to sort it. Once sorted, we can arbitrarily skim off however much we desire off the top to get the salient edges, and then relink. The relinked structures highlight the major infrastructure components that may be the backbone of such a net. We are dealing with an n-dimensional entity here and normal visualization methods may fail us. This method at least appears to extract major players in such a inherently complex structure.

 

SORTING AND SKIMMING

The ability to sort edges by their weighting factors is at the very heart of this method. Sorting enables us to rank the pairs and discard some portion whose value fails to meet our criteria of significance. The S.P.A.N. (Sorted Pair Associative Net ) model depends on the ability to rank pairs and sort them. Much of our anaylsis from now own will thus center around forming higher order pairs and extracting a weighted value so that may be sorted. Our lowest level first order pairs are assumed constant and given by some pairwt() function. We will have to come up with our own methods for getting the higher order pair weights.

A typical sort function compares the value of two entities and returns a comparison value. The usual return is 0 if the two items are equal, -1 if the first is less than the second, and +1 if it is greater. Sort functions thus from the get-go deal with pairs. It is a natural extension to sort pairs themselves.

Since the sort function is at the heart of making analytical distinctions in our model, we can tune our methodology in various ways with the sorting and skimming process. If we are interested in a minimal solution, we might use an ascending sort, while a descending sort would be useful for maximal solutions. The skim or cutoff function is where we make the actual decisions to keep or discard. There is opportunity at this point in the methodology to introduce various criteria or seives. These can be statisically based or just a simple threshold cutoff will perhaps suffice.

In my own project, I created what I call a skim sort buffer. To deal with the monstrous combinatorial aspects of large inverted graphs, I decided to forego a full blown structuring of the graph data. My skim sort buffer has a fixed (but adjustable) windowing size. Since the exploded inverted graph has the property that we are free to juggle the edges around at will, I simply take one pair at a time and add it to the skim sort buffer. When the buffer fills up, it initiates an automatic sort, and throws out the rejects, leaving room for a new round of additions. The program can thus throw as much data as it wants to at the buffer and it just keeps on skimming and stays up to date with a final sorted extraction.

The beauty of the sorted pair inverted graph data structure is that analysis can proceed immediately from any entry point or range and run in more or less linear fashion. No recursive function calls or link chasing is needed except in a minimal way.

 

STARS

In our original re-assembly of the illustrated graph, we assembled what I call a star.

Define a star S(V) = { P0, P1, P2, ... Pi } as a set of pairs that all share one vertex in common. For every P{Vi,Vj} in S, either Vi or Vj = V. Thus a star or star vertex is an inverted vertex in the same vein as our inverted graph was. This common vertex will be called the identity vertex.

Since the 'star' formulation of a vertex has the same structure as the inverted graph, it in fact 'is' an inverted graph and amenable to the same sorting techniques. This is not our intent yet, however.

If we thus start with an expanded data set, sort it, skim the cream, and reassemble we end up with

Reduced Graph RG = { S0, S1, S2, ... Si } where i <= number of pairs skimmed.

Now we have achieved a first order simplification of the original graph G. In general we will usually end up with fewer stars than we had original vertices. These 'star vertices' share the 'staring role' in the infrastructure because we have sorted out the less significant material.

How do we actually assemble RG? Note that the illustrated re-assembly made only one star of note. In general we may have many stars and desire to know how they relate to each other. That will require a second order analysis. First, however, note that any edges common to the illustrated vertex a and other vertices were coopted into a's set. This leaves some stars incomplete. It will be implicitly understood if a star has a pair linking it to another star, then that other star inherits this link reflexively. Actually every star, by implication, must have a reflexive partner, because the stars are built from pairs of vertices to begin with. We will need to flesh out these missing links. This can easily be done during the assembly process. I tried to formalize the situation with full blown ordered and duplicated pairs, but links could still turn up missing when we do our arbitrary cleavage during the skim process. Completion upon assembly turned out to be the best choice. We must have completion when we do a second order analysis.

We shall consider a star S(V) complete if it has the identity pair {V,V} which contains it's identity vertex V, and also contains a pair link to every other star that has a pair link containing V. Any star not complete shall be called degenerate and labeled with a small s.

Another problem is whether or not we consider stars which have no direct reflexive linkage to belong to RG or to another graph. We made no restrictions on connectivity during our definition of our original graph G. It is desirable for analysis to make this restriction now. We want to define RG such that there must exist some path that connects every vertex star in RG. In other words it must be connected. The end result of this definition will be that we may end up with more than one graph during re-assembly. This will provide extra information about any tendency toward clustering in the original graph G. Our model now requires that in general re-assembly we have G = { RG0, RG1, ... RGi} where i may be any number of connected reduced graphs from one to the number of skimmed pairs.

To fulfill this this connectivity requirement we shall have to come up with a way to ascertain whether two stars directly connect via one edge pair.

For any two stars Si and Sj , we may define a boolean operator '--' that tests to see if the two stars are directly connected - call it the contiguity operator. Then Si and Sj are contiguous if Si--Sj = true.

Si--Sj = true, if there exists a pair P in either Si or Sj which has a vertex equal to the other star's common identity vertex. We can come up with a more explicit formulation later, but this will suffice for now.

....

Now back to explaining the assembly process. First we must assemble a list of complete stars.

1) Take the first pair P{Vi,Vj} from our skim sort.

2) Create two stars

S(Vi) = { {Vi,Vi}, {Vi,Vj} } and S(Vj) = { {Vj,Vj}, {Vj,Vi} }

Notice that we start them both out as complete.

3) Get the next pair. Check each star we have created and see if either element of the pair is the identity vertex of that star. If it is, add that pair to the star and make note of which element it was. The other element will be the identity vertex of the reflexive partner to that star. We hold onto this information till we have scanned the whole list of stars and either added a copy of the pair to a star containing the other vertex as its identity or created another new star with that vertex as its identity vertex. This step involves duplication of every pair we add, but we need to make our stars complete for reasons to be explored later. If the pair is already found to exist in a star, we simply ignore that star and continue looking for it's partner.

4) Repeat until all pairs have been assigned to stars.

After we have assembled a list of complete stars, we now assign them to their appropriate reduced and connected graphs.

1) Arbitrarily create a graph with the first star; RG0 = { S0 }

2) Take the next star S in our list. For every Si in RG0 test for contiguity with S. If S--Si is true, add S to RG0. If not, create a new graph RG1 = {S}

3) Repeat for every star in every graph, testing for contiguity and adding if true. At the end, if not added anywhere, start a new graph.

Thus we end up with our desired final graph G = { RG0, RG1, ... RGi}

None of the completeness requirement has helped us yet but it is necessary for the second stage of analysis.

FIRST ORDER SIGNIFICANCE

Now that we have done a first order analysis. What have we achieved? At the very least we have a list of the most important vertices in the graph depending on our sort and skim criteria. These vertices are significant because they had the most links and the links they had were the stongest This can be interesting and useful in itself. In my own, project it was enlightening to see a list of words which had the highest significance to the overall structure of the document. Even more interesting was that when the individual links to other words was listed for a star (i.e. all the pairs that were elements of that star), it became obvious that the words shared some sort of semantic connection (as I was hoping) with the central vertex significator word of that star. In general*, this means that our final reduced stars show which links are most significant to that star! The significance thus extends internally into the structure of the reduced stars as well as the structure of the graph as a whole.

It would also be informative to rank the stars themselves now. We can get the sum of all the internal pair weights for a particular star. This value will be a function of the number of pairs in S and also a function of the individual pair weights. The added bonus here is that we can now sort the stars in our reduced graph RG according to a star's weight. We can even write a function like r = starwt( star& S). A reduced graph RG is thus not only a weighted graph, but a super-weighted graph wherein the vertices as well as the edges are weighted.

The S.P.A.N. method adds redundancy to aid in dissection, throws out the least significant chaff, relinks and ends up with a simplified version but heavier with significance because of the added weight factoring.

Another payoff is that we have a list of graphs, now. The original inverted graph has been simplified and restructured into perhaps more than one sub-graphs. These sub-graphs reveal whether the graph has a clustering tendency and which clusters are larger. Since we now have weighted measures for star vertices, we can even give these sub-graphs their own summary weights also if we choose to rank them. In fact, every element in our entire reduced graph data set analysis now has a weight value. We may at will use this to do special sorts or rankings etc. All this extra information can be used in our analytical first order presentations etc.

Another thing we would like to know is how these most significant star extractions relate to each other. That takes us into second order analysis. A side bar might be informative at this point, however. During my researches on the web, I encountered a technique called Euclidean Cluster Analysis. This method analyzes data sets to see if they can be organized into clusters or groups. The main technique is to create a proximity matrix for all pairings of the data. The data is then usually progressively merged into higher and higher groups of sets with close proximity. The methods and variations of cluster analysis are many and generally start out with a flat data set suggesting no graph type linkage. It is my belief that many aspects of my process mirrors cluster analysis, but we are dealing with something inherently more complex here which is not amenable to direct cluster analysis. On the other hand, my model perhaps might be a way to do cluster* analysis on flat data sets. I'm not mathematically schooled well enough to answer this question and am not currently interested in pursuing this, but the reader might be.

 

SECOND ORDER ANALYSIS

As stated above, our goal now is to see how stars relate to each other. First we need to see how pairs relate.

Given a pair set P{Vi,Vj} we can define the counting measure of P as |P| = 2. The counting measure of a pair is always 2, because it has two elements.

We may also define a weighted measure of P as <P> = pairwt(P) The weighted measure of P is a given in our context and inviolate.

What is the intersection of two pair sets? given two pairs Pa and Pb:

Pa^Pb = {a1,a2}^{b1,b2} = {a1^b1}U{a1^b2}U{a2^b1}U{a2^b2}

I am using '^' to stand for intersection of sets and 'U' to stand for union of sets.

The intersections of individual elements is null if different or is the element if the elements are equal to each other. a^a = a. If a != b, a^b = the null set{} The intersection of two pairs is not a pair and may contain zero, one, or two elements.

This enables us to define the counting measure of the intersection of two pairs.

The counting measure |Pi^Pj| = 0,1, or 2. It is simply the number of elements in the intersection.

For example:

|P{a,b}^P{c,d}| = |{a^c}U{a^d}U{b^c}U{b^d}| = |{}| = 0

|P{a,b}^P{a,c}| = |{a^a}U{a^c}U{b^a}U{b^c}| = |{a}| = 1

|P{a,b}^P{a,b}| = |{a^a}U{a^b}U{b^a}U{b^b}| = |{a}U{b}| = |{a,b}| = 2

Now we need to define the weighted intersection of two pairs: <Pi^Pj> There is no obvious way to do this as pair weights are inviolate and belong to a pair as a whole. Reason tells us, though, that the weight of the intersection should be proportional to both weights and that it should depend on the degree of overlap or intersection. The degree of overlap is just the counting measure of the intersection so we can put it all together as:

<Pi^Pj> = <Pi>*<Pj>*|Pi^Pj| = pairwt(Pi) * pairwt(Pj) * CM where CM = |Pi^Pj| = 0,1,or 2

Now we get to the good stuff!

What is the intersection of two stars? Since a star is defined as S = {Pi,....} it is simply a union of pairs. The intersection of two stars is the union of the intersection of all pair combinations between the two stars.

Therefore Sa^Sb = {Pa0^Pb0}U{Pa0^Pb1}U{Pa0^Pb2}... ... ...} for all combinations of Pi,Pj in Sa and Sb

Sa^Sb gives the degree of overlap between any two stars and tells us whether or not they are connected to each other (remember the contiguity operator?) It also tells us what other links are shared between the two stars. The intersection between stars depends on the stars being complete in order to function correctly. This is why we make a special effort during assembly to ensure completeness. Interestingly, it can be shown that if two stars intersect, they are at most two edge links removed from each other. This interesting property is a result of the way they are defined because each can have a link to a star that shares that common link between the two - an intercessor so to speak. The intersection then is diagnostic of two cases - direct contiguity and overlap* by proxy.

The intersection thus defined functions as an interfacial syphon for information without drawing out too much info. Too much information about the surrounding territory would render a star useless for analysis as too many other vertices would group to it. Any less, such as contiguity alone, would be too drastic a simplification that would short circuit any attempt to get a weighted connection between stars with any real utility. Having giving these observations, let us proceed.

Define the identity vertex of a star as I(S(v)) = {v} given by the pair {v,v} in S(v)

The contiguity operator may be defined as the counting measure of the intersection of the identity vertex of one star with the totality of the other star. Therefore:

S(a)--S(b) = |I(S(a))^S(b)| = |{a}^S(b)| = 0 or 1

Continuing in this vein, we may define the counting measure of the intersection between two stars as the sum (because of the Unions of intersections) of the counting* measure of all the possible intersections.

|S(a)^Sb)| = Sum of all possible |Pai^Pbj| between S(a) and S(b)

Finally, we need also the weighted measure of the intersection* between two stars. This is likewise simply:

<S(a)^S(b)> = Sum of all possible <Pai^Pbj> between S(a) and S(b)

Now we have all we need to proceed to analyzing the stars in our first order final graph G = { RG0, RG1, ... RGi}

 

SECOND ORDER PAIRS

Look carefully at what we just defined* above:

<S(a)^S(b)> = Sum of all possible <Pai^Pbj> between S(a) and S(b)

This is a weighted measure! Recall all of our pair weights in the original inverted graph IG = { P0, P1, P2, ... Pn } and the pairwt(P) function.

<S(a)^S(b)> is entirely analogous. The two stars are inverted vertices and we have just derived a weighted connection strength between them! This weighted edge between stars will be zero if the stars dont intersect and some real number value if they do and to the degree that they do. It also will depend on the number of connections and also on the individual weights. Thus we have a very sound reason to proceed to a second order interpretation.

<S(a)^S(b)> is just a second order pair weight between vertex a and vertex b !

We can take any two inverted star vertices in G = { RG0, RG1, ... RGi} and produce a second order pair connecting them.

Now I wish to take a theoretical leap across the proverbial chasm and flat out propose that since a star is just an inverted* vertex and since a star is just a sub-graph and since a second order pair is just a weighted connection* between two stars, we can replace the whole theoretical construct at this point with our original type of data set we started with!

replace every Si(v) with its identity vertex v and replace every second order pair with its analogue using pairwt(P{vi,vj}) = <S(vi)^S(vj)>

What we end up with is the exact same type of data set we had originally, just a lot of pairs connecting references and having a weighted measure*.

This means we have closed the loop and can go straight forward into a first order sort, skim, and re-assembly with this new data set. When we do this we will have achieved a second order analysis. This is the whole crux of my model!

Since we have gone into a higher order loop and returned to the exact same data formulation, the process may be iterated or recursively processed (ad infinitum?)!

 

CLOSING THE LOOP

We need to discuss the theoretical leap that the model makes when it de-inverts the stars into vertices. I call it a leap because I dont have a rigorous reason to make this step. It seems to me, however, that the second order pair intersections capture enough of the associativity between the stars to justify dropping the extra linkage information* lost when we replace S(v) with just v itself. In fact, I think that since we have to discard some info anyway when doing generalized analysis of any subject, there is justification and perhaps we will be discarding just the stuff that didn't intersect anyway. I'm not sure. What is actually getting dropped is references to other vertices. I at present can't say which vertices those would be.

When the model is coded into a recursive engine, it is my hope that each successive iteration brings a reduction in complexity until only one vetex is left perhaps, at which point it can stop. If such is the case, The entire process* will have constructed a hierarchial tree of connectivity from the bottom up. By using pre-order vs post-order ouput display we can choose how we want our presentation of the data to printout - top down or bottom up.

There is also the question of whether or not we want to sort and skim after the first step. Is it really necessary? The dropping of a star down into vertex status already drops info. This will have to be seen. Our second order pairing is a kind of sort also.

An other possibility is that the graph will stabilize into a repeating pattern. This could be useful also and present a sort of irreducible structure* of the graph.

The whole model may turn out to be a massive Rube Goldberg device with turning wheels that shuffle the data around and around without doing anything useful. Of course I hope this is not the case.

A problem might be that the succesive weightings spread out among the stars until they are all equal. Or they might steadily increase or decrease. Or even oscillate. It may be necessary to normalize the weights between 0.0 to 1.0 at every stage.

I do not have the theoretical skills at present to answer these questions except by coding the model and letting her rip to see what it does.

Well that's it, the entire S.P.A.N. Model

 

RECAP - SPAN IN A NUTSHELL

1. Get all the pairs between* vertices* and their weights*

2. Sort

3. Skim

4. Reassemble

5. Present analysis* for that stage

6. Construct second order pairs

7. Convert to original* formulation

8. Repeat till satisfied

 

CONCLUSION:

This theoretical treatment was an attempt to clarify my thoughts enough for me to proceed with coding an automated analytic engine. Along the way, I have gained abstraction from the details of my data structures so the technique may be applied to any situation involving pairs of data that have weighted* values. That is why I chose the name Sorted Pair Associative Net for my theoretical model.

***@@@***

PART TWO WILL CONTAIN THE RESULTS OF AN ACTUAL WORKING MODEL AS SOON AS I GET IT CODED ON MY PC ----------------