Linear Ansatz
and
The Representation Problem
With Cellular Automata
S.Roof, 8-11-02
While reading and implementing some of the ideas in Stephen Wolfram's book "A New Kind of Science", I uncovered a problem that threatened to shake my confidence in his theory. This egregious problem was especially bad because Wolfram's basic cellular model was mathematically unassailable. After becoming obsessed with this for some time, I eventually had a major breakthrough and all is well again with the new science. Before revealing my discovery, I will explain how I got there first.
The problem originates when Wolfram shows how the rules for a simple 3cell 2state automata are implemented. The number of neighbor patterns varies as the base 2 to the power of the number of cells. The number of possible rules in turn varies as the base to the power of the number of patterns. The number of rules thus varies as an exponential power to the the power of i.e. second order exponentials. These numbers get cosmic in size very quickly. Since the represention for any individual rule, binary or otherwise, requires a certain number of bits determined by the number of possible representations, we have a problem.
For Wolfram's simple system, we have 8 patterns and 256 rules. If we go to 2 dimensions and 5 cells with base 3, we have 3^25 patterns and 3^(3^25) rules. This is about 100 to the billionth power.To represent just one rule in that system would require huge computer resources. All that for a really simple little cellular system with only a very few cells. This is nowhere near to the possible numbers that a natural system might suggest. The rules would quickly outnumber all the atoms in the universe.
Part of Wolfram's theory is that all natural systems that display complex behaviour can be modeled by such simple cellular systems. Analogy is made between discrete natural states as computations and computations of cellular states. This implies to me that ontologically speaking, one is making a direct correspondence between natural states and cellular automata states. It is equivalent to saying that a natural system is a finite state cellula automata. Cells literally represent local discrete states in the natural system being modeled. But what about the rules?
It seemed to me that the rules are just as much a physical part of the system being modeled as are the cell states. This fact is hammered in on one if he actually writes a computer program to perform the evolutions of a cellular automata. Representation takes space and this space is just as real as so called real space so far as the computer is concerned. Why should it be any different in natural physical systems? To say that the rules are not embedded within the system itself is to say that they originate outside the system. One can make this distinction in his mind and in his model also, but can one really say this about natural systems?
If the rules are assumed to reside abstractly and independent of the natural system, then are we not toying with concepts like the 'ghost in the machine' or metaphysical all-knowing manipulations that happen outside real time? One could argue that this is just a model and nature has here own way of transcending our model. I believed that this would be a weak cop-out to the problem. The problem still remained for me as a programmer regardless of what nature was capable of. If I was correct in my assesment, though, either Wolfram's theory was a sham or nature herself was in deep trouble. How on earth can the rules for a system be immeasurably bigger than the system itself? What happened to the simple rules idea? Something was terribly wrong.
What to do? Wolfram in part mentions the tip of the problem and begins using what he calls totalistic rules. They are rules derived by limiting the number of patterns. Each pattern is reduced to a total or average neighborhood value. This works only initially, I discovered. The rules themselves still vary as the base to the power of number of patterns. Eventually exponential growth overtakes his totalistic scheme also! There is another aspect to this also. Totalistic rules throw away all the spatial information embodied within a pattern. One can still get class 4 complex behaviour but in 2 dimensions it becomes obvious that one has created a fundamental symmetry and orderliness into the behaviour. We have 'thrown the baby out with the bath water'. for this reason, I regard totalistic rules as a gross simplification and a bit of a kludge.
I was flumoxed by this conundrum. Wolframs initial positional digit scheme has to be taken as an absolute model of all possible cases. There is no way to get around this fact. Totalistic reduction is like lipposuction to the extreme. There had to be some sort of compromise, but how? Several key ideas led me in my search for a solution.
I needed some way to preserve spatial information in the pattern but I also needed a way to limit the representation space for rules. As things stood, the rules were an abstraction separate from the cellular states. I would have liked for them to be somehow similar enough that I could embed them in the system itself if possible. Early in my programming, I noticed that there was a similarity between image processing algorithms and cellular automata. I also began to visualize the whole problem more symbolically or abstractly.
The basic form appeared to be that the next state of a cell was simply a function of multiple inputs that returns a single value. This reminded me of neural nets composed of logic functions. I toyed with this idea without much success. It seemed too contrived. How could I construct such a system without being structurally arbitrary (ps. I did manage to do precisely that with my generalized logic functions and without being arbitrary! That was weeks later though) ? I also thought about pattern recognition - something that Wolfram and I think is eminently possible with cellular automata. This led me to thinking about filters and vector products between filter and neighborhood. Bingo! The neural net idea did have possibilities. The paradigm was not to look at a net but at a node itself and how it's transfer function works.
The original and still highly useful basic form of gathering neurode inputs is a mathematical construct called a linear ansatz. This is nothing other than the sum of a series of products. McColloch & Pitts original work on their Perceptron in the 40's used a simple weighting scheme for each input. The net input signal was given by the sum of Wi*Si where Wi is the weight for the ith input and Si is the signal input itself. The output was given by a step function applied to the aggregate signal. Their work showed that such a method is cabable of linearly separating all the logic functions. Variations on this technique are found in almost all artificial neural net models even today. The method seems to give the ability to generalize in part because a summary value is taken. Spatial information is not thrown away either. It is embedded by multiplying in the weighting factors. The icing on the cake is that the number of weights exactly matches the number of inputs. I became very excited about the possiblity of using this scheme to evaluate cellular automata.
One can view the above linear ansatz construct as nothing more than the dot product of two equal length vectors in vector space. The above scheme would then simply be the product of a weight vector and an input vector. Much work has been done resolving neural net domains with this approach. The idea was very attractive to me because I could simply let the input vector be the cellular neighborhood pattern and the weight vector be the rule. The rule would serve the purpose of a filter slapped onto the neighborhood. It would always be exactly the same size as the neighborhood. I repeat. It would always be exactly the same size as the neighborhood! This was exactly what I was looking for - a method that incorporated spatial info and limited the rule size.
The catch was, could I make it work? I was very unsure about this. The implementation was simple to program, but several false starts and programming errors kept me on edge. Eventually I got something that had simple behaviour every now and then but nothing like nested or complex. I was discouraged but kept at it. Part of the problem was that neurode models generally assume continuous inputs represented by floating point numbers while I was using a simple unsigned character scheme. I considered reprogramming with real number vectors and doing normalization etc., but that would be a large rewrite and it seemed to head away from the idea of simple rules to begin with.
I decided to stay with my simple fixed byte scheme for representation, because the whole idea was to keep the rule space identical to the cell state space. Throughout this process I was bothered also because I really didn't have a name for my idea. I hadn't run across the term ansatz at this point in my search. I stumbled across it while I was researching neurode transfer functions in some of my books. Looking the word up, I discovered it was a German term meaning start, estimate, added piece, or way. This seemed very appropriate. It reminded me of 'ersatz' but ersatz means 'inferior substitution' so I veered quickly away from that one.
One of the reasons normalization is required on the vectors has to do with escalating values. In my byte valued system, adding a sum of products quickly overflows the values a byte can represent. How could I renormalize discrete unsigned number values? I tried dividing out the vector length but that doesn't quite do it. Then I hit upon the idea of simulating normalization by subtracting the value of the halfway point for each input. This would give the possibility of negative values and inherent cancellations. My inputs could then appear as positive or negative as required for the usual model. (I subsequently found this was unnecessary if a linear interpolative squash was used)
I still had to be able to manage possible values larger than a byte though, so I used a temporary variable of size 4byte integer. This gave me ample room, but I still had to somehow squash it back into the size of a single cell state value. Did someone say squash? as in Squashing function? As in Sigmoid. This is the old neural net technique - apply a transfer function to squash the values at their extremes while maintaing relatively linear output throughout the midrange. Sometimes an extreme case of a sigmoid, a step function, is used.
This turned out to be the answer. I resisted at first, because it involved using a continuous domain higher math function, but once I tried it I was rewarded. For my function I picked an exponential sigmoid of form 1/( 1 + exp(-x) ). Sigmoids by the way have simple 'S' shapes that flatten out at the extreme negative and positive ends of its domain. They thus squash outputs to lie within specific boundaries. My ansatz model began to occasionally display more interesting patterns, but I still couldn't seem to get more than class 1 and some class 2 repitition. No nesting, random, or complex behaviour appeared. Something was wrong.
Then I realized that I had to tune the steepness of the sigmoid to fit the parameters. I added an adjustable parameter to the exponential and valla! Almost immediately I began getting the most incredible plots of every type of behaviour Wolfram describes. In fact, if the parameter was tuned well, the number of rich complex plots became much more frequent than the population ratios of class4 for the positional digit model! For example, the 3cell 2state model has only one certain class4 rule (#110). Finding them is a real art. My ansatz model actually seemed to rev up the complexity in spite of the fact that all rules were now constrained to have exactly the same size and number as the neighbor input patterns themselves. This was an awesome revelation.
Before I got on my high horse, though, a sobering reality set in. I still had no way to number the rules and there were aspects of the sigmoid transfer function that bothered me. Since I was using bytes for cell values, it occured to me that in effect I was using a base 256 system instead of base two. Rule numbers would thus vary as 256 to the power of number of cells. I was back to big numbers again. My final solution for cases where the rule number exceeded 32 bit integers was to simply seed my Rand_CA30() generator with a number and generate the actual rules a byte at a time. That particular seed would then always generate the same rule and could be used to identify it. This hashing solution worked well and I still could use the actual direct representation for rule numbers less than 2^32. Keep in mind that I was still way ahead of the game.
Even at those large rule numbers I was still keeping the actual representation of my rule to a simple byte array no longer than the neghborhood cell count. A very manageable setup, while the positional digit method of Wolfram's would require explosive space. Mine would always be a one-to-one map. The demands for rule space would never exceed the demands of the cellular grid itself. In fact, on a 1000 x 700 display of pixel cells, my rules only add up to 3 pixels for a 3 cell neighborhood. Compact rules. Simple rules.
My next challenge was to generalize the program and methods to be able to vary the radius, dim, base, and evaluation mode in all possible configurations. Once I had done this, I could vary the number base and cell count for my ansatz rules and compare results against positional digit and totalistic rules. My ansatz method continues to give results even when the other two methods overload. Since Wolfram's original simple 3cell 2state system yields all types of possible behaviour, I was particularly interested in how far I could push the ansatz rules to this degree of simplicity. There are only 8 ansatz rules for this case (as opposed to 256) and they all give class1 behaviour. How large a base did I have to go to before results started kicking in?
Not very far it turns out. Base 3 began yielding class2 behaviour. Base 5 appears to be the critical edge at which class4 emerges from ansatz rules. Base 6 appears to give the full complement of Wolfram's rules. This makes sense because the number of ansatz rules at this level is 6^3 = 6*6*6 = 216 i.e very close to 256. There have to be enough cell states (i.e. number base) for the sigmoid and normaliztion to operate on and distinguish. Anthing above this level merely adds to the richness of behaviour. The fundamental classes of behaviour and hence degree of complexity are natural bounds, beyond which not much more is added than frills or escalades. This is a universal principle according to Wolfram.
My ansatz rules accurately reproduce every single one of Wolfram's rules. They do this with an added bit of richness but the same identical forms emerge. I was in rapture to discover this. I had done it!
.... Well, sort of