Explorations in the Garden of Partial Truths
S. Roof
Nov. 2002
Since the beginnings of mathematics, man has modeled mathematical truth after the way he thinks. Boolean algebra, propositional calculus etc. are direct attempts to codify the laws of thought. The advent of probability theory and statistics expanded the notion of truth from discrete absolute true/false statements into degrees of plausibility or likelihood. These notions then gave thrust to designing intelligent systems that were based on machines and/or software. This paper will explore the design of one such system.
I began thinking about truth tables after reading Stephen Wolfram's book "A New Kind of Science". His thesis is the idea that simple systems with simple rules can produce behaviour which is computationally irreducible i.e. they have a mathematical complexity which cannot in principle be fathomed in real time. His primary 1-dimensional cellular automata model in fact has examples which are Universal Turing Machines (UTMs). Apparently the magic of such systems results from the fact that they are recursive. Future states of a cell are a function of the current state of the cell and its nearby neighbor cells. The system thus cycles output back in on itself and we have non-linear feedback. I discovered while reproducing Wolfram's results in my own computer program that the rules in his simple model are the exact equivalent of higher dimensional Boolean truth tables.
I wrote several programs to explore these simple systems. Along the way, I generalized the idea to include arbitrary numbers of inputs and higher base number systems than simple binary power of two. I soon became troubled by two glaring problems. The first was the combinatorial explosion of rule sets as inputs and base are increased. This made modeling anything much more than 3cell-base2 systems problematic. The supposedly simple systems had an astronomically large number of degrees of freedom. This was good news and bad news. The good news was that Wolfram's idea is undisputable. The bad news is that how can we design an intelligent system when we can't come up with the computational resources to represent all the possible rules it can have. This is a not so simple a logistics problem. The hard AI folks use hierarchial heuristics to deal with it, but it takes months of real world interrogation of human experts to develop heuristics. The neural net folks use large arrays of neurodes to squeeze and squash the data into useful pattern representations, but training can take eons and has to be tailored to the problem domain. The statisticians make detailed analyses of global properties of the data but their single case predictive power is lacking.
The other problem was the continuity hypothesis (CH). Wolfram seems to be in the camp that believes that CH is illusory. The problem for me was, in practical terms, how to bridge the gap from simple discrete inputs and rules to useful real world input/outputs which may be fuzzy and seemingly continuously variable. Very few of us reason explicitly with Boolean Logic in our day to day activities. How can we come up with a natural extension to Wolfram's ideas?
Having framed my two questions, I proceeded to experiment with various models. After realizing that I was dealing with n-dim truth tables, things became a bit clearer. Wolfram's rules basically were an exploration in the world of cellular automata as the space of all possible propositional truths. His rules were just propositional forms in a truth table. I will show how this is so below. Before we do this, though, I would like to consider thinking inside and out of the box.
When solving problems it is often useful to step back and view the big picture. Just what would an intelligent system look like? A most apt picture is of a black box. The Chinese Box thought experiment makes a nice example. We call it a black box because we dont really know what it's doing inside. All we know are the inputs we feed into it and the outputs it returns. The designer might know the internal details, but even he might not know what the box is actually doing at any moment. We might as well say that there is a little person inside who is thinking about the input and making decisions about what the output should be. Imagine a switchboard operator who analyzes the lights flashing on a board as the inputs come in. If our operator is an intelligent system, there will be some set of rules or system for determining the output. If the system is designed well, it might be used for driving a car, diagnosing medical issues, estimating financial markets etc. At any rate, this black box view is the starting point that gives a general view of intelligent systems.
Now lets see if we can place some constraints on this picture. First, we want our box to have some degree of regularity or determinism. We do not want random outputs in most cases. It is our belief that intelligent planning and purposeful behaviour should be ingredients of useful interaction. This constraint means that our design should require that the same inputs always produce the same outputs. Later we might wish to relax this constraint and allow for novelty and experimentation as these are properties of intelligent behaviour also. At the beginning of the design process, though, it is useful to consider only strict deterministic functions. This is a many to one mapping. Many different inputs might produce the same output, but for any one given input, we always want it to produce the same one output. This is the conventional definition of the dependant range of a mathematical function across some domain of independant variables. If we knew a suitable mathematical function, we could replace our intelligent operator with it. To be blunt, we want our black box to be functional.
What other constraints are useful? Well, we want the number of inputs and outputs to probably be fixed at first. Our mechanical engineer needs to work up the wiring diagrams and has to know how many wires lead inside and out of the box. If we keep changing the outside schematics, we'll never get anywhere on the inside with the AI boys. Therefore let us assume that our black box will have m inputs and n outputs. We'll keep open the possibility that in the future we might want a system that adaptively rewires it's connection to the outside world but for analytical purposes it is best to stay with this constraint.
Now how would an intelligent operator deal with the inputs coming in? Well, it would be useful to be able to see patterns in the inputs or at least break the input signals down into levels or ranges. If we quantize things from the start, certain simplifications come into effect. We can arbitrarily quantize the input into discrete values and come to some interesting conclusions. This also makes noticing patterns simpler because they can be counted, classified and labeled.
Let's start with a simple case where the inputs are electrical signals that are clamped to binary on/off states. One input to our box might be connected to an external rain detector sensor. Another might be connected to an alarm clock. Input A fires a binary '1' signal if it is raining. Input B fires a '1' when the alarm goes off. Our intelligent agent inside the box sees the input pattern and knows that if it is raining and the alarm sounds, then he should output a message to bring an umbrella to work. This input/out sequence can be framed as a propositional truth - if A and B then C. The (AND) is a descriptor of the particular functional form. In general, with binary inputs we can construct what is called a truth table that lists all possible combinations of inputs and the desired output or truth value given the input propositions. Each input counts as a propositional statement with a certain truth value. The truth table as a whole is a logistical accounting in discrete form of the behaviour of that particular function. Different logic functions will have different entries in each slot of their own unique truth table.
It should be pointed out that nothing is known about the actual real truth or falsity of the inputs. We can only take them at their face value if we are an intelligent agent operating inside the box. We of course are free to make deductions given the behaviours of the inputs if we have that ability.
Now imagine things get a little more complicated. Our intelligent agent in a box works at a nuclear power plant. The lights on the board are emergency signals of a problem or situation that needs adjustment. Say there are only 3 input lights. The authorities have thoughtfully provided a book of tables to tell the operator what to do in all cases or combinations of lights such as up the coolant, withdraw the rods, etc. in response to the heat and power overload lights. Since there are 3 binary inputs there are 2^3 = 8 possible situations in the book.
But soon the plant adds radiation detectors etc. and ups the input sensors to say 16. Now there are 2^16 = 65,536 scenarios to deal with. The book is now very heavy and it is hard to find the right solution in time when an emergency is present. Only 16 lights and the plant is in trouble! Worse yet, every decision is binary and doesn't cope with faulty sensors or lights that just flicker. No warning is given either as certain values build up slowly to threshold into an emergency. The plant hires more operators and becomes a bureaucracy. It even needs a librarian to distribute and manage the lookup tables. The truth in every situation becomes a quagmire of uncertainty and the failures are brittle and catastrophic for the plant. I think this is the major problem with all bureaucracies today.
When the plant attempts to become safer by upgrading its control over the whole process, the lookup table books become too large to read or manage. The system goes into chaos and the plant is shut down (hopefully!)
Well there is a definite problem with discrete combinatorial mathematics. It simply cant deal with infinities! But who can?
Let's look at the neural net approach first. An interesting fellow named Kolmogorov came up with a mapping theorem which says that given a system of m inputs and n outputs, we can perform any possible map between the two with only 2m+1 tranfer nodes in a layer between the input and output nodes. Each node transfers activity as an ansatz sum over all inputs and sends it through a final transfer function. The catch was that the transfer function is unknown. This indicated possibility of a finite solution to a potentially infinite problem but didnt help too much. Then another fellow came up with a theorem that said using 2 hidden middle layers and the convential sigmoid transfer function, you could do any mapping. The problem then was that the number of nodes needed became potentially infinite.
The above situation has led neural net researchers down the road of making any number of net models which perform well but only for certain problem domains. Hodgepodge hybridization and a morass of theory are the result. The approach definitely has a grip on the problem, but it is a tough one.
It seems to me that the only real solution has to be symbolic functions. As Wolfram has shown, when confronted with irreducible complexity, the best you can do is occasionally notice regularities and create partial solutions. This gives me impetus to lead into my idea. Let's think about partial truths a bit. Suppose we do a least squares linear fit to some set of data points or perhaps we come up with some frequency distribution to describe some data. We are in effect replacing the real inputs with a reasonable fascimile that is a functional image of those inputs. This symbolic shorthand vastly reduces the storage space/and or time needed to deal with those inputs. The catch is that it is only a partial representation that seems to match what's going on and is thus only a model. It is not a statement about the inputs so much as it is a statement about what we think are the relations amongst those inputs. It is statement about our degree of ignorance. It relies on hoping the regularites or correlations we noticed do indeed have some causal origins and will repeat these structural relations in the future. I will be calling such estimates partial truths. They are not the whole truth. They are only our best guess estimate.
Continuous functions have the benefit also of being able to handle real number values. The inputs do not have to be quantized to function. They may have to be normalized, thresholded, truncated or otherwise squashed though. Let's now look at what a truth table that can handle continuous inputs would be like.
Examine the truth table for the exclusive-or (XOR) problem. If A is true (A=1) or B is true, then A xor B is true but not if both A and B are true.

This is a good example of a simple binary logic function which is called non-linearly separable. This means that you cant separate the truth solutions from the false ones by a straight line (which is a partial truth remember). Truths in this case dosn't clump together in certain regions. This makes it difficult to find algorithms to separate the true from false. Simple first order backpropagation neural nets take thousands of iterations to duplicate this function and require a hidden layer to do so. A single higher order threshold logic node as implement by C.L. Giles at NEC can do this in one training cycle. Higher order units have their own problem with processing overhead, though, when the inputs are increased in number. It is the same combinatorial logistics problem again. Maybe there's a more direct approach. Let's look at just using the truth table itself.
Four training inputs are all that's required if the table itself is the node function. This would be a really fast processor and extremely simple to train, but the combinatorial explosion problem is still present. What if we could replace the truths with functional ones i.e. partial truths? What would the function look like?
Let's lay the table over flat and view it in 3 dimensions as a 2 dimensional histogram. We have input 4 training vectors each of which represents a truth in the table. They are:

Now suppose we were doing this with real number continuously variable inputs for some function that was unknown or maybe an approximation to the XOR function. We should expect that the values for say the '01 1' truth slot should cluster around unity and the histogram should be something like a bell curve (hill in this case) This is the so-called normal distribution. Here is a plot of this curve in two dimensions.

The formula for the normal distribution is:
(1.0/(sigma*sqrt(2.0 * pi)))* exp(-(x-mean)*(x-mean)/(2.0*sigma*sigma))
where sigma is the standard deviation about the mean. If we remove (1.0/(sigma*sqrt(2.0 * pi))) from the front and just replace it with some weighting value say 'w'. The curve now has a peak of w. At present we are not concerned with the actual value of the mean so we can replace x - mean with 'delta'. Let 's' be sigma. This abbreviated version is:
Partial Truth: t = w*exp( -(delta*delta)/(2.0*s*s) )
The normal distribution represents a best guess functional replacement for some unknown random variable distribution. If our intelligent agent in a box can use this function, he can save a lot of time in resources. But how would he accomplish this?
Let our agent start off tabula rosa i.e. an empty table. The table in general for n inputs will be a n-dim hypercube with corners at the -1.0 and 1.0 extremes of each dimension. We will be using those numbers instead of the range 0.0 to +1.0 for reasons that will become apparent later. Also we might allow values to range outside the box. At any rate, just drop the box and think about a n-dim vector space. Increase the dimensions to n+1 and encode the truth values at the last component of each vector. We can now represent partial truths as simple points in truth space. Implicit to this will be that the point really represents a gaussian normal distribution surface in n dimensions. Also we will have to explicitly code the weight 'w' and shape factor 's' of the surface.
Since our agent is tabula rosa we have to train it. We do this by adding truth vectors (training vectors) to the table. An example would be baby A has the toy 85% of the time, baby B has the toy 23% of the time and the truth value is say 0.95 for the level of crying. The truth training vector we wish to add is T = {0.85,0.23,0.95}. We can assume a weight w of 1.0 and s = std dev = 0.35. As drawn above, s = 0.35 gives a curve that flattens to zero at +-1.0 delta away from the peak. Using this value goes along with our thinking about using a best guess estimate. This value keeps the distribution shape more or less confined to the imaginary discrete boundaries of truth table slots that convey every possible discrete binary Boolean logic. As training progresses, our representation of the data might broaden or narrow the curve. In fact, if we think about it, a problem that resolves into sharply discrete logical form will probably have this shape but real world problems might be more smeared because of other causitive factors not encoded into our sensor input.
Ok, we added a truth. We cant just keep on doing this as the logistics problem will ultimately defeat us. We have to actually do some creative reduction here. The reduction is to realize that our partial truth is representative of many inputs. The problem then resolves into how, where, when etc. we add new truths. We dont want just one broad statistical sum of all inputs. Some inputs just dont belong to a particular partial truth position. They have to be lumped into another slot. Well how do we manage this? This is not an easy answer, but we can use plausibility arguments to manage a reasonable solution. We will pursue this in detail later.
Once we have trained our agent we still have to study the mechanism of how to get answers from it. In general we will end up with some number say m of partial truth vectors. Now m may not cover every possible case but it might not have to. The number of boolean truth slots goes as 2^n where n is the number of inputs, but does every problem need that many partial truths? I think not. The XOR problem above actually only needs 2 truth vectors because zero can be assumed as the ground state (tabula rosa) of complete ignorance. We only need to represent truths as they occur. But what is our transfer function to get the final output?
It is convenient to think about all the truths as adding up to one single n-dim functional surface. Once we know where the input vector lies on this surface we can transfer the n+1 dimensional height of this surface to output.