Copyright (C) 2004-2013 by Anton Treuenfels
This document briefly describes the internal design and implementation of HXA.
It attempts to highlight what someone who wishes to understand (or modify) HXA's source code might fruitfully pay attention to. It does not attempt to teach anyone how to program.
Programming and Design Influences
Advanced C Programming for Displays
The AWK Programming Language
Writing Interactive Compilers and Interpreters
I'd like to extend my appreciation to the many people who have written assemblers before me and posted their work on the Internet. Although I often could not follow the source code (it's hard to do without printing it all out!), I learned much simply by reading their documentation. Their feature lists broadened my ideas of what is possible, and their history summaries often pointed out areas that especially benefitted from extensive testing.
HXA's documentation exists in several parts:
To use HXA requires only the first, plus perhaps one or more of the second. To fully understand how HXA works requires the last three.
It must be said that writing documentation is fairly tedious. However, simply creating it helps to make HXA better, often enough that the process might be worthwhile even if the result were never used for reference.
Sometimes when writing a description of what a pseudo op is supposed to do, the question comes to mind "Does it really?", which leads to a new or revised test to determine whether it does (and if not, to find out why). Sometimes describing a limitation leads to the question "Does it really have to be that way?", which in turn leads to finding a way of eliminating the limitation in question.
The test suite contains many more tests to determine how HXA reacts to an error than how it reacts to legal input. Part of this is simply that there are more ways to err than there are ways to do things right. Part of this is because illegal input more often causes HXA to do something unexpected. Figuring out why often turns up an overlooked possibility. HXA may then be modified and the test suite expanded to account for that possibility.
Buried in some of the tests are comments on why things do or do not happen. There are also occasional tricks used that HXA is capable of, or a useful macro definition here and there.
Bug Fix - Modulus Operator
Bug Fix - Erroneous Nested IF Evaluation
Bug Fix - Byte at Highest Processor Address
Bug Fix - Unrecognized LIST-- Option
Bug Fix - Escaped Backslash
Bug Fix - Escaped Double Quote Mark
Bug Fix - Leftover Debug Code
Bug Fix - Source Line Count
Partial Bug Fix - Unbalanced IF Blocks Within Block Expansions
Bug Fix - PADTO Pseudo Op
Bug Fix - Listing Long Source Lines
Bug Fix - Labels In Nested Repeat Control Expressions
Bug Fix - Object Filename Extension
Bug Fix - DEA and INA Instruction Aliases (65C02 and Above)
Bug Fix - ONEXPAND Error Message Text
Bug Fix - STR$() Function
Bug Fix - Source Line in Error Messages
Bug Fix - Source Line in Error Messages
Bug Fix - DS Pseudo Op in Segmented Programs
Bug Fix - SEGOFF() Function and Uninitialized Segments
Bug Fix - More than 64K Object Code in Intel Hexadecimal Output Files
Bug Fix - First Character of Records in Intel Hexadecimal Output Files
Bug Fix - Numeric Values of Zero and One in STRING- Pseudo Ops
Bug Fix - Logical '&&' and '||' Short Circuiting
Bug Fix - FILL and PADTO Pseudo Ops with No Value Argument
Bug Fix - C-style Escape Codes in Char and String Literals
Bug Fix - ECHO Pseudo Op with No Argument
Bug Fix - Commas in Regular Expression Literals
Bug Fix - Record Counts in No-Header Motorola Hexadecimal Output Files
Bug Fix - Output Files Created Despite No Data Generated
Bug Fix - Relative Branch Offset Values
Bug Fix - W65C816S Incorrect Op Codes
Bug Fix - Local String Labels More Than 10 Characters Long
Bug Fix - Defining Macro Names Ending in Colons
Bug Fix - Extended Address Records in Intel Hexadecimal Output Files
Thompson AWK (TAWK) is a compiled variant of the normally interpreted AWK programming language.
HXA is written in version 4 of TAWK. This was available for MS-DOS, OS/2 and Unix. The MS-DOS version of HXA is compiled with this version.
The last version of TAWK was version 5. This was available for MS-DOS, Win32 and Unix. The Windows version of HXA is compiled with this version. However the source is exactly the same as the MS-DOS version. No version 5-specific features are used (and some "features" are explicitly avoided!).
Unfortunately TAWK is currently withdrawn from commercial distribution by Thompson Automation.
The main performance difference between the two versions is that the Windows version can handle much larger volumes of source code. The MS-DOS version is limited to around forty thousand code generating expressions. The limits of the Windows version have not been fully established, but are at least four hundred thousand (and likely to be limited only by the total amount of memory available).
The source code of an AWK program resembles a program in the C programming language. Indeed, many of AWK's operators and their associated precedences and behaviors are identical.
However there are important built-in features AWK provides to make text manipulation easy that are not found in C:
There is no type declaration in AWK.
There isn't even any need to declare a variable, although it's usually safer (and better style) to do so. Merely using a variable name in an expression causes it to be created with an appropriate type.
If a variable value does not have the appropriate type for the expression it appears in, it is automatically converted (if possible). Variables which have never had a value assigned to them have a default value (and type) of "unknown". The "unknown" value converts to zero or the null string depending on context.
The same variable can hold different types at different times.
Functions do need to be declared and defined, but they do not have to return the same type from every possible exit point.
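A short, runnable illustration of these rules (using ordinary POSIX awk rather than TAWK; TAWK's "unknown" type behaves like the uninitialized value here):

```shell
awk 'BEGIN {
    # "x" has never been assigned: it converts by context
    print x + 1            # numeric context: 0 + 1 = 1
    print "[" x "]"        # string context: the null string, so []
    v = "42"               # v holds a string...
    print v * 2            # ...converted to a number on demand: 84
    v = v "!"              # ...and the same variable now holds "42!"
    print v
}'
```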
A regular expression pattern is a concise notation used to specify text matches. Regular expressions are widespread and are used in compilers, text editors and programming languages.
HXA makes extensive use of regular expressions to parse input lines.
Associative arrays are indexed by arbitrary strings, which may be but are not required to be consecutive integers. Many tasks which are difficult in conventional programming languages are thus trivial in AWK.
For example, entering a label into a symbol table can be as simple as direct assignment:
symbolTable[ label ] = value
To check whether there is a conflict with an existing name in the table, the keyword in can be used:
if ( label in symbolTable )
error( "Duplicate Name" )
symbolTable[ label ] = value
The in keyword can also be used to walk through all elements of an associative array:
for ( ndx in symbolTable )
print "Value of " ndx " = " symbolTable[ ndx ]
For arrays whose indices are consecutive integers, however, it is much faster to use a C-style FOR loop:
for ( i = firstndx; i <= lastndx; i++ )
This is because by default the "in" keyword causes TAWK to sort array indices into increasing order (AWK makes no order guarantee at all). Not using "in" skips the sort, which makes the loop faster.
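These idioms can be combined into a self-contained sketch, runnable with any POSIX awk (the names are illustrative, not HXA's actual code):

```shell
awk 'BEGIN {
    # direct assignment creates entries on first use
    symbolTable["start"] = 512
    symbolTable["loop"]  = 515

    # duplicate-name check with the "in" keyword
    label = "start"
    if (label in symbolTable)
        print "duplicate name: " label

    # walk every entry; iteration order is unspecified in POSIX awk,
    # while TAWK sorts the indices first (which is what costs time)
    for (ndx in symbolTable)
        print "Value of " ndx " = " symbolTable[ndx]
}'
```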
A subtle point is that the elements of an associative array do not all have to have the same type. Elements of a single array can be of any type supported by TAWK, including other arrays (standard AWK does not support multi-dimensional arrays). Beyond this, not every dimension in a multi-dimensional array has to have the same number of elements.
A TAWK array can be passed to functions and results returned in that array. A subtle point here is that this is apparently (ie., it is not documented) allowed only one level deep. That is, a function which receives an array as an argument cannot successfully pass that array along to another function. In a few places this is inconvenient and leads to a slight awkwardness in HXA's source code.
Possibly what is happening is that what is passed is a pointer. Thus the first-level function receives a pointer to an array, but the second-level function receives a pointer to a pointer to an array. TAWK does not have an explicit pointer de-referencing operator which would clarify this, um, point.
In AWK and its variants variables of any required type are automatically created when necessary and deleted when no longer in scope. The delete keyword can be also used to erase variables under program control.
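For example (gawk shown; TAWK's delete behaves similarly):

```shell
awk 'BEGIN {
    cache["a"] = 1; cache["b"] = 2
    delete cache["a"]                      # erase one element
    print ("a" in cache), ("b" in cache)   # 0 1
    delete cache                           # erase the whole array
    n = 0
    for (k in cache)
        n++
    print n                                # 0
}'
```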
The MS-DOS version of TAWK 4 is built with a 16-bit DOS extender that allows direct access to up to 16 megabytes of memory. The extender also implements a virtual memory system by paging to disk if more is needed.
Although the total amount of available memory has never been a concern to HXA, it appears that the MS-DOS version of TAWK 4 has a global upper limit of about 150-250K elements total for all arrays used. Multi-dimensional arrays also appear to require "extra" elements, presumably used for internal housekeeping, that count against the limit.
Successive versions of HXA have internal changes designed to maximize the size of user programs that can be assembled successfully. The pace of these changes may slow as it appears HXA can be compiled with the Win32x version of TAWK 5, which does not have the memory limits of MS-DOS TAWK 4.
The TAWK variant of AWK offers some features not found in most (if not all) other AWK variants. Two in particular may cause difficulty in porting HXA to one of them.
First, TAWK provides true multi-dimensional arrays. Second, TAWK allows assignment of any legal type to any variable at any time.
HXA uses both these features. Its code includes two-dimensional arrays and assignment of regular expressions to variables. Work arounds may have to be found for them in porting to another AWK variant.
There are many desirable traits any computer program should have, some of which compete with each other. Of the ones which the design of HXA consciously considers, they generally rank in the following descending order:
HXA groups its source into several files, each more-or-less dedicated to providing a single kind of functionality. Inter-module communication is strictly limited to function calls and reading variable values. Writing directly to another module's variables is not allowed.
The naming of HXA's functions and variables is designed to provide manifest scope. That is, it should be possible to tell by looking at a name what source file it is defined in and who is allowed to access it.
Function Name Scope
Variable Name Scope
Note that all global functions and variables declared within a single module share the same capitalized prefix. Thus whatever file a global name is found in, it should be possible to immediately tell which file it is declared in.
TAWK itself provides only local and global qualifiers that may be (and are) applied to the declaration of any name. The conventions described above are not enforced by TAWK, but are imposed as a programming style.
Initializes the CPU and supervises translating source code and data into object code.
This is the only "generic" (ie., non-processor specific) module that deals with native byte size and orientation. Only this and the "native" (ie., processor-specific) a_ins---.awk module need to know this information.
As part of CPU initialization a cpu descriptor is retrieved from the "a_ins---.awk" module. This describes the program counter width, the native byte size and the orientation of multi-byte values. From this "a_codgen.awk" initializes the program counter width, determines which "-BIT--" pseudo ops will be available, and sets their aliases.
The fundamental data structure HXA currently uses for code generation consists of three elements:
The first pass of HXA creates an array of these structures called the code storage array (although it could be more accurately called the data storage array).
Note that some numeric data values may not be fully resolved during the first pass.
Also note that only [type, value] pairs are explicitly stored. The address element is handled transparently by HXA, and may not actually be present in every element. However if needed the actual address of an element without an explicit address can always be calculated from elements with them.
The second pass scans through the code storage array to fully resolve all data values and verify that any ranged values (as specified by the data type members) are correct. Note that this is all the second pass does; there is no actual code generation at this point.
Only if output files are specified does HXA do any further manipulation. The data address of each element is used to determine where in the output sequence it will appear. The data type of each element is used to determine which bytes of each data value to extract and in what order. Numeric values outside the range of a signed 32-bit integer are reduced to values within that range in such a way that the relevant bit patterns are unchanged.
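The flow can be sketched as follows. This is a hypothetical illustration in ordinary awk, not HXA's actual code; the names, the "word" type and the little-endian 8-bit-byte output are all assumptions made for the example:

```shell
awk 'BEGIN {
    # pass one: append [type, value] pairs to the code storage array
    codeType[++top] = "word"; codeValue[top] = 4660    # 0x1234
    codeType[++top] = "byte"; codeValue[top] = 255

    # output: the data type selects which bytes to extract, and in
    # what order (low byte first for a little-endian CPU)
    for (i = 1; i <= top; i++) {
        v = codeValue[i]
        if (codeType[i] == "word") {
            printf "%02X ", v % 256          # low byte
            printf "%02X ", int(v / 256)     # high byte
        } else
            printf "%02X ", v % 256
    }
    print ""                                 # emits: 34 12 FF
}'
```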
The change to store only some address values rather than all of them was designed to allow the MS-DOS version of HXA to assemble larger programs. In practice, though, the improvement turned out to be negligible.
Miscellaneous functions and variables used by more than one module, or too small to yet warrant their own module.
Handles expression conversion and evaluation.
Expression conversion to Reverse Polish Notation (RPN) form is done in two major parts.
First, every expression (and sub-expression) is converted to RPN by a generalized operator precedence parser. This type of parser is bottom-up rather than top-down. The parser accepts expressions of any legal type, using a state table to guarantee that they are syntactically correct.
Second, at the end of each expression (and sub-expression) a check is made to ensure that its type is correct in context. At this point some expressions may be "coerced" to the proper type. For example, string expressions may be converted to numeric by adding an operator to compare them to the null string.
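The bottom-up core of such a conversion can be sketched with a minimal operator-precedence ("shunting-yard") converter in ordinary awk. This is a generic sketch, not HXA's parser, which is considerably richer: it also drives a syntax state table and performs the per-(sub)expression type checks described above.

```shell
awk 'BEGIN {
    # precedence table for a few binary operators
    prec["+"] = 1; prec["-"] = 1
    prec["*"] = 2; prec["/"] = 2

    n = split("2 + 3 * 4 - 5", tok, " ")
    top = 0; out = ""
    for (i = 1; i <= n; i++) {
        t = tok[i]
        if (t in prec) {
            # pop anything of equal or higher precedence first
            while (top > 0 && prec[stack[top]] >= prec[t])
                out = out stack[top--] " "
            stack[++top] = t
        } else
            out = out t " "          # operands pass straight through
    }
    while (top > 0)                  # flush remaining operators
        out = out stack[top--] " "
    print out                        # prints: 2 3 4 * + 5 -
}'
```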
The expression evaluator is capable of partial evaluation, resolving whatever sub-expressions it can and saving the rest for later.
Caching of converted expressions has been part of HXA for some time. In this version caching is somewhat less ambitious than in previous versions, in that it does not try to cache every expression nor does it try to update any cache entries after successful evaluation. On the other hand this simpler version seems to retain about 98% of the effectiveness of the previous versions at speeding up repeated evaluations of the same expression.
This version eliminates the string cache with the unexpected result that a test program with a very high hit rate on that cache ("DEMO032.A") became slightly faster.
The key to making the expression parser capable of handling any legal type in the various branches of the ternary conditional is the ability to perform type-checking at the end of every sub-expression. Once this was implemented it was immediately apparent that the idea could be easily extended in many ways. Allowing global names to be specified by string expressions in any expression context is one. Permitting multiple mixed string and numeric expressions as arguments to the STRING-- pseudo ops without ugly code hacks is another.
The main difficulty in getting the logical short circuit operators to accept string operands was figuring out how to "know" that the right hand side still needs to be compared against the null string when the end of expression is reached. The solution turned out to be an accurate state transition table, which suggested where "hints" might be usefully included.
Another concern was how to get rid of the "looking back" into and adjusting the RPN expression occasionally used by v0.163. Again the proper state table enabled this, and v0.170 never looks back.
The handling of type-checking parenthesized sub-expressions, although completely correct, is not yet completely satisfactory. The problem is that any of several operand types may be acceptable at the time an opening parenthesis is found, but it will not be until the closing parenthesis is encountered that the type of the "operand" they enclose will be known. The solution used at present is to save all legal types at the opening and check at the close to see if one was found. Another method might be to expand the state table to account for all possibilities, although desultory attempts at this didn't seem to coalesce very quickly to a "reasonable" table.
The addition of a second "clear stack" "operator" that takes off everything but the ternary conditional seemed an elegant solution to a nasty problem arising from the high precedence of the "compare to null string" operator not clearing off the stack itself.
The low precedence of the unary extract operators in previous versions originally arose from a desire to maintain compatibility with assemblers which had no notion of precedence. This version finally abandons that idea in favor of having all the unary operators behave similarly.
It is interesting to note that the main apparent purpose of a high precedence for unary operators is to limit their "sphere of influence" to immediately adjacent operands. There seems to be no inherent reason unary operators cannot have a low precedence if expanding their effect to a larger portion of an expression is permissible.
The extension of the unary operators to apply directly to string operands came from the notion that applying the logical negate operator to a string ought to work, and ought to result in the value one for a null string and the value zero for any other string. The ability was gained by making a single change in the state transition table in the "u"-nary row - no code change was necessary at all. That all the other unary operators can be also used with string operands is basically a side effect.
The expression cache now purges itself after a certain number of entries. Experience (in other contexts) has shown that it can exceed available memory. The regular expression cache also purges itself, though it is less likely to fill anyway.
The "extraction" operators "<", ">" and "^" are now "byte-size aware". They are implemented as calls to functions in a_codgen.awk , which is actually the only "generic" module that is truly "byte-size aware".
A much-belated (years!) recognition that expression evaluation failures due to bad arguments are really a class by themselves, just as failures due to unresolved forward references are. Divide by zero, out-of-range "STR$()" arguments, strings that should but do not match global names, and un-parseable "FWD()" and "VAL()" arguments are all evaluation-time failures that should be detected, reported and cause the offending expression to be considered "un-resolveable".
These errors actually were always eventually detected and reported, but in an inconsistent and sometimes confusing (eg., irrelevant cascading) manner.
The regular expression cache has been eliminated, as it could not be guaranteed to be in sync with the parsed expression cache unless it was either never purged or always purged at the same time. Regular expression literals are now converted to internal form at parse time. This loses the ability to display a regular expression literal in a parsed expression (usually for internal debug purposes) but does save some code. In demos where regular expressions appear in macros and are thus placed in the parsed expression cache there appears to be a slight gain in overall assembly speed (as with eliminating the string cache).
Currently Supported Variants Source Files
This module (where "---" is replaced by a specific CPU identifier) is called during the first pass to recognize CPU mnemonics and begin evaluation of any associated expressions. The results are stored as data by a_codgen.awk .
Creating an HXA variant which assembles a different language can be accomplished by replacing this single module.
Handles macro definition and expansion. "Macro" is here taken in a broad sense to include repeat and while blocks as well.
Macro expansion is performed by re-reading saved source lines originally read from files. HXA nests expansion blocks by stacking indices into the saved line store. The top index on the stack indicates which line to read next.
Whenever the index stack is non-empty HXA reads the next line from saved store (which is why a file cannot be included from inside an expansion, as the next line read would come from saved store and not the newly included file).
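A hypothetical sketch of the index-stack mechanism (illustrative names, not HXA's code): the top-of-stack index selects the next saved line to read, and popping it resumes whatever was being expanded before.

```shell
awk 'BEGIN {
    # the saved line store, filled as lines are first read from files
    line[1] = "lda #0"; line[2] = "sta flag"; line[3] = "rts"

    # expand a block spanning saved lines 1..2: push its start index
    stack[++sp] = 1; last[sp] = 2

    while (sp > 0) {
        print line[stack[sp]]         # "read" the next expansion line
        if (++stack[sp] > last[sp])
            sp--                      # block exhausted: pop its index
    }
}'
```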
The top-level supervisor.
Unlike all other modules, this one has no globally available functions or variables.
During all phases, reports all HXA status and error messages to the user.
This is the only module that contains the actual text of any message HXA can display. Other modules specify only an index into a table of messages when calling this module. This allows replacing message texts without altering any other module.
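The pattern can be sketched like this (hypothetical names and message numbers, not HXA's actual tables):

```shell
awk '
function report(msgnum, detail) {
    # other modules pass only msgnum; the text lives in one place
    printf "%s: %s\n", msgText[msgnum], detail
}
BEGIN {
    msgText[1] = "Duplicate Name"
    msgText[2] = "Branch Out of Range"
    report(1, "loop")
    report(2, "start")
}'
```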
Tracks HXA's internal program counters and segment numbers. Internal to this file they are tracked independently. Externally they are often combined to form coded values.
A_codgen.awk uses its coded values to maintain the proper order for object code generation. Coded values are used to distinguish between absolute and relative items by a_eval.awk and a_symbol.awk .
Note that no other module ever directly examines or manipulates a value they get from this module. If they need to know something about it or have something done to it, they ask this module.
Processes HXA's pseudo opcodes.
Once this module identifies a pseudo opcode, it collects and verifies any arguments and passes them to a handler. A handler can be located in any module except the top-level supervisor.
The addition of IFDEF and IFNDEF in some ways spoils the ideal of *all* conditions being determined by a powerful expression evaluator. But it increases the compatibility of HXA with other assemblers that prefer a proliferation of --IF-- variants.
The superfluous addition of a LABEL() function that has the same capability as IFDEF gives the same power to the expression evaluator directly, in a small way restoring the original ideal.
These versions of IFDEF and IFNDEF recognize only global numeric labels, largely because it's easiest to leverage just existing evaluation functions. It is possible to increase the types of labels recognized by using other functions to guarantee a legal label name before trying to evaluate it (failure to evaluate then becomes the flag that signals the name does not exist).
The greatest problem is branch target labels, or more specifically the newly introduced ":" (colon) label. Although a legal label, it doesn't have a recognized meaning in the expression field the way a single "+" or "-" does. Some way of handling that has to be added. Not necessarily difficult, but there is something of an "ad hoc" feel to it.
Read and manage user source code.
HXA saves every source code line it encounters. This store is used for macro expansion, error-reporting and assembly listing.
HXA generally saves only those expansion lines which generate data, which are the only ones necessary to create the default listing file.
Manages the symbol table.
An HXA variant can be created by replacing the a_ins---.awk source file, where "---" is replaced by a specific CPU identifier. No other module need be changed in any way.
The primary job of an a_ins---.awk module is to add data [type, value] pairs to the code storage array during the first pass. It is not involved in the second pass nor in any file output.
The specification of a_ins---.awk files details only the public functions that must be provided. How they work is up to the implementer. All that is required is that the inputs and any outputs be correct.
Required Public Functions:
The following public function may be called by this function:
mnemonic, cnt, expr
INSdoop() is allowed to do whatever it wants in order to convert an instruction mnemonic and any associated expression(s) into entries in the code storage array.
The following public functions may be called by this function:
The following functions in other modules of HXA are guaranteed to be available in any future version:
Another possibility for creating an HXA variant is to note that what the assembler deals with after the first pass is essentially an array of [type, value] pairs. This implies that some CPU instruction sets might be implemented entirely as macros which expand to one or more "-BIT--" pseudo ops.
The HXA_T variant of HXA can be used to assemble such macros with the correct program counter size, byte size and orientation.
For some processors the officially recommended mnemonics could be used. HXA expression evaluation is currently powerful enough to alter most expressions following such mnemonics if they cannot be directly evaluated in their original form.
For other processors a variant set might be possible or even desirable. A common reason would be to reduce the effort needed to identify the address mode of a mnemonic and evaluate any accompanying expression by indicating the mode via the mnemonic rather than the expression. Such a variant set might entail a loss of portability, but this might be acceptable in some cases.
As a proof-of-concept, macro include files implementing the official mnemonics of the 8080/85, Z80, 6502, 65C02 and R65C02 microprocessor instruction sets are provided in the general demos.
The 8080/85 mnemonics are the simplest to implement, as the instruction set is fairly small and the address modes are generally indicated by the mnemonics themselves.
The 6502 and 65C02 also have fairly small instruction sets. The main difficulty is that the address modes are often indicated by decorating the expression, so some examination and manipulation are required. The macros have been defined to be compatible with the same instruction set extensions recognized by HXA65. They do not recognize any kind of "address mode forcing", however, which might better be implemented as additional macros anyway (eg., "LDAA" might be defined to always "LoaD Accumulator Absolute").
The R65C02 instruction set is a superset of the 65C02 instruction set and its macro implementation is simplified by "including" the 65C02 macro implementation and defining only the 32 new instructions it adds.
The 65C02 macro implementation does not use this approach even though its instruction set is a superset of the 6502 instruction set because it introduces new address modes for old instructions. This makes it more reasonable to re-define the macros implementing those instructions.
The Z80 instruction set is fairly large and there are a large number of address modes. The macro file provided makes extensive use of nested macros to determine the proper data to emit.
The general result to be noted is that implementing an instruction set via macros has the advantages of being completely portable and not requiring any changes to the HXA_T variant of HXA, and the disadvantages of being slower and requiring a (sometimes much) larger number of source lines than a native HXA variant.
An informal list of enhancements or changes being considered (there is no guarantee any of these will actually happen):
"Case|Switch" pseudo op
Something along the lines of one of these:
    SWITCH (num_expr)        CASE (num_expr)
    VALUE (num_expr)         WHEN (num_expr)
    VALUE (num_expr)         WHEN (num_expr)
    ENDSWITCH                ENDCASE (alias ENDC)
Mainly this offers a slightly cleaner-looking alternative to a series of "if..elseif..else..endif" statements. A possible differentiator would be to make "fall-through" from one branch to the next the default behavior (avoid this with "EXIT"). Some critics say this kind of behavior is a problem for beginners, though.
"Loop..Until" pseudo op
A conditional like "while..endwhile", except that it tests at the bottom of the loop. Not difficult to implement, but it's hard to come up with a name pair that keeps the "name..endname" symmetry of all the others.
Fix known weak points (aka "bugs")
a) operator tokens
- these are currently strings, and expression evaluation fails if a user string happens to match one. They could be made more difficult to accidentally match (by including unprintable characters, say) or their type could be changed (to small floating point numbers, perhaps).
b) string lengths
- the MS-DOS version of TAWK 4 limits dynamic strings to 8000 characters. This is not checked by HXA during concatenation. A simple fix is to check the length of the string operands before concatenation, but this adds overhead to every concatenation to catch a very rare failure (additional overhead, actually, since the TAWK run-time package makes the same check)
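The simple fix described above might look like this in ordinary awk (safecat and its error handling are illustrative, not HXA code; 8000 is the MS-DOS TAWK 4 limit mentioned above):

```shell
awk '
function safecat(a, b) {
    # check before concatenating rather than after failing
    if (length(a) + length(b) > 8000) {
        print "error: concatenation too long"
        exit 1
    }
    return a b
}
BEGIN {
    print safecat("HXA", " assembler")     # prints: HXA assembler
}'
```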
c) regular expressions
- HXA performs only simple pattern matching to detect regular expressions, and relies on the TAWK run-time package to perform the actual parsing. However if TAWK decides the offering is not a legal regular expression it complains directly to the console (rather than, say, returning a null pattern). Which is not as neat as we might like, but it may not be possible to determine regular expression pattern legality simply by matching against another regular expression
d) numeric function arguments
- the sub-expressions which comprise function arguments in the RPN form are not range-checked. This is so that the VAL() function can return an out-of-range result which can come back in-range as part of a larger expression (and which will trigger an error if it doesn't). However since range checking only occurs once for every complete expression (at the end), an out-of-range intermediate result can be created and used for any numeric function argument. For some functions this won't matter as they will not be used if they are out of range (eg., MID$()). But any which do not range check might produce questionable results
e) EXIT block nesting clearing
- EXIT skips to the end of a macro, repeat or while block and silently closes any still-open IF blocks within the block. This behavior is necessary to allow the unconditional EXIT to be used within an IF block. However this unconditional nature means that it will also "cover up" an "IF without ENDIF" problem in the block body.
On the one hand that's somewhat helpful, as the error then can't hurt anything outside the block (such as unexpectedly cutting off assembly for hundreds of lines).
On the other hand, it's an undetected error. Moreover, unlike any other source code error in the skipped over portion, it's one that is actively covered up by HXA.
On the other other hand, it's an error that has no practical effect. If EXIT is executed HXA reacts the same way whether or not there is an unbalanced IF block. If EXIT is not executed HXA will detect an unbalanced block.
Revised source control
The difficulty is not so much user source code but macro expansions. If these are saved, and if used profligately, they can reach the array element limit (at least in the MS-DOS version). But the whole point of macros is to make life easier, so it won't do to eliminate them.
The 0.16x and higher versions of HXA by default save only lines read from source files and expansion lines that actually generate code. Macros may have quite complex logic covering many lines and still cause only one or two of those lines to be saved.
The 0.161 version reduced the count of saved line numbers (which are only applicable to lines read from files) by observing that these are used only for error reporting, hence only need to be saved for lines where an error has either already occurred or might be discovered later.
This also speeds the first assembly pass somewhat, and because all the lines necessary for the default list file are saved there is no slowdown when listing. If macro expansion listing is turned on, the situation simply reverts to what it has always been in the pre-v0.16x versions.
The 0.163 version saves the current line number of a source file only when it is interrupted by an include file. In all cases the line number of a source line can be re-calculated using the saved source text (although this can be slower than looking it up). During the first pass this is often not even necessary, as the value needed can be deduced to be the current line number (which is quite fast).
However these versions more-or-less just pick the "low-hanging fruit". If all listing is turned off, the only lines that need to be saved are macro, repeat and while definitions and those lines that generate code (in case of forward reference or range errors in them).
Regarding expansion definitions, HXA knows when they start and when they end, so it could be a fairly simple matter to temporarily "override" the user's specification and make sure those lines are saved. Moreover, if the index into the saved line buffer is noted at the start of non-nested repeat and while definitions, once they are complete the index can be re-set to that point, effectively erasing them from storage.
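The index-reset idea above can be sketched in Python (a stand-in for TAWK; the buffer and function names here are hypothetical, not HXA's):

```python
# Sketch of the proposed buffer rollback: note the saved-line index at
# the start of a non-nested repeat/while definition, then truncate back
# to it once the definition is complete.
saved_lines = []

def save_line(text):
    saved_lines.append(text)

def begin_definition():
    # remember where the definition starts in the buffer
    return len(saved_lines)

def end_definition(mark):
    # definition complete: erase its lines from storage
    del saved_lines[mark:]

save_line("lda #0")
mark = begin_definition()
save_line(".repeat 4")
save_line("sta buf,x")
save_line(".endrepeat")
end_definition(mark)
# only "lda #0" remains saved
```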
In the end this may all be moot, as Win32x TAWK 5 does not have the same memory limitations as MS-DOS TAWK 4.
Not likely to be compatible with any existing format, if only because such formats tend to be tied to specific processors (although what if a "such-and-such file" were recognized as a "processor mnemonic", and that HXA variant included the ability to output that format?)
It should be fairly easy to dump the state of HXA after the first pass to a text file that could also be read back as an include file. Use of a "linkfile" pseudo op would trigger this, and also suppress errors if the first segment of a source file was relative rather than absolute (unless the user also specified a binary or hex output file).
The main problem foreseen is turning the internal RPN expression form back into text, specifically for partially resolved expressions. Labels, numeric integers and all operators are not a problem.
Strings would have to have unprintable characters turned into escape codes. Literal strings would not have this problem, but partial resolution might have created dynamic strings with unprintable characters.
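A minimal sketch of the escaping step, assuming a simple hex escape syntax (illustrative only, not HXA's actual escape notation):

```python
# Turn unprintable characters into escape codes so a dynamic string
# survives a round trip through a text link file. Backslash itself is
# also escaped so the result is unambiguous.
def escape(s):
    out = []
    for ch in s:
        if 32 <= ord(ch) < 127 and ch != "\\":
            out.append(ch)              # printable: pass through
        else:
            out.append("\\x%02x" % ord(ch))
    return "".join(out)

escape("AB\x01C")   # printable text passes through, \x01 is encoded
```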
Regular expressions do not exist in their literal form in partially resolved expressions. This would have to be changed or their existence in link files made illegal.
It would probably be a good idea to require all external labels to be declared via a "public" or "global" or some such pseudo op.
New functions are easily added if a common use can be discerned. It is now more likely that user access to certain values will be provided by new functions rather than new internal variables, as it's easy to write functions that take no arguments, the style is not objectionable, and it simplifies the symbol-handling code.
a) FIRST$(str [,len])
- returns first <len> characters of string (<len> defaults to one if not supplied)
- similar to BASIC LEFT$(), except the length parameter is optional
- however not much different from MID$(str, 1 [,len]) or MATCH$(str, /^./), except more compact if only the first character is wanted
- in that case, though, is 'FIRST$(str) == "char"' more useful than '"str" ~ /^(char)/', which is already available ?
b) LAST$(str [,len])
- similar operation, comments, and objections as FIRST$(), except it operates on the end of the string rather than the beginning
- also, extending MID$()'s length argument to negative values allows MID$(str, -1) to achieve the same thing
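Rough Python models of the proposed FIRST$() and LAST$() behavior described above (a sketch under the stated defaults, not HXA code):

```python
# FIRST$(str [,len]) - first <len> characters, <len> defaults to one
def first_s(s, length=1):
    return s[:length]

# LAST$(str [,len]) - last <len> characters, <len> defaults to one
def last_s(s, length=1):
    return s[-length:] if length > 0 else ""

first_s("hello")      # "h"
last_s("hello", 2)    # "lo"
```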
c) MIN(numexpr1, numexpr2)
- returns minimum value of two expressions - easy to implement, but is it useful ?
d) MAX(numexpr1, numexpr2)
e) TIME$() - implemented v0.180
- essentially a wrapper for ctime(), re-named to match HXA conventions
f) ISLABEL(name) - implemented v0.190 as LABEL()
- boolean: is name a defined label?
- already available, as FORWARD("name") is itself a complete expression, although the sense is opposite (TRUE if "name" is not defined)
g) SEGABS(name) or SEGLDA(name) or SEGADR(name)
- the absolute load address of a segment ( SEGBEG(firstseg) + SEGOFF(name) )
- may make it easier to move to execution address (if different)
- but this can already be calculated using existing functions
- could also define this as a macro for ease of use, although it might have to be re-defined for every program using a different name for 'firstseg'
h) LABEL(strexpr) or GLOBAL(strexpr)
- makes a label out of strexpr and makes sure it is a global
- but how would a value be assigned to it ?
- might also consider it as a pseudo-op: .LABEL strexpr - presumably it would receive the value of the program counter at that point
- or, allow global string expressions in the label field (what are the implications of this? One is that skipping during block definitions has to somehow account for it, but when used in this context such expressions probably would not be constants. Perhaps a ".LABEL" psop is better suited, as skipping would be satisfied by the psop and the expression itself left for later resolution)
- might also provide something like this so macros named by string expressions during definition (which is allowed) could be expanded by string expressions which evaluate to their names (which isn't)
However PUTBACKS essentially provides all this functionality, including naming macros to expand by string expressions. Still, global labels in macro definitions are somewhat of a special case as things now stand, and a ".LABEL" psop would remove that condition in a consistent manner.
- makes a regular expression out of strexpr
j) CPU$() - implemented v0.180
- returns name of current CPU being used
k) User Functions
- easiest might be one-line functions:
- formal arguments must be variable or local labels
- definition would place the argument names and expression in a table and verify the expression can be parsed without error
- types would be deduced from the argument names so parsing could handle them
- evaluation might be a two-argument injected operator: the name of the function and the number of arguments
- execution would pull the arguments off the evaluation stack and assign them to the formal arguments, with the proviso that pulling an operator means an actual argument could not be resolved (which might be okay)
- then recursive calls to parse (always succeeds) and evaluate the function definition (might fail if there is a forward reference in the expression, which might be okay)
- might even be possible to handle optional arguments with defaults:
- how would recursive calls to user-defined functions work?
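A toy Python model of the one-line user function table, using Python's own parser and evaluator as stand-ins for HXA's expression machinery (all names here are hypothetical):

```python
user_funcs = {}   # name -> (formal argument names, expression text)

def define(name, formals, expr):
    compile(expr, name, "eval")        # verify it parses without error
    user_funcs[name] = (formals, expr)

def call(name, *actuals):
    formals, expr = user_funcs[name]
    env = dict(zip(formals, actuals))
    # expose every defined function, so recursive calls can be tried
    for fn in user_funcs:
        env[fn] = (lambda n: (lambda *a: call(n, *a)))(fn)
    return eval(expr, {"__builtins__": {}}, env)

define("area", ["w", "h"], "w * h")
define("fact", ["n"], "1 if n < 2 else n * fact(n - 1)")
call("area", 3, 4)   # 12
call("fact", 5)      # 120
```

Recursion "just works" in this model because the function table is visible during evaluation; whether that is desirable in HXA is exactly the open question above.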
k) PUSH$(strexpr[, name]), POP$([name]), TOP$([name]), EMPTY([name])
- user stack functions, default and named stacks
- PUSH$() returns its argument and pushes it on the stack
- POP$() returns the top stack item and pops it off the stack
- TOP$() returns the top stack item but does not pop it
- EMPTY() returns TRUE if the stack is empty
alternatively:
- .PUSH arg pushes an item on the stack
- .POP assigns the top item to a label and removes it from the stack
- .TOP assigns the top item to a label
- EMPTY() returns TRUE if the stack is empty
- if only one stack, how are types verified?
- should pushing or popping evaluate the argument? If not, how would we know the type is correct for the label being assigned to?
- maybe two stacks: a numeric stack and a string stack
- if stacks are themselves named, then the name indicates type
- implemented v0.200
- one string stack - multiple named stacks turn out to be a pain to manage
- a mix of pseudo ops and functions: PUSHS, POP$(), PEEK$(), EMPTY()
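A minimal Python model of the single string stack as implemented (Pythonized names; what an empty-stack POP$() or PEEK$() returns is a guess here, not taken from HXA's documentation):

```python
# One string stack: PUSHS / POP$() / PEEK$() / EMPTY()
_stack = []

def pushs(s):
    _stack.append(s)            # .PUSHS psop: push, no return value

def pop_s():
    return _stack.pop() if _stack else ""   # POP$(): return and remove top

def peek_s():
    return _stack[-1] if _stack else ""     # PEEK$(): return top, keep it

def empty():
    return not _stack           # EMPTY(): TRUE if nothing stacked

pushs("a")
pushs("b")
peek_s()   # "b"
pop_s()    # "b"
empty()    # False ("a" still on the stack)
```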
l) RELOFF(name, [name]) or OFFSET(name)
- returns the relative offset of "name" in its segment, ie., its offset value less any segment value (what should absolute origin segments return?)
- enables some expressions to be evaluated during the first pass that otherwise could not be
- a second argument would imply getting the difference between the two offsets. This could help verify that they both belong to the same segment
- should the program counter be allowed? How would it be specified? A separate function is one way - simpler for implementation, but for the user?
m) LOOP0() and LOOP1()
- return the zero- and one-based iteration counts of REPEAT and WHILE blocks, defaulting to zero (one) if no block is active
- the idea is that there would be no need to manually implement loop counters in source code
- "loop1()" can be thought of as "number of loops started" and "loop0()" as "number of loops completed"
- these have been implemented experimentally (quite easily done), but...
- no noticeable speed improvement observed in limited testing (a bit of a surprise, since repeated calculations were now avoided)
- somewhat non-intuitive behavior at WHILE block starts (the functions can be used in control expressions in tricky ways to "jump start" the loop)
- when used in multiple blocks of two or more nested blocks it can be difficult to apprehend what the result of any particular call might be
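The "started" vs "completed" distinction can be modeled with a stack of per-block iteration counts (a Python sketch; the stack and defaults are assumptions matching the description above):

```python
# Stack of one-based iteration counts, one entry per active block
loop_counts = []

def loop1():
    # loops started; defaults to one when no block is active
    return loop_counts[-1] if loop_counts else 1

def loop0():
    # loops completed; defaults to zero when no block is active
    return loop_counts[-1] - 1 if loop_counts else 0

trace = []
loop_counts.append(0)          # a REPEAT block starts
for _ in range(3):
    loop_counts[-1] += 1       # another iteration started
    trace.append((loop0(), loop1()))
loop_counts.pop()              # block ends
# trace == [(0, 1), (1, 2), (2, 3)]
```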
Explicit Expression Caching
Easy to implement, this would take the form of a "cache" pseudo op followed by an expression (maybe even a named expression, so the name could be used in place of the expression). The expression would be converted and cached but not evaluated. The advantage is that any expression the user thought or knew to be heavily used could be cached. The disadvantage, if automatic caching were simultaneously dropped, would be that every expression desirable to cache would have to be specified.
Partial File Output
In addition to what we have now, another idea would be matched "FILEGROUP..ENDGROUP" psops, where all segments named between the two would be output to the same file (named as an argument to FILEGROUP) in the order they appear. Any segment could belong to multiple output files. With lots of segments it could be a pain to name them all, though. Perhaps a regular expression could be of assistance?
Would this be of any use anyway?
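The regex idea might look something like this in Python (segment names and contents here are invented for illustration):

```python
import re

# Segments whose names match the pattern go to one output, in order
# of appearance - a sketch of regex-driven FILEGROUP selection.
segments = {"code0": b"\x4c\x00", "code1": b"\x60", "data": b"\xff"}

def filegroup(pattern):
    pat = re.compile(pattern)
    return b"".join(body for name, body in segments.items()
                    if pat.match(name))

filegroup("code")   # both code segments concatenated, data excluded
```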
More list file options
Perhaps an indication of which are source and which are expansion lines. - implemented v0.180
A source file indication might also be useful for included files. - implemented v0.180
User-defined page headers. These would probably require new functions to obtain file names, page number, time, etc. The main difficulty seems to be what to do with headers that require more than one line to display. If they are not part of the main text body either they are truncated or they mess up the top margin. If they are part of the main text body they could conceivably "squeeze out" the real text. Should any title be automatically truncated to fit on one line? - implemented v0.180, but only appear on first page due to above problems
Optionally sort the symbol listing based on value as well as alphabetically. Or make both sorts optional, with alphabetically default ON and value OFF. - implemented v0.180 as both methods always used
A PAGE pseudo op to force the next listed line to be at the top of the next page. Because this would actually be printed itself, a PAGE at the first line of a new page would cause the rest of the page to be blank. Should this be suppressed? Should PAGE have any effect in an unlisted section? If blank lines follow PAGE in the source, should they be listed at the top of that next page? - implemented v0.180 (no page top suppression, no effect in unlisted section)
Allow regular expressions to be specified by string expressions
Perhaps an "S2R" type-check at parse time and a "verify and convert regex" step at evaluation time would be sufficient.
ENDMACRO [name] psop
Extend ENDMACRO to match an optional name to the current MACRO name, much like ENDSEGMENT can optionally match the current SEGMENT name - implemented v0.180
ASSERT expr psop
Issue an error message if "expr" is false. A macro based around an IF psop could easily do the same thing during the first pass. ASSERT's difference would be that it could save any partially resolved expression during the first pass and then complete evaluation during the second. Perhaps ASSERT1 and ASSERT2 could force evaluation during a particular pass.
Could this be incorporated into the code storage array as an "ASSRT" type? Brief inspection indicates that a fair number of places would have to be taught the difference. The biggest problem appears to be that unlike the other element types, "ASSRT" would not generate code or data. Rather than teach every other part to ignore "ASSRT", it might be simpler to give it its own, basically parallel, structure. - implemented v0.200 (piggy-backing on code used for display of EQU, DS, etc)
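A sketch of the two-pass deferral in Python, using eval() and a NameError as stand-ins for HXA's expression evaluator and its forward-reference detection (all names hypothetical):

```python
# Deferred ASSERT: expressions that cannot be fully resolved in pass
# one are saved and re-checked in pass two.
symbols = {}
pending = []

def assert_psop(expr, line_no):
    try:
        if not eval(expr, {"__builtins__": {}}, dict(symbols)):
            print("assert failed, line", line_no)
    except NameError:
        # forward reference: save it, finish checking in pass two
        pending.append((expr, line_no))

def second_pass():
    failed = []
    for expr, line_no in pending:
        if not eval(expr, {"__builtins__": {}}, dict(symbols)):
            failed.append(line_no)
    return failed

assert_psop("top < 0x10000", 12)   # deferred: 'top' not yet defined
symbols["top"] = 0xFFFF            # defined later in pass one
second_pass()                      # the saved assert now passes
```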
CPU Switching Support
Allow variants which support multiple CPUs to switch between them at assembly time. The basic desire is to disallow certain CPU/instruction combinations from happening, eg., some portion of a program must run using only the lowest common instruction set. Instructions not supported in that section must cause an error, but can be allowed in other sections.
One idea might be a ".CPUSET" psop which lists all CPUs to be used. If no ".CPUSET" psop appears before ".CPU" then the only allowed CPU is the one named by ".CPU". Otherwise any CPU listed by ".CPUSET" is allowed.
Singly-dimensioned arrays of numbers or strings
- it seems fairly straightforward to store them in the symbol table (one idea is that their indices can be thought of as part of their names, thus allowing them to be stored as scalars - this might also make it possible to "skip" elements or more generally to treat them as associative arrays)
- it also seems relatively easy to extend expression evaluation to handle them
- the main difficulty appears to be how to assign values to elements. The standard method of using the label field would have to be extended to recognize array notation with a potentially complex index expression. An alternative syntax - eg., ARRAY name, index, value - is easy to implement but not at all what programmers are used to, plus it would not look at all like how the arrays are later used
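The "index as part of the name" idea reduces to ordinary scalar lookups, as this Python sketch shows (a dict stands in for HXA's symbol table; sparse and associative use fall out for free):

```python
# Array elements stored as plain symbol-table scalars whose names
# include the index, e.g. "buf[7]".
symtab = {}

def set_elem(array, index, value):
    symtab["%s[%s]" % (array, index)] = value

def get_elem(array, index):
    return symtab["%s[%s]" % (array, index)]

set_elem("buf", 0, 0x100)
set_elem("buf", 7, 0x200)        # elements 1..6 skipped, no storage used
set_elem("msg", "greet", "hi")   # associative use with a string index
get_elem("buf", 7)               # 0x200
```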
Implied Formal Macro Arguments
This would allow a macro to accept a variable number of actual arguments without having to declare default values for any unspecified ones
- the definition might use "..." as the last formal argument to signal that any number (including zero) of further arguments can follow
- the expansion would assign any arguments found past the last named formal argument to pre-defined labels, possibly '?1', '?2', etc
- '?0' could be a count of how many are actually present at expansion time
- references to pre-defined labels with no assigned values should default to null strings?
- OR these might ALWAYS be defined at expansion time, letting the user choose whether or not to use formal argument names as well (which would still be required in order to use default arguments with them) - though the "..." notation is designed to flag that "extra" arguments are acceptable (do not throw them away as is currently done)
- another way might be to create another "ASSUME" convention that flags that extra arguments are simply accepted as implied actual arguments, which has the advantage of only needing to be done once, and the disadvantage of applying globally with no visibility beyond where it's done
- of course, considering default arguments, there is in effect a way to 'count' arguments already - does this add more complication than it's worth?
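The binding step described above can be sketched as follows (Python as a stand-in for TAWK; '?0', '?1', ... as proposed, everything else hypothetical):

```python
# Bind "extra" actual arguments to the pre-defined labels '?1', '?2',
# ... with '?0' holding the count present at expansion time.
def bind_args(formals, actuals):
    env = dict(zip(formals, actuals))   # named formal arguments first
    extras = actuals[len(formals):]
    env["?0"] = len(extras)
    for i, val in enumerate(extras, start=1):
        env["?%d" % i] = val
    return env

bind_args(["dst"], ["buf", "1", "2"])
# {'dst': 'buf', '?0': 2, '?1': '1', '?2': '2'}
```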