My Quest for a Better Programming Environment

Jeff Vroom, Orig: 9/9/2003 - Updated on 1/6/2004

While I'm working part-time for ATG as a consultant, I'm spending my free time looking for a better programming environment. Though I think Java and J2EE have made many improvements in programmer productivity, I think there are still some great improvements to be made. Here are a few of the key changes I'm looking for. If you have any suggestions of technologies to look at, feedback on any of this, or would like to be notified of any updates to my site let me know by dropping me an email at jeff@jvroom.com.

Fewer Languages

We need fewer languages in any given implementation. Currently, we have languages for configuration (e.g. XML, .properties), glue/script code (javascript, python), system programming (java, C++), and database programming (sql). Complexity occurs at the boundaries of these languages causing overhead in the programming process. Interfaces must be defined, learned, implemented at each boundary and often place limits on the flexibility or increase the "verbosity" of the resulting program. It is difficult or impossible to move code from one tier of computation to the other. Thus a lot of preplanning and designing is required up front which takes a lot of time. Mistakes are costly as you rewrite schemas and application code which interfaces with them.

It would be much more efficient to have a single language while was expressive enough for all 3 of these purposes. Code is separated into layers, and layers are filtered so that a user with a given skill set only sees constructs that they match their skill level.

Currently there is a growing divide between script programmers and system programmers. System programmers need high performance, type safety, implementation hiding, threading/synchronization/exception handling etc. Scripting languages are becoming more powerful more rapidly than system languages because they are more loosely controlled and get new operators quickly. They offer simpler, higher level data types which are easier to learn and use. But they offer poor performance and have may have threading and synchronization limitations and produce less reusable and maintainable code due to the lack of type safety and implementation hiding.

Though it will be difficult to bridge this divide, I think it is important to do so. We need one language which is high performance enough with the robust features required to implement any application. This includes shared libraries and memory between applications (something java still lacks). It should include high level operators which make programmers productive and should be extendable by adding new language implementation modules.

I think type safety is a tricky issue but one that can be improved with a slightly different editing paradigm. In most computer languages, there is a simple source file. In this system, using a versioned database the system provides a source file view for a given user which may be customized based on their settings. As the programmer changes the file, the compiler reads it and potentially annotates it with additional type constraints which were inferred. This can include casts and other variable types which were inferred from the context. The real source edited by the user is maintained and can be viewed by turning off these annotations.

Even when we must add sophisticated language constructs that script programmers might not fully understand, it should be easy to separate these constructs into layers not exposed to the higher level programmer. Similarly, certain administrative type programmers should be able to operate in layers which only contain configuration suitable for their skill set.

Programs are constructed in layers

Programmers build modules which each define their own "layer". This layer can see and modify any existing piece of the modules it depends upon. When they release the program, they can also define layers which are customized for the installation or for the user using the program. When a given user runs a program, the version that they see is the one produced by merging together these layers. The view of data in a given layer is restricted by security credentials.

Java supports the CLASSPATH notion which is similar to what I'm talking about here. It is different in that in Java, one class earlier in the CLASSPATH can replace a previous one. In this layered model, you can customize a given class by adding methods, replacing methods, removing interfaces etc. Aspect oriented programming does provide higher level layering concepts - for example, you can add synchronization or transactions on as a layer to an existing set of classes. My view is that aspects are really just more powerful code combination abstractions. There are certainly many of these which will prove to be easy to learn and help build more reusable code but perhaps the initial point is that we need an extendable language framework which supports an extendable set of language primitives.

If we had more powerful layering constructs, we could not only reduce a lot of code copying but use higher level abstractions in the design which are easier for new programmers to understand. For example, in a base layer you might have code which creates an HashMap. You might decide it should use a TreeMap in a sub-class or perhaps in a customized version of that same class. This should be possible without having to rewrite the base class to expose that constructor. For example: "replace all HashSets in foo.java with TreeSets".

One Data Framework

There is one data framework which stores classes, instance configuration, and program data. This data framework supports versioning (optionally) for use as the source control system, as well as for application services such as recovery, undo, revision control etc. This data framework incorporates adapters which maintain data in SQL or on a file system as well as a high performance distributed client/server databases which directly map regions into virtual memory.

It has the full features of an OO database, but also integrates in data from file systems and SQL. It supports transactionality with pessimistic and optimistic locking, and can be extended to support additional transaction conflict resolution policies or rules.

Just as the code for a given application is constructed by merging together layers of progarm, the data framework allows merging together of layers of data. One layer may add or remove the visibility of data items in a previous layer, it may extend the schema and add new values, hide values, add security constraints, add additional integrity constraints. Layers are also used to implement transactions. A layer can place constraints on dependent layers and we define a framework for specifying rules for how layers are combined to deal with locking, error propagation, etc.

Persistence integrated into the Language

The current model is to build in persistence behavior directly using code, often using apis such as JDBC. A better way is to abstract persistence from the code using some technique like JDO or EJB CMP, then specify configuration via XML which defines the data model. I think an even better model is to define a more generalized concept of "memory". The code simply sees variables as attributes of other objects just as in most programming languages, but we attach more flexibility to how that "a.b" reference is implemented. This gives more options for implementing lifecycle of state than static, instance, and automatic stack variables. Also, lifecycle concepts are separate from how data is accessed or modified.

Persistence behavior is implemented by plug in modules which implement types of persistence. These plug in modules have interfaces for compiling references into the runtime as well as interpreting the value for efficiency.

To specify how and where the variable is stored, we attach meta-data onto each piece of memory such as transient/persistent, versioned/un-versioned, scope/lifecycle, ownership of this value (is it a part of something else, does it have a parent), does it any meta-data from its parent?, etc. These are configured via "meta-data" attached to variables. Since this meta-data can be modified in a separate layer from the code, you can control persistence without modifying the code itself.

Indexed queries should be part of the same language making quick access to the data easy as well.

Use the Generational Model for Language Development

Read Cznarecki's book for a good, extremely detailed description of some of these techniques, particularly the chapter on intentional programming. Here are I think the important concepts. A "generative" model of program development is used: the program is defined via files on a disk (which will at some point be persisted in a versioned database). To build it, we traverse this program tree using multiple passes. Each pass attempts to reduce the level of complexity of nodes until each of the nodes in the tree is "executable". At this point, we can interpretively execute the program, or generate code which executes it in a compiled manner. We might decide to compile certain nodes and do so by generating code in the runtime language, then dynamically loading in those new modules. This is done in batches for efficiency.

Because we are using a versioned object oriented database in the programming model, we also have a record of all running processes which need to be updated for any changes made to the program. In an OODBMS, this usually requires that objects be migrated to the new schema but now we have a new option. We can continue to run using the old code for the old objects without migration (because we have those versions in the versioned DB). Of course we should also have the option of passivating/migration/reactivating modified objects as well.

This all occurs in an automated fashion managed by the system and its configured policies. A programmer modifies the code in a particular layer and all processes running off of that layer get updated transparently according to their rules. An interpretive model also operates on the same language for cases where compilation is not appropriate or for improved debugging and diagnostics.

The original uncompiled model of the program can be lazily loaded into the compiled version of the program for debugging and diagnostic purposes using hooks that we place into the compiled version of the code. This will be used mostly for customization, diagnostics - we avoid using the larger meta-data rich version of the program for "normal" operation of the application to improve efficiency.

This model has many advantage of the "byte code" model used by languages such as Java and .Net. By implementing the high level language as simply a translation from one high level language to another, you avoid a lot of implementation complexity and also improve the compilers ability to optimize the code. You also have the option of doing "static program analysis" for diagnostics and optimization. You can eliminate runtime code which is not executed during this generation process. You can also use global knowledge of the application in making bindings (i.e. if there are no subclasses, you can invoke an instance method directly). Even if these bindings change later on, you can transparently recompile to get the required more general version of the binding.

For applications where distributing the source is not compatible with the business goals, you can "host" the system on a server with a similar architecture to the end user's environment. The user downloads new executables produced automatically from one of these servers and so does not need the source on their system.

Security

Security is built in to the language as well. In order to perform some action, you need first to have been granted the access to do so and that defines your privileges. This single security mechanism is used for both internal program security (what class can call what method) and external user security (what user can modify what piece of state, or invoke what method). Rules can be added which restrict access to a given variable, method, etc. for a given context. This rule can include properties about the user or the code which is making request that are used for the autentication criterion. We also attach meta-data to "user input" or untrusted variables (i.e. those received from an untrusted process). We can have constraint rules defined for these variables - maximum length, invalid characters etc to avoid data corruption and trojan horse type security problems.

The system should maintain runtime efficiency statistics (memory+CPU resources consumed). Part of the authorization for a given operation ought to commit some resource limit for the processing of that operation and if we run out, it should be possible to abort the operation. This might eliminate some hacker attempts as potentially they'd exceed the CPU/memory resources allocates to a given user in a given section of time. In this general model too, we might be able to write security rules which could enforce security at the lowest level for the tighest security.

Memory model

I think the global nature of garbage collection is a major limitation and believe most memory reclamation should be performed using reference counts instead. Python uses this technique effectively with an optional global pass to clean up reference loops. I think there should also be a way to mark a back pointer as unreferenced to avoid these loops in the common case of a bi-directional relationship. If we define this correctly, we can simplify the maintenance of these back pointers as well by keeping them in sync with the forward version automatically.

Though I think a purely unmanaged memory model like Java is an excellent feature of the language, if you do not have a way to incrementally add management of explicit types you will run into a broad class of performance critical applications which are not suitable for this language.

More specifically, I want my language to operate in a few different modes (each module using a different model):

Clearly the second model is only safe when the code is well debugged. Note though that the first model can be used to test the code in the second model - when we detect an error in a free call, we can omit the free call and mark that type as "unmanaged". When we detect an unsafe bounds or type reference which is marked as "safe" we remove that flag and signal a workflow to the responsible programmer.

The Java memory model includes an extra level of indirection for each pointer access (i.e. "handles") which allows objects to be relocated without application code being aware. I think this is another feature of Java which makes it insuitable for performance sensitive applications. Because you do not always know performance requirements when you start development, Java thus becomes an inefficient language for programmers who need to compete on a cost/performance basis. Of course performance tuning takes a lot of the programmer's time!. Whether or not handles are used for a particular object reference should be something that is configurable at the container level for an object. Clearly this affects all code which uses that object, but if the language defines only one way to dereference a pointer, it limits the flexibility of the language greatly. I do believe that in some cases the ability to move objects in memory can make memory use more efficient by since it also expands the amount of memory used by the application and causes a hit on every pointer dereference, it should not (in my mind) be fundamental to the language. Also note that handles are stored in a separate region of memory which decreases the effectiveness of the on-chip processor cache.

Fixed and dynamic binding

How does a programming language implement the "a.b" construct? There are a few different approaches at implementing name binding in various programming environments but in my view they fall into two important categories: fixed - where a given name maps to a specific integer offset in bytes or slots to where the value lives in its object container. There are also dynamic access where you use a symbol reference to look up the slot number - possibly via a linear search through a table or using a b-tree search. You might maintain a type cache for these offsets when they are invariant for a common class of objects. C and C++ only use fixed binding, Java uses a mix of fixed and dynamic binding - fixed for member variable and method access through objects and dynamic for method access through an interface.

Java uses a hack (in the interpreted version at least) for interfaces where it patches a "guess" offset into the instruction in the byte code for the interface de-reference. The instruction in the byte code stores the last offset used for this de-reference. When encountering an "a.b" reference where "a" is an interface, it looks in the method name table at the "guess" offset for the object currently assigned to "a". If so, it uses this method and if not, it begins the linear search for the proper method in this object in the method table. In many normal cases, this no doubt causes a lot of overhead though perhaps they have fixed this in the JIT version somehow.

Java has one other important benefit over purely fixed languages like C and C++ in how it supports incremental linking. Each class can depend on or use other classes but the class file in general does not duplicate information even from superclasses. As a result, you can frequently modify Java classes without having to recompile classes that they depend upon. But this flexibility costs at link time because the offset tables have to be built up for each class incrementally as it is loaded. Java also lazily links in method references between classes - again by patching the byte code using self-modifying code. Clearly this architecture makes shared libraries a lot more complicated to implement and also impacts the startup time.

In my prototype, I have a mixed model which allows both fixed and dynamic bindings for the same language construct. In the model for the application, we can determine whether a given slot is fixed or dynamic and generate the description. Further, when compiling in a reference to a member, we can generate a fixed or dynamic offset into the reference depending on rules we have defined.

For slots which are not marked as permanent, they might go into both the fixed and dynamic lists. For example, within the same module you would access them as fixed but from another module you'd access them as dynamic. When fixed offsets are changed in a "published" layer used by other modules/applications, any module which requests notification will be recompiled automatically and its data migrated. data migrated. This is a painful process so I think the basic rules in the system should tend to keep values as dynamic until we need to tune accesses to them as part of the performance tuning process. A wizard can display the most frequently accessed member variables (and where they are accessed from) which are only accessed via a fixed offset in the test suite. You selectively then can mark as fixed only those members which need it. The code manages this meta-data in a separate layer so it does not overly complicate the code itself.

Programming with Rules - A Functional/Declarative Skeleton

The language should support higher level operators that are useful such as triggering execution of methods based on events (timer, property changed, etc.) It should include rules which constrain and relate values. "when A changes set B = A" or always derive B's value by getting A. As it has built-in persistence capabilities, we have a rule oriented query engine for processing and filtering lists of objects. This query language should support a flexible regular expression engine as well for flexible pattern matching.

I hesitate to call this functional programming but there are some similarities. In a pure functional program, a program is defined by specifying a "functional expression" relating mathematical functions which do not have any side effects. You build a graph which represents the computation and then evaluate it in whatever way seems most efficient. The idea is to compute and cache values to minimize computation and/or memory.

Lots of languages integrate functional constructs with procedural (or imperative) constructs - two interesting examples are python and ocaml and are good models to look at but are lacking in other respects. Usually you evaluate a functional construct to retrieve a value and it may in turn use a dynamic value from the application in this construct.

In the model I'm thinking about takes a bit more flexible approach. We build up a program tree when generating the program. Many nodes in this tree embed functional constructs - at least in the same way that a SQL query is functional. Because it is a generative model, these constructs can evaluate when generating the code, get turned into little compiled chunks of code, or if necessary turned into data structures that are dynamically executed when running the compiled version.

One key goal I have in designing a language is to build powerful and efficient abstractions for the programmer. Eliminating side effects from a particular construct - i.e. making it an unconditional rule - is one important way to simplify something as you have eliminated all other possibilities. If you say: A *is* B (i.e. a lasting "functional" relationship) we have constrained the system much more tightly than if we say A equals B using the traditional assignment operator.

I think a good language becomes functional in this looser sense because in choosing the simplest most powerful abstraction for a given domain, you choose one where as much logic as possible goes into the 'declarative', as opposed to the 'imperative' part of the program. Declarative constructs are not order dependent and so are much easier to view and edit using tools and code generators. By putting as much logic as possible into these declarative constructs, you simplify the application.

One problem with functional systems and "rule based" systems is that they do not allow for conflicts and yet conflicts arise in real programming environments all the time. Rather than trying to generalize the functional language to make it possible to eliminate all conflicts, I think a better approach is to allow more flexible handling of the conflicts - both detection and avoidance as well as exception handling. In some cases, perhaps a wizard is invoked to help build a rule to resolve some types of conflicts.

Another way folks describe language constructs is as declarative and I think this might be another useful way to categorize programming constraints and is probably more suitable to describe what I'm talking about than "functional" but I've seen these terms used in different ways.

Synchronization as a separate layer

Synchronization rules should be layered on to the application without affecting the original code. Methods can be marked as synchronized, Locks can be substituted or merged together.

Developer tools should be developed which identify potential deadlock cases by analyzing patterns of locks held at any given moment. When it sees locks acquired in a conflicting order, it can signal a potential deadlock and provide the stack traces which are in potential conflict.

Transactions Built in to the Language

Transactions should be hierarchical and you should be able to enable transactional attributes as meta-data to existing methods in a subsequent layer of your application.

Extendable and Generalized Notion of Scope

In most current languages, there are three different lifecycles defined for variables: static (one per class), and instance (one per instance), automatic (one per method invocation). Some languages add per-thread variables. I think a language should have a generalized form of "scope". Each variable is given a scope attribute like: one-per-user, one-per-system, one-per-process, one-per-thread, one-per-workgroup. Scopes become a powerful language construct also when you can associate them with a given operation. For example, in a language parser you often want to maintain variables which are exposed throughout a given parse operation. The standard way to do this is to create a "parser state" class and start associating members with that object - then pass this object in as a variable in every "parse" method. Some might put all of this state in the Parser object itself and use a different class to store state shared by all parsers of this type.

In my approach, we mark such state as "per-operation(parse)". These variables are defined at the beginning of the parse operation and are available for all methods invoked in a call tree below parse. The generator will create a struct called OpParse which stores all member variables in that scope. This object gets allocated at the beginning of all methods which implement the specified operation and are passed to all methods which are part of that operation. When the operation is over, the object is freed and any cleanup operations performed on that object will occur.

If a variable is accessed from a method not directly invoked from parse, we detect this as compile time and trigger an error which can eventually give the user a wizard to help the user repair the problem. They might need to change the scope for the state or have some other logical error in their thinking.

Advanced programmer's can define their own application specified scopes as well: per-order, per-workflow, per-project etc. These scopes are coordinated in helper code which associates the current object with the given thread thus setting up a safe environment for script level programmers to operate in without having to worry about lifecycle issues for these application specific objects. They simply refer to them by name which gets mapped to the appropriate physical "struct" in that context. Since typically this mapping is bound at compile time (you know both the object and the scope), you can make this mapping transparent or at least pretty efficient. Just how efficient remains to be seen, but I think that if you assume that the scope will not change it should be as efficient to use scopes as other ways of accessing variables.

Interestingly, this "scope" concept is used in my prototype to implement the class/instance relationship too. Values stored in the class as simply part of the per-process state of the object. Further, we separate read-only per-process state into a different scope so that it can be shared between processes using the same object.

Containment

Each object can have state for one of several scopes. We collect the state for a given object and partition in by scope into separate "records" which are optionally chained together via references (for example, you can get from the object foo's instance state to its type state by following the type link). When you define a new scope, you specify which scopes it is chained To or from.

The container decides how an address for the object is assigned, how accesses to this object are retrieved at compile time and runtime. The rules for retrieving a container are defined using a mapping involving both the object and a given scope. In other words, a given scope will provide a default container. For a given pattern of objects, we might associate a different type of container for a given scope. The container defines the physical storage location too - whether it is in a database etc.

The container is the basis for the implementation of separate heaps, persistence frameworks etc. Containers can be nested. In fact, a simple object is itself the container for the properties inside of it. Contained objects can either be independent or owned. If they are owned, their lifecycle is tied to that of their container. For efficiency, most properties will be owned by their objects which attaches their lifecycle to that of the parent. It allows us to allocate them as sub-objects. It will be rare for code to ask for the identify of those owned-property objects - usually, you'll ask for them by their value. But these objects remain full fledged objects which can store arbitrary meta-data even at runtime. Each parent object can stores a set of references to owned-properties who have sub-objects created and lazily can create them when they are non-trivial and remove them if they are trivial.

Testing

The language should include tools for automatically generating tests for individual pieces of code. It can record all state used and updated in an operation as well as what code was modified in the recording of the test. To replay this test, it must restore the initial conditions of the test, then invoke the code it is testing and compare it to the known good result. External services used by the application should be simulateable so that they too can participate in test generation. With this type of system, the language can maintain test databases and automatically run tests before checked in changes are merged to a shared layer.

Fundamental Data type

I believe that these can all be best implemented in a system with a single elemental data type. But where Lisp has its "list", this language would use an ordered list of name/value pairs. A list really is a degenerate form of this data type (where the place in the list is used as the name of the value). I think using this data type removes the order dependence of arguments in lisp expression which created an awkward syntax. It is not easy to figure out what the roles of the arguments are simply based on their position in the argument list. The order of parameters in a list is arbitrary memorization which is difficult for regular people.

To me a better model is to give each parameter a name. The programmer can either specify argument values by position or by name. If they omit the name, the compiler can optionally insert the name into the code view for clarity if desired by the programmer.

Each value in an ordered list can either be a primitive/leaf value, or may have its own associated meta-data. It is important that you can extend any value in the system to add additional meta-data (and the associated constraints or behavior) without breaking any interfaces or having to rewrite the code. Each object in the system can have its own ordered/named list associated with it. Thus data is hierarchically structured to an arbitrary degree.

It is true that promoting simpler concepts like lists to this more general data type can cause inefficiencies, but I think these are largely removed in the generative model. You have the opportunity to do analysis of program regions, choose the most efficient data type to represent a construct and then ensure that all references to you use the right type. The most general form of the data - the meta-model of the program - only lives during compilation time, when running the interpretive model, or perhaps during install/re-configure.

The compiled version of the program will separate itself from the meta-description when performance is a factor. Given the modern memory requirements of users, taking up the users memory is not a problem but taking up their time is. An efficient low foot-print application of 1 or 2M that bloats up to 20-40M when in a diagnostic/install/configure mode is acceptable and I think a technique which will offer users the most satisfaction.

For example, the test environment for a given application would live on the server at a known location but is fully available for the application if it encounters any failures. It runs the suite - finds a patch and downloads and installs it

Language Syntax

I'm still evolving the specific syntax and my goal is to support different "dialects" which can be turned on and off on a per-statement level. Dialects will be created for users of existing languages like Java, Python, which preserve their basic aesthetics. Dialects can also be used for various ways of nesting lists and escaping code. The idea is to make it easy to import languages, to disambiguate ambiguous contexts, in a context sensitive grammar to turn a file into a program.

Rather than specifying a well-defined language grammar, I will create a robust test suite which defines the language inductively rather than deductively. Changes to the language that break these tests will not be allowed. As long as we are careful in designing both the language and the test suites, I think this will make a language which is easier to learn, easier to extend and where the code is more accessible to the more programmers.

Of course this is not a new idea and there are some good reasons why I think folks do not take this approach when writing programming languages:

Only the hard-core geeks learn the language from the grammar and I believe that is only because it is the most concise page which contains the necessary keywords. The simple examples or equivalences page would be just as effective for these folks I think. Since the language is itself implemented using an analytical process - it is going to be well defined if it is well implemented. Open source review and good testing will prevent major flaws from being introduced. Inductive tests will I think adequately constrain the implementation and give real confidence that the language works by practice not proof. By incorporating end-user applications into the testing phase, any language modifications can be validated with large amounts of code before they are released.

With a language specification perhaps we are lulled into a false sense of security by the spec into thinking that it encapsulates portability of code. So many factors in the implementation can break working code without touching the language grammar so I do not think we are losing anything here.

By implementing a language that can be trivially extended by the user, using their own private dialect, we develop one family of interconnected languages. Since language elements can be shared or re-mapped in different dialects, languages can differentiate themselves like different species while passing back and forth potentially remapped or customized genes which are known to be effective. Any application can use whatever language elements from whatever dialects it wants.

The third problem above - the inability to write a new parser - is perhaps the most troubling argument for why a grammar is necessary. But I think that as long as language elements in the dialects which implement broadly applicable features are licensed properly (i.e. free and open source), the real question is how hard it is to write a translator into another language. To protect ones work, having your information technology tied into a single language makes you vulnerable. But, the whole point of my project is to build a translator from this language into an efficient version of the application in the host language (for now C and Java). If a new language comes out and you want to migrate your code base to that language, you can build a translator for that language.

Conclusion

So this is the state of my thinking so far. Right now, I have a small C prototype which implements the "dynamic/fixed" struct format I want to generate from my language. This format allows for both fixed binding between components (ala C) and also dynamic binding (which allows dynamic re-linking without re-compilation for compatible changes). I'm now writing the parser and generator for the language in itself. When this is finished, if I still think it is looking good I'll translate that to C by hand and try bootstrapping the language that way.

Some interesting systems I've seen so far are jade, eyedb, dylan, o'caml. globus grid toolkit

I'm interested in any feedback you have on these ideas - send me an email to jeff@jvroom.com.

Visitors to this page: