While I'm working part-time for ATG as a consultant, I'm spending my free time looking for a better programming environment. Though I think Java and J2EE have made many improvements in programmer productivity, I think there are still some great improvements to be made. Here are a few of the key changes I'm looking for. If you have any suggestions of technologies to look at, feedback on any of this, or would like to be notified of any updates to my site let me know by dropping me an email at jeff@jvroom.com.
It would be much more efficient to have a single language while was expressive enough for all 3 of these purposes. Code is separated into layers, and layers are filtered so that a user with a given skill set only sees constructs that they match their skill level.
Currently there is a growing divide between script programmers and system programmers. System programmers need high performance, type safety, implementation hiding, threading/synchronization/exception handling etc. Scripting languages are becoming more powerful more rapidly than system languages because they are more loosely controlled and get new operators quickly. They offer simpler, higher level data types which are easier to learn and use. But they offer poor performance and have may have threading and synchronization limitations and produce less reusable and maintainable code due to the lack of type safety and implementation hiding.
Though it will be difficult to bridge this divide, I think it is important to do so. We need one language which is high performance enough with the robust features required to implement any application. This includes shared libraries and memory between applications (something java still lacks). It should include high level operators which make programmers productive and should be extendable by adding new language implementation modules.
I think type safety is a tricky issue but one that can be improved with a slightly different editing paradigm. In most computer languages, there is a simple source file. In this system, using a versioned database the system provides a source file view for a given user which may be customized based on their settings. As the programmer changes the file, the compiler reads it and potentially annotates it with additional type constraints which were inferred. This can include casts and other variable types which were inferred from the context. The real source edited by the user is maintained and can be viewed by turning off these annotations.
Even when we must add sophisticated language constructs that script programmers might not fully understand, it should be easy to separate these constructs into layers not exposed to the higher level programmer. Similarly, certain administrative type programmers should be able to operate in layers which only contain configuration suitable for their skill set.
Java supports the CLASSPATH notion which is similar to what I'm talking about here. It is different in that in Java, one class earlier in the CLASSPATH can replace a previous one. In this layered model, you can customize a given class by adding methods, replacing methods, removing interfaces etc. Aspect oriented programming does provide higher level layering concepts - for example, you can add synchronization or transactions on as a layer to an existing set of classes. My view is that aspects are really just more powerful code combination abstractions. There are certainly many of these which will prove to be easy to learn and help build more reusable code but perhaps the initial point is that we need an extendable language framework which supports an extendable set of language primitives.
If we had more powerful layering constructs, we could not only reduce a lot of code copying but use higher level abstractions in the design which are easier for new programmers to understand. For example, in a base layer you might have code which creates an HashMap. You might decide it should use a TreeMap in a sub-class or perhaps in a customized version of that same class. This should be possible without having to rewrite the base class to expose that constructor. For example: "replace all HashSets in foo.java with TreeSets".
It has the full features of an OO database, but also integrates in data from file systems and SQL. It supports transactionality with pessimistic and optimistic locking, and can be extended to support additional transaction conflict resolution policies or rules.
Just as the code for a given application is constructed by merging together layers of progarm, the data framework allows merging together of layers of data. One layer may add or remove the visibility of data items in a previous layer, it may extend the schema and add new values, hide values, add security constraints, add additional integrity constraints. Layers are also used to implement transactions. A layer can place constraints on dependent layers and we define a framework for specifying rules for how layers are combined to deal with locking, error propagation, etc.
Persistence behavior is implemented by plug in modules which implement types of persistence. These plug in modules have interfaces for compiling references into the runtime as well as interpreting the value for efficiency.
To specify how and where the variable is stored, we attach meta-data onto each piece of memory such as transient/persistent, versioned/un-versioned, scope/lifecycle, ownership of this value (is it a part of something else, does it have a parent), does it any meta-data from its parent?, etc. These are configured via "meta-data" attached to variables. Since this meta-data can be modified in a separate layer from the code, you can control persistence without modifying the code itself.
Indexed queries should be part of the same language making quick access to the data easy as well.
Because we are using a versioned object oriented database in the programming model, we also have a record of all running processes which need to be updated for any changes made to the program. In an OODBMS, this usually requires that objects be migrated to the new schema but now we have a new option. We can continue to run using the old code for the old objects without migration (because we have those versions in the versioned DB). Of course we should also have the option of passivating/migration/reactivating modified objects as well.
This all occurs in an automated fashion managed by the system and its configured policies. A programmer modifies the code in a particular layer and all processes running off of that layer get updated transparently according to their rules. An interpretive model also operates on the same language for cases where compilation is not appropriate or for improved debugging and diagnostics.
The original uncompiled model of the program can be lazily loaded into the compiled version of the program for debugging and diagnostic purposes using hooks that we place into the compiled version of the code. This will be used mostly for customization, diagnostics - we avoid using the larger meta-data rich version of the program for "normal" operation of the application to improve efficiency.
This model has many advantage of the "byte code" model used by languages such as Java and .Net. By implementing the high level language as simply a translation from one high level language to another, you avoid a lot of implementation complexity and also improve the compilers ability to optimize the code. You also have the option of doing "static program analysis" for diagnostics and optimization. You can eliminate runtime code which is not executed during this generation process. You can also use global knowledge of the application in making bindings (i.e. if there are no subclasses, you can invoke an instance method directly). Even if these bindings change later on, you can transparently recompile to get the required more general version of the binding.
For applications where distributing the source is not compatible with the business goals, you can "host" the system on a server with a similar architecture to the end user's environment. The user downloads new executables produced automatically from one of these servers and so does not need the source on their system.
The system should maintain runtime efficiency statistics (memory+CPU resources consumed). Part of the authorization for a given operation ought to commit some resource limit for the processing of that operation and if we run out, it should be possible to abort the operation. This might eliminate some hacker attempts as potentially they'd exceed the CPU/memory resources allocates to a given user in a given section of time. In this general model too, we might be able to write security rules which could enforce security at the lowest level for the tighest security.
Though I think a purely unmanaged memory model like Java is an excellent feature of the language, if you do not have a way to incrementally add management of explicit types you will run into a broad class of performance critical applications which are not suitable for this language.
More specifically, I want my language to operate in a few different modes (each module using a different model):
The Java memory model includes an extra level of indirection for each pointer access (i.e. "handles") which allows objects to be relocated without application code being aware. I think this is another feature of Java which makes it insuitable for performance sensitive applications. Because you do not always know performance requirements when you start development, Java thus becomes an inefficient language for programmers who need to compete on a cost/performance basis. Of course performance tuning takes a lot of the programmer's time!. Whether or not handles are used for a particular object reference should be something that is configurable at the container level for an object. Clearly this affects all code which uses that object, but if the language defines only one way to dereference a pointer, it limits the flexibility of the language greatly. I do believe that in some cases the ability to move objects in memory can make memory use more efficient by since it also expands the amount of memory used by the application and causes a hit on every pointer dereference, it should not (in my mind) be fundamental to the language. Also note that handles are stored in a separate region of memory which decreases the effectiveness of the on-chip processor cache.
Java uses a hack (in the interpreted version at least) for interfaces where it patches a "guess" offset into the instruction in the byte code for the interface de-reference. The instruction in the byte code stores the last offset used for this de-reference. When encountering an "a.b" reference where "a" is an interface, it looks in the method name table at the "guess" offset for the object currently assigned to "a". If so, it uses this method and if not, it begins the linear search for the proper method in this object in the method table. In many normal cases, this no doubt causes a lot of overhead though perhaps they have fixed this in the JIT version somehow.
Java has one other important benefit over purely fixed languages like C and C++ in how it supports incremental linking. Each class can depend on or use other classes but the class file in general does not duplicate information even from superclasses. As a result, you can frequently modify Java classes without having to recompile classes that they depend upon. But this flexibility costs at link time because the offset tables have to be built up for each class incrementally as it is loaded. Java also lazily links in method references between classes - again by patching the byte code using self-modifying code. Clearly this architecture makes shared libraries a lot more complicated to implement and also impacts the startup time.
In my prototype, I have a mixed model which allows both fixed and dynamic bindings for the same language construct. In the model for the application, we can determine whether a given slot is fixed or dynamic and generate the description. Further, when compiling in a reference to a member, we can generate a fixed or dynamic offset into the reference depending on rules we have defined.
For slots which are not marked as permanent, they might go into both the fixed and dynamic lists. For example, within the same module you would access them as fixed but from another module you'd access them as dynamic. When fixed offsets are changed in a "published" layer used by other modules/applications, any module which requests notification will be recompiled automatically and its data migrated. data migrated. This is a painful process so I think the basic rules in the system should tend to keep values as dynamic until we need to tune accesses to them as part of the performance tuning process. A wizard can display the most frequently accessed member variables (and where they are accessed from) which are only accessed via a fixed offset in the test suite. You selectively then can mark as fixed only those members which need it. The code manages this meta-data in a separate layer so it does not overly complicate the code itself.
I hesitate to call this functional programming but there are some similarities. In a pure functional program, a program is defined by specifying a "functional expression" relating mathematical functions which do not have any side effects. You build a graph which represents the computation and then evaluate it in whatever way seems most efficient. The idea is to compute and cache values to minimize computation and/or memory.
Lots of languages integrate functional constructs with procedural (or imperative) constructs - two interesting examples are python and ocaml and are good models to look at but are lacking in other respects. Usually you evaluate a functional construct to retrieve a value and it may in turn use a dynamic value from the application in this construct.
In the model I'm thinking about takes a bit more flexible approach. We build up a program tree when generating the program. Many nodes in this tree embed functional constructs - at least in the same way that a SQL query is functional. Because it is a generative model, these constructs can evaluate when generating the code, get turned into little compiled chunks of code, or if necessary turned into data structures that are dynamically executed when running the compiled version.
One key goal I have in designing a language is to build powerful and efficient abstractions for the programmer. Eliminating side effects from a particular construct - i.e. making it an unconditional rule - is one important way to simplify something as you have eliminated all other possibilities. If you say: A *is* B (i.e. a lasting "functional" relationship) we have constrained the system much more tightly than if we say A equals B using the traditional assignment operator.
I think a good language becomes functional in this looser sense because in choosing the simplest most powerful abstraction for a given domain, you choose one where as much logic as possible goes into the 'declarative', as opposed to the 'imperative' part of the program. Declarative constructs are not order dependent and so are much easier to view and edit using tools and code generators. By putting as much logic as possible into these declarative constructs, you simplify the application.
One problem with functional systems and "rule based" systems is that they do not allow for conflicts and yet conflicts arise in real programming environments all the time. Rather than trying to generalize the functional language to make it possible to eliminate all conflicts, I think a better approach is to allow more flexible handling of the conflicts - both detection and avoidance as well as exception handling. In some cases, perhaps a wizard is invoked to help build a rule to resolve some types of conflicts.
Another way folks describe language constructs is as declarative and I think this might be another useful way to categorize programming constraints and is probably more suitable to describe what I'm talking about than "functional" but I've seen these terms used in different ways.
Developer tools should be developed which identify potential deadlock cases by analyzing patterns of locks held at any given moment. When it sees locks acquired in a conflicting order, it can signal a potential deadlock and provide the stack traces which are in potential conflict.
In my approach, we mark such state as
"per-operation(parse)". These variables are defined at the beginning of the
parse operation and are available for all methods invoked in a call tree below
parse. The generator will create a struct called
If a variable is
accessed from a method not directly invoked from parse, we detect this as
compile time and trigger an error which can eventually give the user a
wizard to help the user repair the problem. They might need to change the
scope for the state or have some other logical error in their thinking.
Advanced programmer's can define their own application specified scopes as
well: per-order, per-workflow, per-project etc. These scopes are coordinated
in helper code which associates the current object with the given thread thus
setting up a safe environment for script level programmers to operate in
without having to worry about lifecycle issues for these application specific
objects. They simply refer to them by name which gets mapped to the
appropriate physical "struct" in that context. Since typically this mapping
is bound at compile time (you know both the object and the scope), you can make this
mapping transparent or at least pretty efficient. Just how efficient remains to be seen, but I think
that if you assume that the scope will not change it should be as efficient
to use scopes as other ways of accessing variables.
Interestingly, this "scope" concept is used in my prototype to implement
the class/instance relationship too. Values stored in the class as simply part of
the per-process state of the object. Further, we separate read-only
per-process state into a different scope so that it can be shared between
processes using the same object.
The container decides how an
address for the object is assigned, how accesses to this object are retrieved
at compile time and runtime. The rules for retrieving a container are defined
using a mapping involving both the object and a given scope. In other words,
a given scope will provide a default container. For a given pattern of
objects, we might associate a different type of container for a given scope.
The container defines the physical storage location too - whether it is in a
database etc.
The container is the basis for the implementation of separate heaps,
persistence frameworks etc. Containers can be nested. In fact, a simple
object is itself the container for the properties inside of it. Contained
objects can either be independent or owned. If they are owned, their
lifecycle is tied to that of their container. For efficiency, most properties
will be owned by their objects which attaches their lifecycle to that of the
parent. It allows us to allocate them as sub-objects. It will be rare for
code to ask for the identify of those owned-property objects - usually, you'll
ask for them by their value. But these objects remain full fledged objects
which can store arbitrary meta-data even at runtime. Each parent object can
stores a set of references to owned-properties who have sub-objects created
and lazily can create them when they are non-trivial and remove them if they
are trivial.
To me a better model is to give each parameter a name. The programmer can
either specify argument values by position or by name. If they omit the name,
the compiler can optionally insert the name into the code view for clarity if
desired by the programmer.
Each value in an ordered list can either be a primitive/leaf value, or may
have its own associated meta-data. It is important that you can extend any
value in the system to add additional meta-data (and the associated
constraints or behavior) without breaking any interfaces or having to rewrite
the code. Each object in the system can have its
own ordered/named list associated with it. Thus data is hierarchically
structured to an arbitrary degree.
It is true that promoting simpler concepts like lists to this more general
data type can cause inefficiencies, but I think these are largely removed in
the generative model. You have the opportunity to do analysis of program
regions, choose the most efficient data type to represent a construct and then
ensure that all references to you use the right type.
The most general form of the data - the meta-model of
the program - only lives during compilation time, when running the
interpretive model, or perhaps during install/re-configure.
The compiled version of the program will
separate itself from the meta-description when performance is a factor.
Given the modern memory
requirements of users, taking up the users memory is not a problem but taking
up their time is. An efficient low foot-print application of 1 or 2M that
bloats up to 20-40M when in a diagnostic/install/configure mode is acceptable
and I think a technique which will offer users the most satisfaction.
For example, the test
environment for a given application would live on the server at a known
location but is fully available for the application if it encounters any
failures. It runs the suite - finds a patch and downloads and installs it
Rather than specifying a well-defined language grammar, I will create a robust
test suite which defines the language inductively rather than deductively.
Changes to the language that break these tests will not be allowed. As long
as we are careful in designing both the language and the test suites, I think
this will make a language which is easier to learn, easier to extend and where
the code is more accessible to the more programmers.
Of course this is not a new idea and there are some good reasons why I think
folks do not take this approach when writing programming languages:
Only the hard-core geeks learn the language from the grammar and I believe
that is only because it is the most concise page which contains the necessary
keywords. The simple examples or equivalences page would be just as effective
for these folks I think.
Since the language is itself implemented using an analytical process - it is
going to be well defined if it is well implemented. Open source review
and good testing will prevent major flaws from being introduced. Inductive
tests will I think adequately constrain the implementation and give real
confidence that the language works by practice not proof. By incorporating
end-user applications into the testing phase, any language modifications
can be validated with large amounts of code before they are released.
With a language specification perhaps we are lulled into a false sense of
security by the spec into thinking that it encapsulates portability of code.
So many factors in the implementation can break working code without touching
the language grammar so I do not think we are losing anything here.
By implementing a language that can be trivially extended by the user, using
their own private dialect, we develop one family of interconnected languages.
Since language elements can be shared or re-mapped in different dialects,
languages can differentiate themselves like different species while passing
back and forth potentially remapped or customized genes which are known to be
effective. Any application can use whatever language elements from whatever
dialects it wants.
The third problem above - the inability to write a new parser - is perhaps the
most troubling argument for why a grammar is necessary. But I think that as
long as language
elements in the dialects which implement broadly applicable features
are licensed properly (i.e. free and open source), the real question is how
hard it is to write a translator into another language. To protect ones work,
having your information technology tied into a single language makes you
vulnerable. But, the whole point of my project is to build a translator from
this language into an efficient version of the application in the host
language (for now C and Java). If a new language comes out and you want to
migrate your code base to that language, you can build a translator for that
language.
Some interesting systems I've seen so far are
jade,
eyedb,
dylan,
o'caml.
globus grid toolkit
I'm interested in any feedback you have on these ideas - send me an email to
jeff@jvroom.com.
Visitors to this page:
Containment
Each object can have state for one of several scopes. We collect the state for
a given object and partition in by scope into separate "records" which are
optionally chained together via references (for example, you can get from the
object foo's instance state to its type state by following the type link).
When you define a new scope, you specify which scopes it is chained To or
from.
Testing
The language should include tools for automatically generating tests for
individual pieces of code. It can record all state used and updated in an
operation as well as what code was modified in the recording of the test.
To replay this test, it must
restore the initial conditions of the test, then invoke the code it is testing
and compare it to the known good result. External services used by the
application should be simulateable so that they too can participate in test
generation. With this type of system, the language can maintain test
databases and automatically run tests before checked in changes are merged to
a shared layer.
Fundamental Data type
I believe that these can all be best implemented in a system with a single
elemental data type. But where Lisp has its "list", this language would use
an ordered list of name/value pairs. A list really is a degenerate form of
this data type (where the place in the list is used as the name of the value).
I think using this data type removes the order dependence of arguments in lisp
expression which created an awkward syntax. It is not easy to figure out what
the roles of the arguments are simply based on their position in the argument
list. The order of parameters in a list is arbitrary
memorization which is difficult for regular people.
Language Syntax
I'm still evolving the specific syntax and my goal is to support different
"dialects" which can be turned on and off on a per-statement level. Dialects
will be created for users of existing languages like Java, Python, which
preserve their basic aesthetics. Dialects can also be used for various ways
of nesting lists and escaping code. The idea is to make it easy to import
languages, to disambiguate ambiguous contexts, in a context sensitive grammar to
turn a file into a program.
Conclusion
So this is the state of my thinking so far. Right now, I have a small C
prototype which implements the "dynamic/fixed" struct format I want to
generate from my language. This format allows for both fixed binding between
components (ala C) and also dynamic binding (which allows dynamic re-linking
without re-compilation for compatible changes). I'm now writing
the parser and generator for the language in itself. When this is finished,
if I still think it is looking good I'll translate that to C by hand and try
bootstrapping the language that way.