Monday, March 25, 2013

C^1-Lohner IVP solver specification

Last week Jan started looking at ZgliczyƄski's paper C^1-Lohner algorithm as a source for the specification of an IVP solver that can produce narrowing enclosures for systems with asymptotically stable solutions. The paper proved to be written in a very accessible style, and after resolving the usual notational ambiguity Jan found the presentation to clearly specify an algorithm up to choices of concrete algorithms for computing components, such as a Taylor method, or a QR decomposition. As always, when provided with a functional specification as this one, the computational mathematician will still be left to make the appropriate choices of data structures and concrete instances of algorithms. These are the tasks that Jan will target next.

A work sharing scheduler prototype for unbalanced workloads

During the past week Jan and Adam have been working on a work sharing scheduler and thread pool for Acumen's parallel interpreter. The main issue the new scheduler is designed to address is that the current scheduler is limited to a static assignment of work to the threads that are managed by the thread pool. The following model illustrates the main problem with this approach.

class A(n, mode)
  private x := 0 end
  switch mode
    case "Spread"
      if n > 0
        create A(n-1, "Spread");
        create A(n-1, "Spread")
      end;
      mode := "Work"
    case "Work"
      x = sum i*i for i in 1:100 if true
  end
end
       
class B()
end

class Main(simulator)
  private
    mode := "Init"
  end
  switch mode
    case "Init"
      create A(8, "Spread");
      create B();
      mode := "Persist"
    case "Persist"
  end
end


The above model is an example of a highly unbalanced workload, where one of the children (the A instance) of the root object corresponds to practically all the work done during the simulation. However, with only two threads to distribute among the root's children (on a dual-core machine) the static scheduler will waste one of the threads on the B instance, leaving the entire computation of the A instance (and all its descendants) sequential. There are many ways in which such an unbalanced workload could be scheduled more efficiently. For instance, one could increase the number of threads in the thread pool above the number of cores available on the machine. Such an approach will suffer from the overhead incurred by the creation and garbage collection of threads. The strategy pursued this week tries to reallocate the remaining work as evenly possible among free threads as they finish their work. The current implementation can handle unbalanced workloads reasonably well, but additional benchmarking is needed to understand how it performs on a variety of models and simulator parameter settings.

A considerable amount of time during the implementation of this scheduler was spent on debugging deadlocks in the implementation of the thread pool and scheduler. Aside from using the Eclipse debugger and good old fashioned print statements, Oracle's VisualVM tool proved to be very useful. It collects a number of tools from the JDK, including a thread inspector and most notably a way to dump the state of all threads currently running inside a given JVM. Whenever a deadlock occurs, it suffices to look in the thread dump to see what each thread is up to at the moment.

Monday, March 18, 2013

Parallel interpreter determinism and load balancing opportunities

Last week Adam and Jan continued the work on Acumen's parallel interpreter. They implemented a version of the parallel interpreter regression test that, instead of comparing the output of an interpreter with a previously cached value, compares the output with that of a reference interpreter. They also conducted an experiment to test if the construction of a structure encoding the order in which changes to the object graph of an acumen model are discovered, is indeed necessary. The conclusion was that the structure is indeed needed to ensure that the parallel interpreter is deterministic. During this work a hypothesis about the scheduling strategy currently employed by the interpreter was formed. It seems that on each traversal of the object graph each available thread is only allocated work once, leading to a potentially unbalanced load distribution. This potential source of optimization opportunities will be investigated next.

Tuesday, March 12, 2013

Planar Vehicle Simulation

Jawad had created the next simulation case for longitudinal function called Collision Imminent Braking. The simulation case describes the testing vehicle approaching from behind towards the reference stationary object (car/pedestrian). Simulation shows the testing vehicle kinematic maneuver when critical distance between testing vehicle and reference stationary object/Vehicle crosses threshold values. Two wheel planar differential vehicle was modeled in local reference frame, where the position, orientation, linear velocity and rotational velocity were defined in terms of right and left side wheels. The yaw angle was used as the control variable for safe maneuver. The local planner position and velocity vector was then translated in global cartesian coordinates.  3D visualization of the planar vehicle was represented by rectangular box; the straight road was represented by a fixed rectangular box; the sensor was represented by the single variable which calculate the distance between testing vehicle and stationary vehicle. The left and right tire constant velocity was provided to get the velocity and orientation at the center of the vehicle. Please click Simulation 2 to watch the video.

--
Dr. Jawad Masood
Postdoctoral Researcher, Halmstad University
Kristian IV:x vag 3, office F312,
302 50 Halmstad, Sweden
Tel:  +46 (0) 35 167 308
Cell: +46 (0) 76 221 2643
http://hh.se/

Monday, March 11, 2013

Regression tests for Acumen's parallel interpreter

Last week Jan continued the work on Acumen's parallel interpreter. In preparation for the planned optimizations and alternative scheduling strategy experiments he adapted the property-based regression tests previously developed with Adam for the enclosure interpreter. Currently, changes to an Acumen model's object graph are not performed immediately during evaluation of a discrete step, as is the case with discrete scalar variable updates, but instead logged and performed once the discrete steps have stabilized. Jan wonders if it is possible to do away with these logs entirely without changing the semantics of object graph-manipulating models. The regression tests will be handy when attempting this optimization as they test functional equivalence of an interpreter against a pre-recorded baseline.

Sunday, March 10, 2013

Talk about Delite

During the past week several members of the Effective Modelling Group visited Chalmers University of Technology to see a talk by Kunle Olukotun about the Delite framework. The talk introduced the framework as applying the DSL approach to parallelisation, striking a balance between productivity and performance, at the cost of generality. Several implicitly parallel domain-specific languages implemented on top of the framework were given as evidence for the success of the approach. The talk used one of these languages, OptiML (a language for machine learning applications) to illustrate the use of Delite's built-in operations (map, reduce, fiter, etc.) in creating the OptiML compiler. For those who missed the talk, some parts were covered in Kunle's ICFP 2012 keynote presentation which is available on YouTube.

Monday, March 4, 2013

Acumen's imperative reference interpreter

Over the past week Jan has been taking a closer look at Acumen's parallel reference interpreter together with Walid, Corky, Kevin and Adam. Kevin and Adam have been leading the profiling effort, in order to identify potential bottlenecks and to lay down a baseline to measure improvements against. Corky, Walid and Jan have together with Kevin and Adam been looking over the source code to determine if there is an obvious way to start the optimization work, and in the process Jan has come believe that the sequential algorithm that the parallel interpreter defaults to when given a single thread to execute on is in some ways simpler than the pure functional reference interpreter. At some point Jan will look to extract the sequential implementation from the parallel interpreter to have as a potential entry point into the code base for collaborators with an imperative programming background.

Chroma sub-sampling on Ambric

During the past week Adam has been learning about programming the Ambric processor in the Embedded Parallel Computing course. This highly parallel architecture (each processor contains over 300 cores) is interesting in that, compared to other common manycore architectures such as GPUs, it is especially designed to support the data flow programming model. Algorithms are programmed using a C-like language called aJava, and the data flow is described using the aStruct language. The application implemented on top of Ambric was Chroma Sub-sampling, which compresses a stream of images by re-using color information in adjacent pixels. Though Adam had issues with the tool chain, which produced some rather cryptic error messages at times, the programming model was significantly easier to grasp than CUDA.