Grid and hydrology
Comments morning of 28/11/07
Here are comments on V2 in no particular order.
- Introduction paragraphs 1 and 4: we talk about "standard escience tools". I am inclined to think that the word "standard" is not helpful here, because generally research councils are not keen on funding things that are standard. On the other hand, hype words such as "novel" may also be counterproductive. I wonder whether "emerging" is a better word.
- But the whole sentence in paragraph 4 that uses the word "standard" rather plays down the role of escience to one very specific aspect of the project, whereas I think that we have to make the point that escience is what is going to transform the science. eScience is such a wide-ranging word that it is possible to make such a statement without treading on too many sensibilities. For example, the use of web pages is not escience, but portals are (particularly if they have portlets and stick to standards). Anyways, my point is that this paragraph needs to make two points (if I understand the science argument): first, we are now poised to do some groundbreaking work based on the fact that we believe that there is a way to characterise water runoff; second that to do this properly requires some good escience work.
- Last sentence in paragraph 4 of the Introduction: is the primary point that inductive approaches are much less worked out, yet with enough data (observational or model) they could be the most powerful? Can I check, are the terms downwards/upwards correctly applied?
- My feeling is that the Introduction needs subheadings because it does too many things. Of course, this is a matter of style and is personal, but my take on this is that to make a proposal readable it needs to have subsections that make one major point only. In our Introduction I might suggest the following subheadings. a) Background in which we point to the problem, and which might end around the end of paragraph 4 or 5; b) a precis of what we are going to achieve (some of what follows), c) what the partners bring to the project that mean that it is timely and that they are the ones to do the work. I think all this stuff is in the Introduction, but it needs ordering as such.
- The preceding point would actually resolve three other comments I have about the Introduction. a) the link (or perhaps the textual transition) between science and escience is not very clean at the moment; b) nor is the textual transition between what the two partners bring to the project; c) I don't think we really make a clean statement of the problem we are tackling in specifics, and in particular we do need to make clear why this project is a major project and not something that could be easily achieved. With regard to these three points, I think I understand the ideas, but the text doesn't make the transitions between them. I think that effective use of subheadings will fix this.
- I didn't really understand the point in the Introduction about the proposed work not being in conflict with the work of Duan and Gupta.
- First aim/objective needs expanding I think.
- What is needed I think is a link between the grand vision set out in the Introduction and the specific vision of this project. This should be in the aims and objectives section, but I think that the aims and objectives mix together grand vision and specifics. I would have a grand vision statement at the start of the aims and objectives, and a statement that the grand vision will be achieved by a list of specific task-oriented objectives.
- WP1 is unbalanced between sections a and b in my view. The point in WP1 is that there are two sources of data, real observations and the predictions of models. I think that the proposal will read better if both sections are of more-or-less equal length, but at present they differ greatly in length. The observational part is itself in two parts, namely time-series and spatial. I would make this point in the opening paragraph, and slightly trim the descriptions. The summary paragraph of part a has a lot of detail about specific issues that rather takes away from the main point. Although we have to mention licensing, would it not be better in a footnote?
- I would think that WP1 should be the place to describe how simulated data will be generated. Here is the place to say that it will need grid computing to get the required high throughput. I think we have to talk about the simulation codes here, or at least reference them and back-reference to the Introduction.
- I am happy to edit b but will want to chat beforehand.
- I am not sure about the balance in WP2 – which is my bit. It would be useful to chat.
- WP3 probably contains the simulation detail I have said is lacking in WP1. Maybe we need to simply create proper hooks/links as we edit the text. WP3 isn't really linked in to the grid part.
- WP4 is where we talk about the user interface. Here I would like to flag the point I made about advice to users being created on-the-fly rather than hard-wired into the interface (see my notes below). The point is that if you hard-wire stuff into interfaces, then whenever the simulation code gains new functionality there will be a lot of knock-on effects. On the other hand, if you can modularise the process in the way we do it in MaterialsGrid, then it is easy to incorporate new functionality or even use new simulation codes. In short, the approach is that you have XML files that contain all the information that the portal needs to collect, and these files are read by the portal/portlet and turned into requests. The portlet has no code-specific hard wiring, but simply interprets the XML files it reads. XML files are much easier to change than code.
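To make the last point concrete, here is a minimal sketch of an interface driven entirely by an XML description. The element names, attribute names and the functions `load_fields` and `build_request` are all invented for illustration and are not the MaterialsGrid format; the only point is that adding a parameter means editing the XML, not the code.

```python
import xml.etree.ElementTree as ET

# Hypothetical task description: names are illustrative only,
# not the format used by the MaterialsGrid portlet.
TASK_XML = """
<task code="example-runoff-model">
  <parameter name="catchment_id" type="string" label="Catchment identifier"/>
  <parameter name="timestep_hours" type="float" label="Time step (hours)" default="1.0"/>
  <parameter name="n_ensemble" type="integer" label="Ensemble size" default="100"/>
</task>
"""

# Map the type names used in the XML onto Python converters.
CONVERTERS = {"string": str, "float": float, "integer": int}


def load_fields(task_xml):
    """Return the input fields the interface should present, in document order."""
    root = ET.fromstring(task_xml)
    return [dict(p.attrib) for p in root.findall("parameter")]


def build_request(task_xml, user_input):
    """Turn raw user input (strings) into a typed request, guided only by the XML."""
    request = {}
    for field in load_fields(task_xml):
        raw = user_input.get(field["name"], field.get("default"))
        if raw is None:
            raise ValueError(f"missing value for {field['name']}")
        request[field["name"]] = CONVERTERS[field["type"]](raw)
    return request
```

Nothing in `build_request` knows about any particular simulation code, so a new code or a new option only needs a new or edited XML file.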
Post from the weekend
A lot of this proposal has a lot of nice links to work we have been doing in Cambridge within all three of our escience activities, eMinerals, NIEeS and MaterialsGrid.
There are some core ideas that link all three efforts, namely
- Grid computing can be extremely useful for running ensemble jobs, and generally grid computing increasingly is a route towards availability of computing power.
- Data management involves having data within data repositories with access to collaborators in a transparent manner – this is what the SRB gave us, but the key gains of the SRB could be replicated using alternative technologies (our favoured route is via webdav servers).
- Data management requires proper metadata capture, and with very rich metadata one can use the metadata as an interface to data.
- The key to making grids easy to use is to represent data in XML format. This makes reading data files relatively easy as one gain (we have tools to perform a transformation to XHTML on the fly, with graphs drawn using SVG on demand), but it also makes extraction of information easy. We use this for gathering metadata for output files during their grid runs, for example.
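As an illustration of the last point in the list above, the following sketch pulls metadata out of an XML output file. The XML layout and the function `extract_metadata` are assumptions made up for this example; they are not the format written by FoX, nor the way our own tools work internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical simulation output; the element names are illustrative only.
SAMPLE_OUTPUT = """
<output>
  <metadata name="code_version" value="1.3.2"/>
  <parameter name="timestep_hours" value="1.0"/>
  <property name="peak_runoff" units="m3/s">
    <scalar>12.7</scalar>
  </property>
</output>
"""


def extract_metadata(xml_text):
    """Collect name/value pairs from metadata, parameter and property elements."""
    root = ET.fromstring(xml_text)
    items = {}
    # Simple name/value elements are copied across directly.
    for tag in ("metadata", "parameter"):
        for element in root.iter(tag):
            items[element.get("name")] = element.get("value")
    # Output properties carry a value and, optionally, units.
    for prop in root.iter("property"):
        scalar = prop.find("scalar")
        if scalar is not None:
            value = scalar.text.strip()
            units = prop.get("units", "")
            items[prop.get("name")] = f"{value} {units}".strip()
    return items
```

Because the structure is explicit, no code-specific text parsing is needed; the same function works for any output file that follows the same conventions.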
Different projects are bringing different tools to the table:
eMinerals has developed
- a grid job submission that fully integrates data management and metadata capture (RMCS)
- a set of XML-writing libraries for Fortran (FoX)
- XML transformation tools (ccViz) and plotting tools (pelote)
- metadata tools (RCommands)
NIEeS has developed
- a KML version of FoX
- a grid infrastructure based on standard middleware tools and other Cambridge tools
MaterialsGrid has developed
- a portal/portlet interface to job submission and data management
- a service-based infrastructure for setting up and running jobs, undertaking workflows, and managing data
- a tool for extracting information from XML files based on the use of XML dictionaries (Golem)
- a tool for converting XML to SQL
It can be seen in all of the above that XML is important to our work. Our experience is that XML is actually critical, and I would advocate using it within this proposal. The cost overhead is now not high (eg using our FoX tool for Fortran, and there are decent XML tools for other programming languages), and you lose nothing by using it, but the gains are enormous.
WP1
I would advocate using our xml2sql tool for this work. The tasks required here are
- Ensure that the simulation codes are writing an appropriate XML. This is a task we are currently thinking about in general. KML is the XML for Google maps, but it is not adequate on its own because it carries information about representation, not the raw data. Different XML namespaces can easily be mixed within the same XML document. A subset of GML might well be useful.
- Write a good database schema, which you need for any xml2sql tool.
- On the use of SRB, I like SRB a lot, but it is not the only tool we can use and I would advocate that we do some thinking here. The main issue with data grids is over who controls the data. There is a clear risk if access to your data is dependent on another institute, because you have no control over continued access. Most data grid products, SRB included, have a model that presupposes that the data grid will last for a long time. Our idea is to use a set of locally-based webdav servers, with a metadata interface that provides access to all the files, enabling sharing of data but without compromising ownership.
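The first two tasks above (appropriate XML plus a database schema) fit together roughly as in the sketch below. This is not our xml2sql tool: the XML layout, the table definition and the function `load_series` are assumptions invented for the example, and SQLite is used only because it needs no server.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema for a runoff time series; a real schema would
# be designed around the actual observational and model data.
SCHEMA = """
CREATE TABLE observation (
    station TEXT NOT NULL,
    time    TEXT NOT NULL,
    runoff  REAL NOT NULL,
    PRIMARY KEY (station, time)
)
"""

# Hypothetical XML produced by a simulation code or a data converter.
SAMPLE_XML = """
<series station="gauge-01">
  <value time="2007-11-01T00:00">3.2</value>
  <value time="2007-11-01T01:00">3.9</value>
</series>
"""


def load_series(conn, xml_text):
    """Insert every <value> of one <series> into the observation table."""
    root = ET.fromstring(xml_text)
    rows = [
        (root.get("station"), value.get("time"), float(value.text))
        for value in root.findall("value")
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", rows)
    return len(rows)


def open_database():
    """Create an in-memory database with the schema already applied."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn
```

The schema is where the real thinking goes; once it exists, the mapping from XML to rows is mechanical.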
WP2
I am not entirely sure what goes in here, but here are some ideas of things that I think we should do.
We need a grid job submission system that is not tied to any non-standard system. At the present time, the key tool around is Globus. GridSAM is being developed within OMII, and I think we have to mention it, but we have evaluated its current status and found that it is not robust and is lacking in documentation. On the other hand, Globus may well be heavyweight and hard to use for new people, but there is a lot of expertise.
Our approach has been to provide tools that interact with one or more Globus servers. It is completely impractical to put Globus on the user's computer for several reasons (eg need for static IP address and name, issues with installation, doesn't work on Windows). Then you have the issue that writing Globus commands is not easy – well, some are of course, but scripting jobs and workflows on a case-by-case basis is not easy to do and even harder to debug. So eMinerals has developed its RMCS system, and both NIEeS and MaterialsGrid are running an independent RMCS instance.
Let me specify what RMCS does. It is based on a server that will submit Globus jobs. At its heart is a Perl program called MCS (My_Condor_Submit – so called because we use the Condor-G interface to Globus, and thus we use Condor-like scripts).
MCS integrates data management with grid computing in one very specific way. It grabs its data files not from the user's computer but from the data grid, and it writes all output files back into the data grid rather than sending them to the user's computer. The user can then access the files from the data grid. One key advantage of this is that the user ends up with a complete and reasonably-protected archive of all files associated with a job, without having to do anything about it. In short, MCS builds data curation into the job submission process.
Moreover, MCS will collect metadata automatically, by extracting it from the output XML files. We collect various sorts of metadata, including information about the job environment (date, machine etc), all metadata the program throws out (eg code version number), all input parameters, and core output values (these are the only bit that the user needs to specify).
MCS requires a relatively lightweight script from the user, in a relatively easy format, that specifies the location of directories in the data grid, the name of the executable etc.
MCS is the tool that does the job submission and data management. RMCS is the way the user interacts with MCS. RMCS basically consists of a server and a database. The server receives instructions via web services from client tools, it submits the MCS job, and it keeps records of how the job is doing in the database. In practice the client tools interact directly with the database (a bit of side information). We have 2 client tools. One is a set of shell commands for tasks such as submitting a job, checking the job status, and deleting a job. The other is a Java GUI which basically does the same thing except allowing users to press buttons rather than type in commands.
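The job lifecycle just described (inputs come from the data grid, outputs go back to it, and a record of progress is kept) can be sketched as follows. This is emphatically not RMCS or MCS, which are built on Perl, Condor-G and Globus: here a local directory stands in for the data grid, a dictionary stands in for the status database, and `run_job` and `demo` are names invented for the illustration.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_job(grid_dir, input_names, command, output_names):
    """Stage inputs in from the data store, run the command, stage outputs back."""
    grid = Path(grid_dir)
    record = {"status": "staging-in"}
    # The job runs in a scratch directory, never in the user's own space.
    with tempfile.TemporaryDirectory() as work:
        for name in input_names:
            shutil.copy(grid / name, work)
        record["status"] = "running"
        result = subprocess.run(command, cwd=work)
        if result.returncode != 0:
            record["status"] = "failed"
            return record
        # Every requested output is archived back into the data store.
        for name in output_names:
            shutil.copy(Path(work) / name, grid / name)
    record["status"] = "done"
    return record


def demo():
    """Run a trivial stand-in 'simulation' that upper-cases its input file."""
    with tempfile.TemporaryDirectory() as grid:
        Path(grid, "input.txt").write_text("runoff")
        script = "open('out.txt', 'w').write(open('input.txt').read().upper())"
        command = [sys.executable, "-c", script]
        record = run_job(grid, ["input.txt"], command, ["out.txt"])
        return record, Path(grid, "out.txt").read_text()
```

The design point is that the user never handles the files directly: the archive in the data store is a by-product of running the job.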
Now the RMCS system allows any process to send off web services calls, so it can be used in conjunction with any portal. This is what MaterialsGrid uses for its portal-based job submission system. Because everything is done using 'standards', it works well.
RMCS can submit jobs to any system that uses standard middleware such as Globus. Thus it works on things like the National Grid Service (worth a mention), and should work on EGEE (but we haven't tried it). We also have it working on the NW-grid. There is a need to have some things installed on the grid resource besides Globus; external resources will need XML, metadata and data grid tools, but these can be installed on a user basis if we can't get them installed system-wide (we have them system-wide on the NGS).
The point of this discourse is to note that I would advocate using this system for the proposed simulation runs. If we do, then our coding effort is raised to a higher level of interfacing with RMCS rather than with the underlying Globus calls. We get a lot "for free" with RMCS, as MaterialsGrid has realised.
I am not clear as to how much workflow is required. There are various routes to this.
If the workflow is fixed and straightforward, it can be executed within a shell script. This requires little effort really, and we do it when we submit tasks that involve a mix of simulation and analysis.
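For a fixed workflow of this kind, the logic amounts to running the steps in order and stopping at the first failure. We do this in a shell script; the sketch below shows the same idea in Python purely for illustration, with placeholder commands and an invented function name `run_workflow`.

```python
import subprocess
import sys


def run_workflow(steps):
    """Run each (name, command) step in order; stop at the first failure.

    Returns the names of the completed steps and the name of the failed
    step, or None if every step succeeded.
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder steps: real ones would call the simulation and analysis codes.
EXAMPLE_STEPS = [
    ("simulate", [sys.executable, "-c", "print('simulation step')"]),
    ("analyse", [sys.executable, "-c", "print('analysis step')"]),
]
```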
In other cases, the workflow may be generated (or at least defined) by the code's XML dictionary, and put together by a tool such as the portal. This is exactly what MaterialsGrid does. But you then have the question of which workflow tool to use. We use Pipeline Pilot (PP) because it works really well, much better than BPEL (the tools are buggy, and only implement parts of the BPEL standards), but BPEL is free and PP is not.
WP4
The idea of having an interface guide the user through the issues associated with a program is something we have been working on in Cambridge, and I would be keen to include it here.
The idea is that in a general framework, any hard-wiring against the requirements of a specific code immediately throws away the generality. For specific projects where no-one will have significant changes of mind, this need not be a worry, but in the wider picture, it is nice to have no code-specific hard-wiring within any infrastructure.
Within the MaterialsGrid approach, this is tackled using the Golem tool in combination with an XML task list. This is accessed via a special portlet written for the MaterialsGrid portal.
To get this working, one needs to create a sample output file from the code. The Golem tool can then be used to create the dictionary, in terms of defining for each item its data type, units etc. The human-readable part can be added later (and is worth the effort).
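The step of building a dictionary from a sample output file can be sketched as below. This does not use Golem and does not reproduce its dictionary format: the XML layout, the entry fields and the functions `infer_type` and `draft_dictionary` are assumptions for the illustration. The empty `description` field is the human-readable part that gets filled in later.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample output from a simulation code.
SAMPLE = """
<output>
  <property name="peak_runoff" units="m3/s">12.7</property>
  <property name="n_steps">240</property>
  <property name="catchment">demo-catchment</property>
</output>
"""


def infer_type(text):
    """Guess a data type from a sample value: integer, then float, else string."""
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            caster(text)
            return label
        except ValueError:
            pass
    return "string"


def draft_dictionary(xml_text):
    """Build a draft dictionary entry (type, units, description) per property."""
    root = ET.fromstring(xml_text)
    entries = {}
    for prop in root.iter("property"):
        entries[prop.get("name")] = {
            "type": infer_type((prop.text or "").strip()),
            "units": prop.get("units"),  # None when the sample gives no units
            "description": "",  # human-readable text, to be written by hand
        }
    return entries
```

A draft produced this way still needs checking by someone who knows the code, since a single sample cannot reveal every type or unit.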
Summary
What I have done above is give my take on the grid stuff. I think that the next stage is to liaise as to how to link this in specifically. I think that the grid roadmap is reasonably clear, just as the science roadmap looks clear to me. What we have to do is ensure that the two match seamlessly, which means (in my view) adapting the grid stuff to the science drivers. I am sure this is not hard.
We will need to make some specific comments from our side, such as the design of the data grid, the design of the portal, and the XML-isation of the simulation codes. The latter is a bit of work only in one sense, namely defining the XML language. The actual mechanics of adapting the codes is now straightforward.
But this is stuff we want to do anyways!
martin, 28-Nov-2007 06:00 (GMT)