The Cambridge-MIT Institute, whose remit from the government is to help UK academics have a greater impact on the UK economy, is putting together a consortium of universities, escience centres and companies in the industry to carry out this work. Applications are being made for resources to write the middleware that will form the core of the framework.
At the same time, members of the consortium are developing applications. Among these are road charging (including economic, social, health and land-use issues), the impact of transport on the environment, and various others. One project will use the City of Cambridge as a test bed to learn how to gather data from traffic sensors and other sources, such as CCTV security cameras or car park entry and exit gates.
The National Transport Data Framework will be made available for use by universities, transport operators, government departments and others. It is an open-ended venture, since there is a very large number of possible applications, funding for which will need to come from participant companies, the research councils, the DTI, the EU, etc. By joining a consortium, the participants will improve their chances of attracting such funding. Further, the contacts that the project generates among them will increase the influence of the universities' work and, very importantly, help to generate new ideas.
Initially the consortium consists of institutions in Cambridge, Edinburgh, Leeds, Newcastle and Southampton, together with Imperial College and University College in London. Thus it includes the country's leading regional escience centres and university departments of computer science and transport studies. It will also be essential to have active participation from transport executives and operators, regional agencies, and the DfT. Our initial contacts with these have revealed a great deal of interest in participating, though detailed negotiations with them have yet to begin. We will also have access to expertise in transport and in data handling at MIT.
The NTDF may be thought of as a hub, comprising the escience middleware, with spokes comprising its applications. These spokes will be independent to a large extent, though not completely so.
Sound transport research and policy making depend upon the availability of appropriate, high-quality and up-to-date information. Thus, the means by which transport-related data are collected, stored, processed and made available are of central importance to the outcome of research and practice. Any shortcomings in the underlying data propagate into the research and practice that rely on them.
The goal is to establish a Web service that provides common access to relevant data sources for a variety of users with an interest in transport. It is not the intention to store copies of the data, but instead to manage metadata describing the characteristics of each data set, and to support appropriate tools for interacting with them. Data are useless unless there is a clear indication of their accuracy. Part of the metadata description will record what is known about the uncertainty in the various attributes that are recorded in each primary data context.
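As an illustration only, a metadata record that carries what is known about the uncertainty of each attribute might be sketched as follows. All class names, field names, the owner and the endpoint URL are hypothetical; the proposal does not fix a schema, and a single standard error per attribute is assumed here purely for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AttributeUncertainty:
    """What is known about the accuracy of one recorded attribute (hypothetical)."""
    unit: str
    std_error: Optional[float] = None  # None means "uncertainty not yet known"
    note: str = ""


@dataclass
class DatasetMetadata:
    """Describes a data set held by its owner; the data themselves are not copied."""
    name: str
    owner: str
    endpoint: str  # where the owner serves the data (made-up URL below)
    attributes: Dict[str, AttributeUncertainty] = field(default_factory=dict)

    def attributes_without_uncertainty(self) -> List[str]:
        """Names of attributes for which no uncertainty has been recorded."""
        return sorted(
            name for name, attr in self.attributes.items() if attr.std_error is None
        )


# Made-up example record for a set of traffic loop counters.
loop_counts = DatasetMetadata(
    name="junction_loop_counts",
    owner="Example Highway Authority",
    endpoint="https://data.example.org/loops",
    attributes={
        "vehicle_count": AttributeUncertainty(unit="vehicles/5min", std_error=2.0),
        "mean_speed": AttributeUncertainty(unit="km/h"),
    },
)

print(loop_counts.attributes_without_uncertainty())
```

A record of this kind lets a user see at once which attributes still lack an accuracy statement before relying on them.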
We propose to build a programmable system giving access to both live and historical data. If data are of interest to any of the groups of users, then they are a candidate for inclusion. For example, traffic congestion will cause deterioration of air quality, so air pollution statistics are a candidate. We need more than access to data via websites for human viewing; there should be interfaces offering access to data streams. With a programmable interface, users will be able to tailor the data to their needs, and also make use of other sources of information such as pollution monitors, demographic information and so on.
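To make the idea of a programmable interface concrete, the sketch below shows how a user might tailor a stream of readings to their needs. It assumes, hypothetically, that each reading arrives as a dictionary; the function name `tailor`, the field names and the sample values are all invented for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Reading = Dict[str, object]


def tailor(
    readings: Iterable[Reading],
    keep: Callable[[Reading], bool],
    fields: List[str],
) -> Iterator[Reading]:
    """Lazily select readings of interest and project them onto chosen fields.

    Because it consumes any iterable one item at a time, the same function
    can be applied to a stored historical list or to a generator fed by a
    live source.
    """
    for reading in readings:
        if keep(reading):
            yield {name: reading[name] for name in fields}


# Made-up historical readings combining traffic flow and an air-quality measure.
historical = [
    {"site": "site_a", "flow": 80, "no2": 31.0},
    {"site": "site_a", "flow": 140, "no2": 44.5},
    {"site": "site_b", "flow": 120, "no2": 39.2},
]

# A user interested only in pollution levels during heavy traffic.
busy = list(tailor(historical, lambda r: r["flow"] > 100, ["site", "no2"]))
print(busy)
```

The selection and projection are supplied by the user rather than fixed by the provider, which is the essential difference from a website designed for human viewing.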
In order to implement such a system, it is vital that the stakeholders retain control of their own data, and are responsible for collecting and collating them, and ensuring that they are as accurate as possible. Security of the data is an important consideration. This means that it is impractical to consider a single centralised database holding all the relevant data. Instead we propose to implement a central metadata repository which would contain descriptions of data held elsewhere, specifying their physical format and information content. The metadata should also contain information on the provenance and reliability of the data. The remote data sources can then be accessed by users without detailed knowledge of the local system, such that applications can find the data they need, and interpret them in a coherent manner.
This approach has many advantages. The data themselves are held by the original owner, so there is no need to implement a large central store, which would require a large capacity and high-bandwidth access, and would make it difficult to ensure the integrity of the copied data. The system is scalable, as new data sources can be added at will by adding suitable metadata descriptions. Data providers are not required to adapt their data to a common format, only to specify their own metadata, which makes it possible to cope with legacy systems.
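A minimal sketch of such a central metadata repository is given below, under the assumption that each source is described by a small record and discovered by topic. The class names, fields, source identifiers and URLs are hypothetical and serve only to show that the repository holds descriptions, not data.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class SourceDescription:
    """Description of a data source that remains with its owner (hypothetical)."""
    source_id: str
    owner: str
    location: str          # where the owner serves the data (made-up URLs below)
    physical_format: str   # the owner's own format; no common format is imposed
    topics: FrozenSet[str]
    provenance: str


class MetadataRepository:
    """Holds descriptions only; the data themselves are never copied here."""

    def __init__(self) -> None:
        self._sources: Dict[str, SourceDescription] = {}

    def register(self, description: SourceDescription) -> None:
        """Add a new source simply by adding its metadata description."""
        if description.source_id in self._sources:
            raise ValueError(f"source already registered: {description.source_id}")
        self._sources[description.source_id] = description

    def find(self, topic: str) -> List[SourceDescription]:
        """Return descriptions of all sources covering a topic, ordered by id."""
        matches = (d for d in self._sources.values() if topic in d.topics)
        return sorted(matches, key=lambda d: d.source_id)


repo = MetadataRepository()
repo.register(SourceDescription(
    source_id="car_park_gates",
    owner="Example City Council",
    location="https://data.example.org/carparks",
    physical_format="CSV, one row per gate event",
    topics=frozenset({"parking", "road_traffic"}),
    provenance="entry and exit gate logs",
))
repo.register(SourceDescription(
    source_id="roadside_monitors",
    owner="Example Environment Agency",
    location="https://data.example.org/air",
    physical_format="legacy fixed-width text",
    topics=frozenset({"air_quality"}),
    provenance="roadside pollution monitors",
))

print([d.source_id for d in repo.find("air_quality")])
```

An application would query the repository first, then fetch the data from the owner's own location, interpreting them according to the recorded physical format.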
At present there is a lack of agreed standards for the collection and reporting of different types of data items, leading to problems of consistency and comparability between datasets. Widely available and robust statistical methods for combining data drawn from different sources efficiently and consistently are also lacking. Data quality and quality assurance need better treatment too: currently there are very few commonly agreed data quality indicators or minimum quality standards that apply to transport data. As a result, data quality is largely a matter of 'trusted source', with the inevitable consequence that the quality of data deployed in research and practice varies widely.
We will address all these issues. There are many challenges, notably in implementing a suitable system of access control, such that data providers can be confident that they retain the ability to provide different levels of access to outside users (for example, more access might be granted to a partner organisation than to the general public), and to protect the data integrity. It will also be necessary to provide for large amounts of real-time information, in addition to the more traditional database which is updated at a relatively low rate.
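The idea of graded access can be sketched as follows. This is an illustration under stated assumptions, not a proposed design: the three levels, the class `AccessPolicy`, the organisation names and the field names are all hypothetical, and authentication of the requesting organisation is assumed to have taken place already.

```python
from enum import IntEnum
from typing import Dict, List, Optional


class AccessLevel(IntEnum):
    """Hypothetical access levels, ordered from least to most privileged."""
    PUBLIC = 0
    PARTNER = 1
    OWNER = 2


class AccessPolicy:
    """A data provider's own policy: which level each field requires."""

    def __init__(
        self,
        field_levels: Dict[str, AccessLevel],
        grants: Optional[Dict[str, AccessLevel]] = None,
    ) -> None:
        self.field_levels = dict(field_levels)
        self.grants = dict(grants or {})

    def level_for(self, organisation: str) -> AccessLevel:
        """Organisations without an explicit grant are treated as the public."""
        return self.grants.get(organisation, AccessLevel.PUBLIC)

    def visible_fields(self, organisation: str) -> List[str]:
        level = self.level_for(organisation)
        return sorted(
            name for name, required in self.field_levels.items() if level >= required
        )

    def filter_record(self, organisation: str, record: Dict[str, object]) -> Dict[str, object]:
        """Drop fields the organisation may not see; unlisted fields are denied."""
        allowed = set(self.visible_fields(organisation))
        return {key: value for key, value in record.items() if key in allowed}


policy = AccessPolicy(
    field_levels={
        "site": AccessLevel.PUBLIC,
        "flow": AccessLevel.PUBLIC,
        "journey_id": AccessLevel.PARTNER,
    },
    grants={"partner_org": AccessLevel.PARTNER},
)

record = {"site": "site_a", "flow": 140, "journey_id": "j-001"}
print(policy.filter_record("member_of_public", record))
print(policy.filter_record("partner_org", record))
```

Keeping the policy with the provider, and denying any field it does not list, reflects the requirement that providers retain control over what outside users can see.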
The middleware must be general, but usable software can only be developed by specifying and meeting the requirements of real users. We shall therefore formulate a number of use cases. The intention is to learn early what is required for describing secondary information sources and the processes by which they are derived. Road traffic sources will be used in this task. In addition, the initial collections will include at least one data source derived from rail transport.
The NTDF will be a widely available web-based resource. Potential clients include authorities concerned with congestion monitoring, transport operators, network managers such as Network Rail, the Treasury and the DfT, local government, planning authorities, health authorities, academics and the general public. Users will be authenticated using standard protocols. Provided that a user has the necessary privileges, the NTDF will provide interfaces to the corresponding data sources and will, in the long run, offer specialist tools, for example for data visualisation.