The driving vision behind LarKC is to go beyond the limited storage and inference solutions currently available for semantic computing. To this end, the infrastructure must move beyond current, strictly logic-based paradigms by fusing reasoning with complementary techniques, e.g. from information retrieval.
The overall aim of LarKC is to build an integrated platform for semantic computing on a scale well beyond what is currently possible.
The platform aims to fulfill needs in sectors that depend on massive, heterogeneous information sources, such as telecommunication services, bio-medical research,
and drug discovery. LarKC is based on a pluggable architecture in which it is possible to exploit techniques and heuristics from diverse areas such as databases, machine learning, cognitive science, the Semantic Web, and others. As a result, LarKC makes it possible to integrate logical reasoning with search methods.
We are developing the Large Knowledge Collider, a pluggable algorithmic framework implemented on a distributed computational platform. It allows reasoning at Web scale by trading quality for computational cost and by embracing incompleteness and unsoundness.
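The quality/cost trade-off can be made concrete with a toy example. The sketch below (illustrative only, not LarKC code; the rule format and function name are our own) runs forward chaining over simple triples under a step budget: a small budget returns quickly but incompletely, while a large budget computes the full closure.

```python
# Toy bounded forward chaining over (subject, predicate, object) triples.
# The step budget trades completeness for cost: fewer steps, fewer facts.

def transitive_closure(triples, predicate, budget):
    """Derive facts for a transitive predicate, stopping after
    `budget` inference steps; the result may be incomplete."""
    facts = set(triples)
    steps = 0
    changed = True
    while changed and steps < budget:
        changed = False
        for (s, p, o) in list(facts):
            if p != predicate:
                continue
            for (s2, p2, o2) in list(facts):
                if p2 == predicate and s2 == o and (s, p, o2) not in facts:
                    facts.add((s, p, o2))
                    changed = True
                    steps += 1
                    if steps >= budget:
                        return facts  # budget exhausted: partial answer
    return facts

data = [("a", "partOf", "b"), ("b", "partOf", "c"), ("c", "partOf", "d")]
partial = transitive_closure(data, "partOf", budget=1)   # cheap, incomplete
full = transitive_closure(data, "partOf", budget=100)    # full closure
```

With `budget=1` only one new fact is derived; with a generous budget all three implied `partOf` facts are found. An anytime reasoner of this kind can always report its best answer so far.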
Plug-in Architecture: Instead of being built only on logic, the Large Knowledge Collider can exploit a large variety of methods from other fields: cognitive science (human heuristics), economics (limited rationality and cost/benefit trade-offs), information retrieval (recall/precision trade-offs), and databases (very large datasets). A pluggable architecture ensures a coherent integration of the various components.
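A minimal sketch of what such a plug-in architecture can look like, with hypothetical class names (not the actual LarKC API): every component implements one uniform interface, so heuristic selectors, reasoners, and retrievers can be chained freely.

```python
# Hypothetical plug-in interface: each plug-in transforms data and
# passes it on, so heterogeneous methods compose into one workflow.

from abc import ABC, abstractmethod

class Plugin(ABC):
    @abstractmethod
    def invoke(self, data):
        """Transform the input data and return the result."""

class KeywordSelector(Plugin):
    """Information-retrieval-style plug-in: keep only items mentioning
    a keyword (a recall/precision trade-off, not a logical inference)."""
    def __init__(self, keyword):
        self.keyword = keyword

    def invoke(self, data):
        return [item for item in data if self.keyword in item]

class Pipeline(Plugin):
    """Coherent integration: plug-ins chained into a single workflow."""
    def __init__(self, *plugins):
        self.plugins = plugins

    def invoke(self, data):
        for plugin in self.plugins:
            data = plugin.invoke(data)
        return data

workflow = Pipeline(KeywordSelector("drug"))
result = workflow.invoke(["drug targets", "city maps"])
```

The point of the uniform `invoke` contract is that a selector based on human heuristics and a logic-based reasoner look identical to the framework, so they can be swapped or recombined without touching the rest of the workflow.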
Distributed and Parallel Computing: The Large Knowledge Collider makes use of parallel hardware through cluster-computing techniques. In this way it aims to leverage large-scale, distributed computational resources in order to meet the scalability requirements of current data-driven applications in semantic computing.
Meaning-based computing is the way of the future: some 80 per cent of information within enterprises is unstructured, and making use of this ‘hidden’ intelligence is at the heart of improving the way we interact with information. Some of the most advanced use cases for semantic computing today require processing 10 billion RDF triples or more in near real time. Practical examples are context-sensitive, personalized mobile services in the telecom sector, or the processing of large volumes of medical data and literature.
In general, the Web has already made tremendous amounts of meaningful information available, and the public sector, for example, has recently begun to make more and more data accessible to interested parties (see http://www.data.gov.uk/ or http://www.data.gov/).
Exemplary use cases within the LarKC project further illustrate the challenge posed by these unprecedented amounts of information. In particular, the LarKC use cases focus on the following problems:
Data integration in the pharmaceutical and biotechnology domain: Data integration remains a serious bottleneck to the expected productivity gains in the pharmaceutical and biotechnology domain.
“Linked Life Data” integrates common public datasets that describe the relationships among genes, proteins, interactions, pathways, targets, drugs, diseases and patients, and currently consists of more than 5 billion RDF statements.
The dataset interconnects more than 20 complete data sources and helps to understand the “bigger picture” of a research problem by linking previously unrelated data from heterogeneous knowledge domains.
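The value of such linking can be illustrated with a toy example (placeholder identifiers, not the actual Linked Life Data schema): once two previously separate datasets share an identifier, a single query can cross domain boundaries.

```python
# Two toy datasets that were never designed together, represented as
# (subject, predicate, object) triples. They share the protein ID "PROT:p1".

gene_db = [("GENE:g1", "encodes", "PROT:p1")]
drug_db = [("DRUG:d1", "targets", "PROT:p1")]

# The integration step: one merged triple set spanning both domains.
linked = gene_db + drug_db

def drugs_for_gene(triples, gene):
    """Follow gene -encodes-> protein <-targets- drug across datasets:
    a question neither dataset can answer on its own."""
    proteins = {o for (s, p, o) in triples if s == gene and p == "encodes"}
    return {s for (s, p, o) in triples if p == "targets" and o in proteins}
```

The join happens only because both sources use a shared identifier for the protein, which is exactly what linked-data integration provides at scale.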
Urban Computing: Urban environments are represented on the Web through a large and diverse set of distributed pieces of information: maps, events, points of interest, traffic data, etc. Moreover, local governments are increasingly aware of the value of publishing their data on the Web for public use.
Focusing on the city of Milano in Italy, we have identified several data sources to take into account for our Urban Computing scenario. These data sources are diverse and heterogeneous not only in content, but also in format and availability conditions.
For more information on the datasets published as outcomes of work within the LarKC use cases, see our resources section.
Current reasoning systems do not scale to datasets of this size. A reasoning infrastructure must therefore be designed and built that scales, and that can be flexibly adapted to the varying quality and scale requirements of different use cases.
The Large Knowledge Collider (LarKC) performs massive, distributed, and necessarily incomplete reasoning over web-scale knowledge sources. Massive inference is achieved by distributing problems across heterogeneous computing resources coordinated by LarKC. Some of these distributed resources will run tightly coupled, high-performance inference engines on local parallel hardware before communicating their results back to the distributed computation environment.
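This distribute-then-combine pattern can be sketched as follows (hypothetical helper functions, not the LarKC runtime): the problem is partitioned, each partition is handled by a local inference worker in parallel, and the partial results are merged.

```python
# Sketch of distributed inference: partitions are processed in
# parallel by a stand-in for a local high-performance engine.

from concurrent.futures import ThreadPoolExecutor

def local_inference(partition):
    """Stand-in for a local inference engine: derive a 'type' fact
    for every instance whose class has a 'subClassOf' superclass
    within this partition."""
    derived = set(partition)
    classes = {(s, o) for (s, p, o) in partition if p == "subClassOf"}
    for (s, p, o) in partition:
        if p == "type":
            for (sub, sup) in classes:
                if o == sub:
                    derived.add((s, "type", sup))
    return derived

def distributed_inference(partitions):
    """Fan the partitions out to parallel workers, then merge."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(local_inference, partitions)
    merged = set()
    for result in results:
        merged |= result
    return merged

partitions = [
    [("x", "type", "Drug"), ("Drug", "subClassOf", "Chemical")],
    [("y", "type", "Place")],
]
closure = distributed_inference(partitions)
```

Note that partitioning itself can lose derivations: if a fact and the schema statement it matches land in different partitions, no worker can combine them. This is one concrete reason why web-scale reasoning is "necessarily incomplete" without careful coordination.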
To realize this platform for large-scale reasoning, LarKC does not only perform deductive inference over a given set of axioms. Instead, it employs different methods and heuristics, integrated through a pluggable architecture, which can be combined and coordinated in complex workflows. LarKC then allocates resources strategically and tactically to the different plug-ins that interact to achieve a specific task.
Researchers can design and experiment with multiple realizations for each of these components, which can be roughly categorized according to the functionality they perform within the LarKC platform:
Integrating Reasoning and Search
- Where do the axioms and data that contribute to a solution come from?
- How can that information be abstracted into the forms needed by further, heterogeneous components?
- Which parts of the knowledge and data are required?
- When is an answer “good enough” or “best possible” for a specific purpose?
- What can be derived from the information automatically, using deductive, non-deductive, and other kinds of inference?
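These questions roughly correspond to successive stages of a workflow. The sketch below wires such stages into one minimal pipeline; the stage names and toy data are our own illustrative choices, not LarKC's official plug-in taxonomy.

```python
# One hypothetical stage per question, composed into a single workflow.

def identify(sources):          # where do the axioms and data come from?
    return [t for src in sources for t in src]

def transform(triples):         # abstract into one common form
    return [tuple(x.lower() for x in t) for t in triples]

def select(triples, topic):     # which parts of the data are required?
    # `topic in t` tests element membership in the triple, not substrings.
    return [t for t in triples if topic in t]

def reason(triples):            # what can be derived automatically?
    derived = list(triples)
    for (s, p, o) in triples:
        if p == "subclassof":
            derived += [(s2, "type", o) for (s2, p2, o2) in triples
                        if p2 == "type" and o2 == s]
    return derived

def decide(answers, enough):    # when is an answer good enough?
    return answers if len(answers) >= enough else None

sources = [[("Aspirin", "type", "Drug")], [("Drug", "subClassOf", "Chemical")]]
answer = decide(reason(select(transform(identify(sources)), "drug")), enough=1)
```

Because each stage has a plain input/output contract, a researcher can swap in a different realization of any one stage, e.g. a statistical selector instead of a keyword match, without disturbing the rest of the workflow.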