Data Science

Industry-relevant data comes in many forms, from structured sources (e.g., databases) to unstructured sources (e.g., natural language documents intended to be read by humans). Access to all this data is only as useful as the methods available for evaluating it and using it to make decisions.

This research program employs natural language processing, machine learning, and statistics to extract as much information as possible from both unstructured and structured data sources. In particular, it focuses on developing novel approaches for extracting information from data while taking into account its structure and semantics.

Accessing data, which is the overall theme of SIRIUS, is only useful if we can use this data to make decisions. For this reason, SIRIUS has partnered with the BigInsight SFI to form the DataScience@UiO Cluster. We have also employed Basil Ell to lead our data science research program. In 2018 we also linked SIRIUS’ research in language technology and data science into a single research program.

Data Science is a broad and poorly defined field. SIRIUS’ focus is on domain-adapted data science. Our aim is to contribute in three narrow areas where our mix of skills can advance research and help the oil and gas industry. These are:

  • Extracting data from unstructured sources (documents that are designed to be read by people) so that it can be used by machine learning and data mining. These standard data science methods need structured data. Our research looks at how we can convert text, written in a natural language, into structured data. An example of how we do this is given in Extracting Information from Tables to Populate Knowledge Bases.

 

  • Interpreting text that uses oil & gas language. Unfortunately, oil & gas language differs from normal literary Norwegian or English. This means that language processing tools need to be adapted to the special vocabulary of oil & gas documents.

 

  • Using all the structure in structured data in our data science. Data from business, engineering and operations has a rich structure. Things are linked together. Invoices are linked to deliveries, bills of materials and work orders. Flow measurements on an oil platform are linked together by the flow of hydrocarbons through the systems. This structure is represented by graphs: networks or trees of connections between data. Standard machine learning tools, however, deal only with vectors, that is, long lists of numbers. This means that they cannot take account of the information that is available in the linkages between data.
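
The contrast between vectors and graphs in the last point can be made concrete with a small sketch. The record names, amounts and links below are invented for illustration and do not come from any SIRIUS dataset; the two feature functions are hypothetical helpers, not part of any SIRIUS tool. The sketch only shows that a flat vector built from a record's own attributes carries no trace of its links, while even a simple graph-aware feature can.

```python
from collections import defaultdict

# Hypothetical records: each has a kind and one numeric attribute.
records = {
    "invoice-1": {"kind": "invoice", "amount": 1200.0},
    "delivery-1": {"kind": "delivery", "amount": 700.0},
    "delivery-2": {"kind": "delivery", "amount": 500.0},
    "workorder-1": {"kind": "work_order", "amount": 0.0},
}

# Hypothetical links between records (undirected edges of the graph).
links = [
    ("invoice-1", "delivery-1"),
    ("invoice-1", "delivery-2"),
    ("delivery-1", "workorder-1"),
]


def build_adjacency(edges):
    """Map each record id to the set of record ids it is linked to."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def flat_features(record_id):
    """Vector that ignores links: only the record's own amount."""
    return [records[record_id]["amount"]]


def graph_features(record_id, adjacency):
    """Own amount, number of linked records, and the sum of their amounts."""
    neighbours = sorted(adjacency[record_id])
    neighbour_total = sum(records[n]["amount"] for n in neighbours)
    return [records[record_id]["amount"], float(len(neighbours)), neighbour_total]


adjacency = build_adjacency(links)
print(flat_features("invoice-1"))              # [1200.0]
print(graph_features("invoice-1", adjacency))  # [1200.0, 2.0, 1200.0]
```

In this toy case the graph-aware vector shows that the invoice is linked to two deliveries whose amounts add up to the invoiced total, which the flat vector cannot express.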

In 2018 the language technology group in SIRIUS demonstrated a system for extracting relations from scientific text. This tool was built with convolutional neural networks using semi-supervised learning and structured domain knowledge. The tool was a top performer in the SemEval international shared task, a research competition on relation extraction from scientific texts, where our system ranked third out of 28 participants. Farhad Nooralahzadeh is working further with this tool to develop algorithms and a prototype implementation that deals with the complexity of realistic exploration reports.

This work depends on domain-specific word embeddings. The semantic properties of technical terms need to be captured and put in a form where they can be used by extraction tools. SIRIUS released such a tool in May 2018. Its results were validated by geoscientists from the Department of Geosciences in Oslo. It was also validated against Schlumberger’s proprietary glossary of oil and gas terms and the GeoSciML (http://www.geosciml.org) standard.
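
As a rough illustration of what a word embedding makes available to an extraction tool, the sketch below compares terms by cosine similarity. The three-dimensional vectors are toy values made up for this example; they are not taken from the SIRIUS embeddings, and real embeddings typically have hundreds of dimensions. The function names are likewise hypothetical.

```python
import math

# Toy vectors invented for illustration only (not real embedding values).
embeddings = {
    "sandstone": [0.9, 0.1, 0.0],
    "shale": [0.8, 0.3, 0.1],
    "invoice": [0.0, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_terms(term, k=2):
    """Return the k other terms most similar to `term`, best first."""
    query = embeddings[term]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != term
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


print(nearest_terms("sandstone"))
```

With these toy values, "shale" comes out as the closest term to "sandstone" and "invoice" as the most distant. A domain-adapted embedding aims to produce this kind of ordering for real technical vocabulary.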