Services for the European Open Science Cloud
The demos will be presented on Wednesday, 10th April 2019 at 10:40 am CET in room Prague A+D as part of the main opening plenary, and showcased in the Prague foyer according to the schedule below:
The latest technologies for molecular imaging at state-of-the-art photon facilities like the European XFEL deliver hundreds of petabytes of data per year, challenging established data processing strategies. By integrating collaborative platforms for scientific computing on fast-growing cloud infrastructures like the European Open Science Cloud, DESY develops innovative, flexible and scalable storage and compute services. Covering the entire data life cycle from experiment control to long-term archival, a particular focus on the re-usability of methods and results leads to an integrated approach that bundles data, functions, workflows and publications.
Scientists develop and deploy micro-services through shared container registries as cloud functions, thereby preserving software environments, configurations and algorithm implementations for research and publications. This Function-as-a-Service approach leverages efficient, auto-scaling provisioning of cloud resources for compute-intensive and repeatedly used code, from lambda functions to highly specialized applications.
The backend storage system dCache is designed to scale up to peta-scale throughput and to provide storage events that enable direct automation on production systems. Code executed in response to incoming files immediately extracts metadata, updates data catalogues, feeds into monitoring and accounting systems and creates derived data sets.
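The event-driven pattern described above can be sketched in a few lines. The event shape and field names below are illustrative assumptions, not dCache's actual storage-event schema, and the handler is a hypothetical cloud function:

```python
import json

# Hypothetical storage event, modeled on the kind of notification a
# storage system can emit when a new file arrives (field names are
# illustrative, not dCache's actual schema).
event = {
    "event": "IN_CREATE",
    "path": "/xfel/run_0042/frame_000123.h5",
    "size": 1048576,
}

def extract_metadata(evt):
    """Derive catalogue entries from a file-arrival event."""
    parts = evt["path"].strip("/").split("/")
    return {
        "instrument": parts[0],
        "run": parts[1],
        "file": parts[-1],
        "bytes": evt["size"],
    }

def handle(evt, catalogue):
    """Function-as-a-Service style handler: extract metadata for the
    incoming file and update the data catalogue."""
    meta = extract_metadata(evt)
    catalogue[evt["path"]] = meta
    return meta

catalogue = {}
meta = handle(event, catalogue)
print(json.dumps(meta))
```

In production such a handler would run as an auto-scaled cloud function invoked once per storage event, with the catalogue being an external service rather than a dictionary.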
In the eXtreme-DataCloud (XDC) project, DESY demonstrates that event-driven function execution as a service adds a flexible building block to data life-cycle management and smart data placement strategies. Enforcing machine-actionable Data Management Plans (DMPs), rule-based data management engines and file transfer systems consume events, e.g. to create replicas of data sets with respect to data locality and Quality of Service (QoS) for storage. On data ingestion, files can be copied to cloud storage buffers where strong, auto-scaling clusters of compute elements carry out data processing pipelines and update data placement rules on success. Eventually, this triggers their automated distribution to offline storage and long-term archival.
With a focus on metadata and data interoperability, pipelines span from photon science into domain-specific analysis and simulation tools, e.g. in structural biology and material sciences. Well-defined interfaces allow users to combine functions from various frameworks and programming languages. Where data connectors or format converters are needed, scientists can deploy their own solutions as additional micro-services and programmable interfaces.
This live demonstration addresses both the user and the provider perspective on a decoupled, cloud-based, micro-service-oriented architecture and illustrates how to share code as functions and continuously integrate it into automated data processing pipelines as well as interactive workflows.
Authors: Abdulrahman Azab, Milen Kouylekov, Leon Charl du Toit, Eirik Haatveit, Jaakko Leinonen, Antti Pursula, Maria Francesca Iozzi
Research often involves the use of personal data as a basis for scientific analysis. A particular challenge in this area is to use these data resources without violating privacy, which calls for secure digital infrastructures compliant with both national and European regulations.
The EOSC-hub project provides services for sensitive data through two partners: Sigma2 / the University of Oslo in Norway, and CSC in Finland.
TSD (Services for Sensitive Data) is the Norwegian e-Infrastructure for sensitive data storage and management provided by the University of Oslo. TSD provides sensitive data services directly to researchers and groups in the form of SaaS (Software as a Service) and PaaS (Platform as a Service). Each TSD project has a separate VLAN and is accessed through two-factor authentication. Currently there are more than 670 projects in TSD. TSD supports: data storage, web forms, High Performance Computing (HPC), audio/video streaming and analysis, and software management for Windows and Linux platforms.
The demo will include an introduction to sensitive data and EOSC-hub sensitive data services, and live demos of using TSD (Services for Sensitive Data) and containerised tools for analysing sensitive data:
Abdulrahman Azab, Eirik Haatveit, Leon Charl du Toit, Jaakko Leinonen
Authors: Enikő Nagy, Péter Kacsuk
An Apache Spark cluster together with HDFS (Hadoop Distributed File System) represents one of the most important tools for Big Data and machine learning applications, enabling the parallel processing of large data sets on many virtual machines running Spark workers. On the other hand, setting up a Spark cluster with HDFS on clouds is not straightforward, requiring deep knowledge of both the cloud and the Apache Spark architecture. To spare scientists this hard work, we have created and made public the required infrastructure descriptors by which the publicly available Occopus cloud orchestrator can automatically deploy Spark clusters with the number of workers specified by the user.
One of the most typical application areas of Big Data technology is statistical data processing, which is usually done in the programming language R. In order to facilitate the work of statisticians using Spark on MTA Cloud, we have created an extended version of the Spark infrastructure descriptors, placing the sparklyr library on the Spark workers, too. Finally, we have also integrated the user-friendly RStudio user interface into the Spark system. As a result, researchers using the statistical R package can easily and quickly deploy a complete R-oriented Spark cluster on clouds, containing the following components: RStudio, R, sparklyr, Spark and HDFS.
Spark also provides a special library called “Spark MLlib” for supporting machine learning applications. Similarly to the R-oriented Spark environment, we have developed the infrastructure descriptors for the creation of a machine learning environment in MTA Cloud. Here, the programming language is Python and the user programming environment is Jupyter. The complete machine learning environment consists of the following components: Jupyter, Python, Spark and HDFS. Deploying this machine learning environment is also done automatically by Occopus, and the number of Spark workers can be defined by the user. The supported cloud types are AWS, OpenStack, OpenNebula and CloudSigma.
In the demo we show how to deploy a Spark cluster environment with Jupyter, Python and HDFS in OpenStack clouds, and how this environment is used to categorize newspaper articles.
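To give a feel for the categorization task, here is a deliberately tiny, pure-Python stand-in: it scores an article against per-category keyword sets and picks the best match. In the actual demo, a Spark MLlib pipeline (tokenization, feature extraction, a trained classifier) would do this at scale over HDFS; the category names and keywords below are invented for illustration:

```python
from collections import Counter

# Toy per-category vocabularies (invented for illustration; a real
# pipeline would learn features from training data instead).
CATEGORIES = {
    "sports": {"match", "goal", "team", "season"},
    "politics": {"election", "parliament", "minister", "vote"},
}

def categorize(article: str) -> str:
    """Assign the category whose vocabulary overlaps most with the text."""
    tokens = Counter(article.lower().split())
    scores = {
        cat: sum(tokens[word] for word in vocab)
        for cat, vocab in CATEGORIES.items()
    }
    return max(scores, key=scores.get)
```

The Spark version distributes exactly this kind of per-document work across the workers deployed by Occopus, which is why the cluster size matters for large corpora.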
Authors: Amir Kamran, Ondřej Košarko, Jozef Mišutka, Pavel Straňák
CLARIN-DSpace is an enhanced fork of the DSpace repository software. Out of the box it is tailored to suit the needs of a language data repository, but it has also been deployed for contemporary history or film archives, and is not limited to a specific scientific field.
The provided workflows make integration with third-party services effortless, for example auto-suggesting grants from OpenAIRE, or reporting to the CLARIN Virtual Language Observatory, the Clarivate Data Citation Index, or OpenAIRE.
Not only is the submission workflow preconfigured to require the necessary metadata, it also provides a guide for Open Access licensing by integrating the Public License Selector. And because not all data can be open, there is also support for submitters to assign custom licenses to their datasets, for users to sign the licenses, and for repository managers to manage and keep track of all of it.
When the software is configured to connect with the Piwik (Matomo) analytics platform, the submitters of data are provided with concise periodic reports about the popularity of their submissions.
To offer an additional layer of protection, the system can easily be configured to automatically back up submissions via the EUDAT B2SAFE service.
Authors: James DesLauriers, Tamas Kiss, Jozsef Kovacs, Gregoire Gesmier, Hai-Van Dang, Gabriele Pierantoni, Gabor Terstyanszky, Peter Kacsuk
Many scientific and commercial applications require access to computation, data or networking resources based on dynamically changing requirements. Users and providers both require these applications or services to dynamically adjust to fluctuations in demand and serve end-users at a required quality of service (performance, reliability, security, etc.) and at optimized cost. This may require resources of these applications or services to automatically scale up or down.
The European-funded H2020 COLA (Cloud Orchestration at the Level of Application) project set out to design and develop a generic framework that supports the automated scalability of a large variety of applications. Learning from previous similar efforts, and with the aim of reusing existing open source technologies wherever possible, COLA proposed a modular architecture called MiCADO (Microservices-based Cloud Application-level Dynamic Orchestrator) to provide optimized deployment and run-time orchestration for cloud applications.
MiCADO is completely open source and is built from well-defined building blocks, both on the MiCADO Master and on the MiCADO Worker Nodes, implemented as microservices. This modular design supports various implementations where components can be replaced relatively easily with alternative technologies. The current implementation uses widely applied technologies, such as Kubernetes as the container orchestrator, Occopus as the cloud orchestrator, and Prometheus as the monitoring system.
The user-facing interface of MiCADO is a TOSCA (Topology and Orchestration Specification for Cloud Applications, an OASIS standard) Application Description Template, which describes the desired container and virtual machine topology and its associated scalability and security policies. This interface has the potential to be embedded in existing GUIs, custom web interfaces or science gateways.
MiCADO has been tested on a range of small to large-scale research activities that saw different computations, simulations and other experiments carried out on a number of different public and private clouds. Support for a number of research and industry use-cases has been added throughout the development of MiCADO, and as a mature framework, it will be a useful and important addition to the European Open Science Cloud Hub.
The two main targeted application types are cloud-based services, where scalability is achieved by scaling the number of containers and virtual machines up or down based on load, performance and cost, and, in a second category, the execution of a large number of (typically parameter-sweep style) jobs, where a certain number of tasks need to be executed by a set deadline.
We propose a short demo of MiCADO demonstrating virtual machine and container auto-scaling for both types of targeted applications. The load and performance-based scaling demonstration will see a resource-intensive application deployed, to which MiCADO will respond by scaling up the infrastructure to meet the demand. In the deadline-based scaling example, an experiment made up of a large number of parameterized jobs will be submitted to a distributed task queue (JQueuer). MiCADO will handle scaling the container and virtual machine infrastructure so that JQueuer may execute and complete the total number of jobs before the set deadline is reached.
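The core of a deadline-based policy is a capacity estimate: given the remaining jobs, the average job duration and the time left, how many parallel workers are needed? The sketch below is our own simplified model of that calculation, not MiCADO's actual scaling logic:

```python
import math

def workers_needed(jobs_remaining: int, avg_job_seconds: float,
                   seconds_to_deadline: float, max_workers: int) -> int:
    """Estimate how many parallel workers are required to finish all
    remaining jobs before the deadline (simplified: jobs are assumed
    independent and uniform in duration)."""
    if seconds_to_deadline <= 0:
        # Deadline passed or imminent: scale out as far as allowed.
        return max_workers
    required = math.ceil(jobs_remaining * avg_job_seconds / seconds_to_deadline)
    # Keep at least one worker, and respect the resource ceiling.
    return min(max(required, 1), max_workers)
```

A scaling loop would re-evaluate this periodically as jobs complete, shrinking the cluster when the required worker count drops.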
Authors: André Moreira, Twan Goosen, Dieter Van Uytvanck, Willem Elbers
CLARIN offers its community a wide range of tools to discover, explore and process language-related resources in an integrated manner. Three of these tools are being further developed by CLARIN ERIC in the context of EOSC-hub: the Virtual Collection Registry (VCR, https://collections.clarin.eu), the Virtual Language Observatory (VLO, https://vlo.clarin.eu) and the Language Resource Switchboard (LRS, https://switchboard.clarin.eu).
The Virtual Collection Registry is a web application where scholars can create and publish virtual collections for manual access (using a web-browser) as well as automated processing (e.g. by a web service). A virtual collection is a coherent set of links to digital objects (e.g. annotated text, video) that can be easily created, accessed and cited. These links can originate from different archives, hence the term ‘virtual’.
The Virtual Language Observatory is a metadata-based portal for language resources. It was developed as a means to explore the linguistic resources, services and tools available within CLARIN and related communities. It aims to provide a uniform search and discovery process for a large number of resources from a wide variety of domains and providers, and is completely based on the Component Metadata (CMDI) standard and semantic mapping through the CLARIN Concept Registry.
The Language Resource Switchboard aims at helping users connect resources with the tools that can process them. The LRS lists all applicable tools for a given resource, lists the tasks the tools can achieve, and invokes the selected tool in such a way that processing can start immediately, with little or no prior tool parameterization. The LRS can be called directly from the VLO or the VCR, as well as from EUDAT's B2DROP data exchange service.
In this demo we will demonstrate some of the workflows made possible by the tight integration of these three applications, while showcasing the main features of each of them. We invite you to get in touch if you are interested in using these services for your own community.
Authors: Miguel Caballer, Amanda Calatrava, Ignacio Blanquer, Francisco Brasileiro
Federated clouds address many problems of scientific and industrial applications, such as legal restrictions and efficient data access. Despite these benefits, their natural geographical distribution and complex software stack challenge the developers and operators aiming to harness this kind of infrastructure.
The infrastructure used in this demo is comprised of multiple, geographically distributed cloud providers. The seamless federation of these providers is implemented through the use of the Fogbow middleware. It consists of a federated deployment of several sites on both sides of the Atlantic Ocean that integrates a small amount of resources, enough for the validation of the deployment services.
This demo shows the work developed in the ATMOSPHERE project to deploy self-managed elastic clusters on federated clouds using the EC3 tool. In particular, a Mesos cluster will be shown in the demo, but other types of clusters are supported (Torque, SLURM, Kubernetes, etc.). Our approach can automatically scale the managed resources up and down and secure communication and execution across different IaaS providers. In particular, two different cloud providers are used. Initially, only the front-end node is launched in the first provider, using the EC3 client. As requested by the cluster workload, new worker nodes (WNs) are deployed in the first provider. When the number of WNs in the first provider reaches a predetermined threshold (specified by the user), subsequent WNs are deployed in the second provider. To enable connectivity among all the VMs in the infrastructure, a federated network is created. It not only enables connectivity without needing public IPs on the WNs, but also encrypts the communications to ensure security.
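The overflow placement policy described above is simple enough to sketch directly. This is our own illustration of the "fill provider one up to the threshold, then spill into provider two" rule, with made-up provider names, not EC3 code:

```python
def place_new_worker(workers_p1: int, threshold_p1: int) -> str:
    """Decide where the next worker node (WN) goes: fill the first
    provider up to the user-defined threshold, then overflow to the
    second provider."""
    return "provider1" if workers_p1 < threshold_p1 else "provider2"

def scale_out(current: dict, threshold_p1: int, new_nodes: int) -> dict:
    """Simulate deploying `new_nodes` workers one by one and return the
    resulting per-provider placement."""
    placement = dict(current)
    for _ in range(new_nodes):
        target = place_new_worker(placement["provider1"], threshold_p1)
        placement[target] += 1
    return placement
```

For example, with a threshold of 3 in the first provider, scaling out by 5 nodes from an empty cluster places 3 workers in the first provider and 2 in the second.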
Authors: Matteo Longoni, Antonella Mauri
It is fairly common for top-level athletes, both in individual and in team specialties, to record their training sessions in order to review them afterwards, find their weak spots and address the improvement areas with specific training. Continuous improvement in sports also implies having a means to keep track of an athlete's performance over time. As of today, that is chiefly done manually: it is mostly based on the individual athlete's or coach's interpretation of what can be viewed in a video, and on the accuracy with which the observations are recorded for future use. The amount of data can be huge, since every athlete of every sport, or even every single player per team, can be filmed at every training session (at least at 50 frames per second), and from every frame many patterns can be identified and tracked. Moreover, to optimize these movements, the athlete's performance must be monitored throughout the season. The goal is to answer the market's need for a powerful and smart video processing tool that extracts KPIs in a data-driven and automated way: instead of manually inspecting the huge number of videos that a complete analysis may require, the big data are processed automatically by advanced algorithms and methods (such as functional data analysis) to extract KPIs into standard reports, minimizing user intervention and maximizing the efficacy of the analyses.
The demo will show how the final users, first and foremost sports managers and coaches, will be able to upload training session videos and metadata taken during the shooting sessions and, thanks to the automated video processing, get back the training session results in a matter of minutes instead of hours of video inspection. The results will show how an athlete performs a certain motion (e.g. the serve in tennis or volleyball) in relation to the motion outcome, and provide indicators that will help coaches and the athletes themselves to improve. Users will have access to all the training session videos and reports, and they will be able to share them with other users.
Authors: Fenareti Lampathaki, Maria-Jose Nunez
With data analytics and business intelligence reaching into every corner of industry, today’s knowledgeable and demanding customers interact with companies, casually refer to their perceptions of consumed products and services, and make purchasing decisions very differently than they did in the past. Consumer feedback regarding offered products and services has in fact always been greatly appreciated by manufacturers, who seek to leverage it in order to gain insights into their customers’ preferences and adjust future products and services to their needs and desires. Despite the tremendous data deluge our world witnesses, the furniture domain still suffers from several inherent challenges: (a) generic social media analytics tools monitor global trends and do not provide actionable insights to the domain’s SMEs; (b) the furniture SMEs’ online presence is limited, and therefore the current content is biased towards larger brands; (c) trend prediction methods cannot easily distinguish between promotional and genuine content while facing significant difficulties when it comes to image recognition.
The demo will provide a walk-through of the functionalities of the furniture analytics platform-as-a-service that collects, analyzes and visualizes publicly available online content (from social media platforms and blogs) related to furniture. The demo will show how furniture manufacturers can navigate intuitive dashboards that have been created and curated by domain experts in order to gain useful furniture-related insights, detect relevant furniture product-service topics/features along with their prevalent emotion, monitor brands and customer interactions, and predict furniture trends early for the upcoming seasons. The demo will also focus on how domain experts use the platform to consolidate and share the end-user dashboards: (a) different comparisons in time and in content (between trends or competitors, for example) will be demonstrated; (b) how the discussions and weak signals from other “neighbouring” domains influence the furniture domain and its future trends will be explained.
Authors: Adam Majewski, Krzysztof Gibas, Maciej Hłasko, Bartłomiej Pysiak
Guardomic is a bot mitigation engine aimed at web service owners who want to protect their websites from bot traffic, and their users from fraudulent digital ads or cryptocurrency web mining. A big part of the system is in-depth statistics, allowing clients to be aware of the nature of their traffic.
Recent years have shown an increasing number of bot attacks on the global network. For example, in the online advertising business alone, the waste caused by bots was estimated at $7.2 billion in 2016 (source: Association of National Advertisers and White Ops).
The problem of bots on the Internet is growing very fast. Online businesses suffer financial losses because of web attacks, and all web-based services are exposed to attacks originating from bot networks. Preventing these types of attacks is crucial from a business perspective, yet there is a lack of tools that can provide a clear picture of the threats and bots detected on websites. The market therefore needs a solution to protect against bot attacks. Guardomic is an innovative solution that protects online services from botnet attacks such as web scraping, online fraud, digital ad fraud, web application attacks and spam. It also allows customers to block unwanted traffic (e.g. from a specified country, a specified ISP, an IP range, etc.).

The demo will show how individual users or companies can significantly decrease the financial losses caused by bot traffic on their domains. It will prove this by showing how the system monitors all types of network traffic and provides the ability to react and to increase domain effectiveness, performance and security. This includes monitoring traffic incoming from links generated by advertising campaigns; statistical analysis gives an in-depth view of the distribution between humans and bots for each campaign, which in turn makes it possible to increase the conversion rate. The demo will also show how users can mitigate artificial traffic by enabling certain features: protection against cross-site scripting, brute-force login attacks and a variety of others. A statistics dashboard will help detect the origin of traffic that is decreasing domain performance and block it effectively. The presentation will then elaborate on how users can protect their domains by turning on security-increasing headers, CAPTCHA protection, masking server details or even blocking certain countries.
Authors: Agustín P. Monteoliva Herreras, Alberto Criado Delgado, Fernando Aguilar Gómez
Harmful Algal Blooms (HABs) happen when toxic microalgae proliferate beyond control and take over rivers, lakes or ponds, with costly environmental and socioeconomic impacts, for example on fisheries or on the availability of drinking water. At sea, this phenomenon causes red tides. Blooms and red tides are caused by a combination of meteorological, hydrodynamic and biogeochemical factors that are difficult to pin down with certainty. For these reasons, managing algal blooms is a challenge for local governments, environmental agencies and the people that depend on healthy water bodies for their livelihood. Despite the investment in waste management and monitoring systems, current methods and processes are still far from ideal. Ecohydros believes that deploying new technologies and big data analytics can pave the way for better and more efficient ways to manage harmful algal blooms.

Extracting meaningful information from monitoring data is a computational challenge. The data cover hundreds of variables and parameters that need to undergo treatment, processing and analysis before they can be used in visualization tools. The predictive models also require calibration in the short and medium term in an at least semi-automatic way, performing sweeps of numerous parameters in multiple combinations, which forces the system to run highly demanding iterations. All this leads to a demand for computing beyond what a standard company or a standard computer center can provide.
The demo will show how cloud-computing-based solutions enable a system to manage not only the data ingestion from diverse sources but also the modeling of aquatic ecosystems. First of all, the data ingestion part will show how data coming from different sources (Sentinel-2, Landsat, AEMET) are gathered. After downloading, the data are stored in Onedata and metadata is automatically attached. These metadata are indexed and enable queries to find the datasets during the modeling part. The datasets to be downloaded can be selected in a Jupyter notebook form (which has access to the Onedata space), as well as the type of data to find and the model to run. When the modeling is selected, the data stored in Onedata are found (using a date range and location) and, after preprocessing, used to feed the model. The models run using cloud computing resources and the output is also stored in Onedata, so the Jupyter notebook interface accesses the generated data directly.
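The metadata-driven lookup at the heart of the modeling step can be illustrated with a small in-memory stand-in. The record fields and the query helper below are our own illustration of the idea, not the Onedata API:

```python
from datetime import date

# Illustrative stand-in for the indexed metadata attached to each
# downloaded dataset (field names are assumptions, not Onedata's).
datasets = [
    {"source": "Sentinel-2", "date": date(2019, 3, 1), "location": "reservoir-A"},
    {"source": "Landsat", "date": date(2019, 3, 15), "location": "reservoir-A"},
    {"source": "AEMET", "date": date(2019, 2, 1), "location": "reservoir-B"},
]

def find_datasets(start: date, end: date, location: str) -> list:
    """Select the datasets that feed the model for a given date range
    and site, mirroring the date-range-and-location query described
    above."""
    return [d for d in datasets
            if start <= d["date"] <= end and d["location"] == location]
```

In the real service the index lives alongside the Onedata space and the query runs from the Jupyter notebook before preprocessing feeds the model.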
Authors: J. Rogeiro, A. Oliveira, J. Teixeira, A. Fortunato, A. Azevedo, P. Lopes, J. Gomes, J. Pina, M. David, S. Bernardo, M. Rodrigues
Forecast systems are fundamental components of emergency response and routine management of coastal regions. They provide coastal managers and all entities that have responsibilities on the coast with accurate and timely predictions on water conditions (e.g., water levels and velocities, wave characteristics), supporting multiple uses such as navigation, water monitoring, port operations, dredging works and construction activities on the coast.
In the scope of the EOSC-hub project, a new thematic service for the generic deployment of forecast systems at user-specified locations was developed by LNEC, LIP, CNRS/LR and UC. Named OPENCoastS, this service builds on-demand circulation forecast systems for user-selected sections of the coast and maintains them running operationally for the time frame defined by the user. This daily service generates forecasts of water levels and 2D velocities (and wave parameters in the near future) over the spatial region of interest for periods of 48 hours, based on numerical simulations of all relevant physical processes. The OPENCoastS.pt service takes advantage of two e-infrastructures for computational and storage resources: the National Advanced Computing Infrastructure – INCD (integrated in the National Roadmap for Infrastructures of the Foundation for Science and Technology of Portugal) and IFCA (Institute of Physics of Cantabria, Spain). OPENCoastS is supported by the EGI computational resources through the H2020 EOSC-Hub project, being available as one of its thematic services (https://www.eosc-hub.eu/catalogue/OPENCoastS).
The architecture for the OPENCoastS.pt service includes:
- the user interface component, a web based portal;
- the computation component, where simulation results are generated and post-processed;
- the archive component, responsible for preserving all relevant data.
In this demo we will showcase the main functionalities of the OPENCoastS web interface and show users how to set up a new forecast deployment using this interface. Resources will include a video on how to use the service and a team member onsite to help users work with the interface on their own laptops.
Authors: M. Antonacci, A. Ceccanti, G. Donvito, Álvaro López García
The INDIGO-DataCloud PaaS Orchestrator service is now ready to be part of the EOSC-hub Service Catalogue, providing users with advanced orchestration and scheduling capabilities. The PaaS Orchestrator coordinates the provisioning of virtualized compute and storage resources on Cloud Management Frameworks, both private and public (such as OpenStack, OpenNebula, AWS, etc.), and the deployment of dockerized services and jobs on Mesos clusters. It receives deployment requests, expressed through templates written in TOSCA, and deploys them on the best available cloud site. In order to select the best site, the Orchestrator implements a complex workflow: it gathers information about the SLAs signed by the providers and monitoring data about the availability of the compute and storage resources. Using the PaaS Orchestrator and the TOSCA templates, end users can exploit computational resources without knowledge of the IaaS details. Moreover, in the framework of the ongoing H2020 projects, and in particular DEEP-HybridDataCloud, the functionalities of the PaaS Orchestrator are being enhanced to address the emerging needs of the scientific communities, for example the possibility to exploit specialized hardware resources (namely GPUs).
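The best-site selection can be pictured as a ranking over eligible sites. The sketch below is a simplified model of that workflow in our own terms; the field names, the scoring rule and the example sites are all assumptions, not the Orchestrator's actual algorithm:

```python
def rank_sites(sites: list, cpus_needed: int, gpus_needed: int) -> list:
    """Rank cloud sites for a deployment request: filter out sites that
    cannot satisfy the resource requirements, then prefer a better SLA
    rank and, as a tie-breaker, higher monitored availability."""
    eligible = [
        s for s in sites
        if s["free_cpus"] >= cpus_needed and s["free_gpus"] >= gpus_needed
    ]
    # Lower SLA rank is better; negate availability so higher wins.
    return sorted(eligible, key=lambda s: (s["sla_rank"], -s["availability"]))

# Hypothetical site inventory built from SLA and monitoring data.
sites = [
    {"name": "site-a", "sla_rank": 1, "availability": 0.95,
     "free_cpus": 8, "free_gpus": 0},
    {"name": "site-b", "sla_rank": 2, "availability": 0.99,
     "free_cpus": 64, "free_gpus": 4},
]

# A GPU deployment can only land on the site that actually has GPUs.
best = rank_sites(sites, cpus_needed=4, gpus_needed=1)[0]
```

The real Orchestrator combines SLA and monitoring inputs through a richer workflow, but the filter-then-rank shape is the essence of "deploy on the best available cloud site".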
In this demo you will see how a user can easily submit the deployment of an application requiring GPUs, using either virtual machines or Docker containers, through the PaaS Orchestrator. A simple and intuitive web interface is used to submit the deployment requests to the PaaS stack: the user is not required to know any technicalities of the TOSCA template language or of the interaction with the PaaS, and is guided through the deployment submission process with user-friendly web pages. The authentication and authorization aspects are managed by the INDIGO IAM service, which can federate different Identity Providers and offer different authentication mechanisms: social login (e.g. Google), eduGAIN, EGI Check-in, X.509 certificates, etc. After authenticating through the preferred IdP, e.g. EGI Check-in, the user can select the template to be used for the deployment, looking at the description and the input parameters needed to run the deployment, including the required resources in terms of CPUs, GPUs and RAM. Upon form submission, the Orchestrator receives the deployment request and performs the workflow to select the best resource provider and coordinate the provisioning and configuration of the required compute and storage resources. The demo will show how the PaaS Orchestrator can be used to schedule the user's deployment requests on sites of the EGI Federated Cloud.
Through the web interface the user can monitor the deployment status and finally get the endpoint to access the deployed application. As soon as the deployment is no longer needed, the user can click the “delete” button to release all the resources allocated for it. At any moment, the list of deployments created by the user is accessible through the web interface.