KPI collection

A key performance indicator (KPI) is a measurable value that demonstrates how effectively a company is achieving key business objectives. Organizations use KPIs to evaluate success at reaching targets.

The project's KPI list:

https://docs.google.com/spreadsheets/d/1WDA5hatG7NLCzBTpIoVFM_yiRU_Nn6kTsbeYyPXs03A/edit#gid=1922047057

Data backends

Ultimately, the KPIs are a set of queries run against the different data backends (databases) used by the project (a direct-query sketch follows this list). Among them:

  • catalogue_production: PostgreSQL database used by the catalogue service.

    • Data model: https://interlink-project.github.io/interlink-project/collaborativeenvironment/datamodel/datamodel.html#catalogue-service

  • coproduction_production: PostgreSQL database used by the coproduction service.

    • Data model: https://interlink-project.github.io/interlink-project/collaborativeenvironment/datamodel/datamodel.html#coproduction-service

  • elasticsearch6: Elasticsearch (version 6) database that stores the user activity logs. The backend-logging service connects to this database and stores the data received through its API. An old version (6) is used because Dremio is not compatible with later Elasticsearch versions.

  • elasticsearch8: Elasticsearch (version 8) database used by the serviceaugmenter service (servicepedia).
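
For reference, each KPI ultimately boils down to one such query. Below is a minimal sketch of a KPI-style query sent directly to the catalogue database, assuming the psycopg2 driver and credentials in the environment (in practice the queries are routed through Dremio, as described next):

import os

import psycopg2  # PostgreSQL driver (assumed available)

# connect to the catalogue database with credentials taken from the environment
conn = psycopg2.connect(
    host=os.environ.get("POSTGRES_HOST"),
    port=5432,
    dbname="catalogue_production",
    user=os.environ.get("POSTGRES_USERNAME"),
    password=os.environ.get("POSTGRES_PASSWORD"),
)

with conn.cursor() as cur:
    # KPI-style query: count the interlinkers published in the catalogue
    cur.execute("SELECT COUNT(*) FROM interlinker")
    print(cur.fetchone()[0])

conn.close()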

Dremio

In the simplest of terms, Dremio is a data lake engine, meaning that you can use Dremio to liberate your data through live and interactive queries sent directly to your cloud-based or on-prem data lake storage.

https://www.dremio.com/

Deployment

Dremio is deployed as a single container alongside the other services. You can see it in the docker-compose files of the different environments:

https://github.com/interlink-project/interlink-project/blob/master/envs/development/docker-compose.yml

...

dremio:
    container_name: ${PLATFORM_STACK_NAME}-dremio
    restart: always
    image: dremio/dremio-oss:21.2
    volumes:
        - dremio-data:/opt/dremio/data
    networks:
        - traefik-public
        - default
    ports:
        - "9047:9047"   # UI (HTTP)
        # - "31010:31010" # ODBC/JDBC clients
        # - "2181:2181"   # ZooKeeper
        # - "45678:45678" # Inter-node communication
    labels:
        - traefik.enable=true
        - traefik.docker.network=traefik-public
        - traefik.http.routers.${PLATFORM_STACK_NAME}-dremio.entrypoints=websecure
        - traefik.http.routers.${PLATFORM_STACK_NAME}-dremio.tls.certresolver=letsencrypt
        - traefik.http.routers.${PLATFORM_STACK_NAME}-dremio.tls=true
        - traefik.http.routers.${PLATFORM_STACK_NAME}-dremio.rule=Host(`dremio.${DOMAIN}`)
        - traefik.http.services.${PLATFORM_STACK_NAME}-dremio.loadbalancer.server.port=9047
    ...

Once the Docker service is up, the Dremio GUI is accessible at https://dremio.dev.interlink-project.eu/

However, on the first launch there is no admin account yet. For that reason, a Python script (setup-dremio.py) has been created that calls the Dremio API to set up the administrator account and register the different data backends as sources.

Dremio setup script at https://github.com/interlink-project/interlink-project/blob/master/envs/development/setup-dremio.py

import json
import requests
import time
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

class Config():
    userName = os.environ.get("DREMIO_USERNAME")
    firstName = os.environ.get("DREMIO_USERNAME")
    lastName = os.environ.get("DREMIO_USERNAME")
    # must contain uppercase and lowercase letters, digits, and a special character
    password = os.environ.get("DREMIO_PASSWORD")
    email = os.environ.get("DREMIO_EMAIL")
    dremioServer = "http://localhost:9047"
    token = None

    def get_headers(self):
        if self.token:
            return {'content-type': 'application/json', 'Authorization': '_dremio{authToken}'.format(authToken=self.token)}
        return {'content-type': 'application/json', 'Authorization': '_dremionull'}


config = Config()

...

# bootstrap the first (admin) user
response = requests.put('{server}/apiv2/bootstrap/firstuser'.format(server=config.dremioServer), headers=config.get_headers(), data=json.dumps({
    "userName": config.userName,
    "firstName": config.firstName,
    "lastName": config.lastName,
    "createdAt": int(time.time()),
    "email": config.email,
    "password": config.password,
}))

login()  # authenticate against Dremio to obtain the token for subsequent requests

# POSTGRES SOURCES
body = {
    "entityType": "source",
    "name": "catalogue",
    "description": "catalogue data",
    "type": "POSTGRES",
    "config": {
        "username": os.environ.get("POSTGRES_USERNAME"),
        "password": os.environ.get("POSTGRES_PASSWORD"),
        "hostname": os.environ.get("POSTGRES_HOST"),
        "port": "5432",
        "authenticationType": "MASTER",
        "fetchSize": "0",
        "databaseName": "catalogue_production"
    },
}
apiPost('catalog', body=body)

body = {
    "entityType": "source",
    "name": "coproduction",
    "description": "coproduction data",
    "type": "POSTGRES",
    "config": {
        "username": os.environ.get("POSTGRES_USERNAME"),
        "password": os.environ.get("POSTGRES_PASSWORD"),
        "hostname": os.environ.get("POSTGRES_HOST"),
        "port": "5432",
        "authenticationType": "MASTER",
        "fetchSize": "0",
        "databaseName": "coproduction_production"
    },
}
apiPost('catalog', body=body)

# ELASTICSEARCH SOURCES
body = {
    "entityType": "source",
    "name": "elastic",
    "description": "elasticsearch for logs",
    "type": "ELASTIC",
    "config": {
        "username": os.environ.get("ELASTIC_USERNAME"),
        "password": os.environ.get("ELASTIC_PASSWORD"),
        "hostList": [{"hostname": os.environ.get("ELASTIC_HOST"), "port": os.environ.get("ELASTIC_PORT")}],
        "authenticationType": "MASTER",
    },
}
apiPost('catalog', body=body)

body = {
    "entityType": "source",
    "name": "elastic2",
    "description": "elasticsearch for logs",
    "type": "ELASTIC",
    "config": {
        "username": os.environ.get("ELASTIC_USERNAME"),
        "password": os.environ.get("ELASTIC_PASSWORD"),
        "hostList": [{"hostname": "elasticsearch", "port": os.environ.get("ELASTIC_PORT")}],
        "authenticationType": "MASTER",
    },
}
apiPost('catalog', body=body)
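
The login and apiPost helpers live in the part of the script elided above. Based on Dremio's REST API, a minimal sketch of what they could look like (the actual implementation in the repository may differ):

def login():
    # POST the credentials to Dremio's login endpoint and keep the auth token
    response = requests.post(
        '{server}/apiv2/login'.format(server=config.dremioServer),
        headers=config.get_headers(),
        data=json.dumps({"userName": config.userName, "password": config.password}),
    )
    config.token = response.json()["token"]

def apiPost(endpoint, body=None):
    # POST a JSON body to the v3 REST API, e.g. 'catalog' -> /api/v3/catalog
    return requests.post(
        '{server}/api/v3/{endpoint}'.format(server=config.dremioServer, endpoint=endpoint),
        headers=config.get_headers(),
        data=json.dumps(body),
    )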

The variables are obtained from the .env file (via the find_dotenv and load_dotenv functions) or directly from the environment. THAT IS WHY THIS SCRIPT MUST BE EXECUTED BY THE WORKFLOWS AND NOT MANUALLY: it needs access to ALL the credentials required to establish a connection to the databases. Some of those credentials exist only in the GitHub secrets and, consequently, only in the shell spawned by the workflows.
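
For illustration, a hypothetical .env excerpt with the variables the script reads (values are placeholders; the real ones live in the environment files and the GitHub secrets):

DREMIO_USERNAME=...
DREMIO_PASSWORD=...
DREMIO_EMAIL=...
POSTGRES_USERNAME=...
POSTGRES_PASSWORD=...
POSTGRES_HOST=...
ELASTIC_USERNAME=...
ELASTIC_PASSWORD=...
ELASTIC_HOST=...
ELASTIC_PORT=...

For example, the update-dev-environment workflow exports the secrets before invoking the script: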

name: update-dev-environment
...
jobs:
  deploy:
    ...
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - name: Deploy Dev SSH
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.DEV_HOST }}
          username: ${{ secrets.DEV_USERNAME }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            export LOOMIO_SMTP_USERNAME=${{ secrets.LOOMIO_SMTP_USERNAME }}
            export LOOMIO_AAC_APP_SECRET=${{ secrets.LOOMIO_AAC_APP_SECRET }}
            export MAIL_PASSWORD=${{ secrets.DEV_MAIL_PASSWORD }}
            export LOOMIO_SECRET_COOKIE_TOKEN=${{ secrets.LOOMIO_SECRET_COOKIE_TOKEN }}
            export LOOMIO_SMTP_PASSWORD=${{ secrets.LOOMIO_SMTP_PASSWORD }}        
            export LOOMIO_DEVISE_SECRET=${{ secrets.LOOMIO_DEVISE_SECRET }}
            ...

            pip3 install python-dotenv && python3 setup-dremio.py
            ...

The script is invoked at line 75 of the update-dev-environment workflow.

Making queries

Dremio makes it possible to run SQL queries against both SQL and NoSQL databases. This allows creating queries that join data from different backends.

Simple queries

Count all the coproduction processes in the PostgreSQL coproduction database:

SELECT COUNT(*) FROM coproduction.public.coproductionprocess

Get all the user activity logs (from the logs index in the elastic2 backend):

SELECT * FROM elastic2.logs.log

Cross-backend queries

For example, get the catalogue interlinkers that have been used to create at least one asset, joining the catalogue PostgreSQL database with the user activity logs stored in elastic2:

SELECT * FROM catalogue.public.interlinker
WHERE id IN (
    SELECT softwareinterlinker_id FROM elastic2.logs.log
    WHERE action LIKE 'CREATE' AND model LIKE 'ASSET'
)

Automation of the KPI collection

Dremio exposes a REST API (https://docs.dremio.com/software/rest-api/) that can be used to send queries to the data backends. Taking advantage of that, a Python script has been created to automate the KPI collection by running the queries periodically.
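
For illustration, here is roughly what sending a single query through that API involves: a minimal sketch based on Dremio's /api/v3/sql and /api/v3/job endpoints, with the auth token and polling simplified (the actual script wraps this logic in its run_queries helper):

import time

import requests

DREMIO = "http://localhost:9047"
# the token is obtained by POSTing the credentials to /apiv2/login
HEADERS = {"content-type": "application/json", "Authorization": "_dremio<token>"}

# submit the SQL statement; Dremio answers with a job id
job = requests.post(
    f"{DREMIO}/api/v3/sql",
    headers=HEADERS,
    json={"sql": "SELECT COUNT(*) FROM coproduction.public.coproductionprocess"},
).json()

# poll the job until it reaches a terminal state
while True:
    status = requests.get(f"{DREMIO}/api/v3/job/{job['id']}", headers=HEADERS).json()
    if status["jobState"] in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(1)

# fetch the rows once the job has completed
if status["jobState"] == "COMPLETED":
    results = requests.get(f"{DREMIO}/api/v3/job/{job['id']}/results", headers=HEADERS).json()
    print(results["rows"])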

The script is located in cronjobs/jobs/dremio/kpis.py.

https://github.com/interlink-project/interlink-project/blob/master/envs/development/cronjobs/jobs/dremio/kpis.py

from sheets import service, sheet_id
from datetime import datetime
import dateutil.relativedelta
from common import *
import json
import os

# date in the left cell
date_time = datetime.now()
str_date = date_time.strftime("%Y-%m-%d %H:%M:%S")

d2 = date_time - dateutil.relativedelta.relativedelta(months=1)
one_month_before = d2.strftime("%Y-%m-%d")

login()

queries = [
    {
        "name": "A7: Number of coproduction processes",
        "sql": "SELECT COUNT(*) FROM coproduction.public.coproductionprocess",
        "extract_count": True
    },
    {
        "name": "A7.1: Number of coproduction processes in english",
        "sql": "SELECT COUNT(*) FROM coproduction.public.coproductionprocess WHERE coproductionprocess.\"language\" LIKE 'en'",
        "extract_count": True
    },
    (...)    
]

print("Obtaining kpis on", str_date)
results = run_queries(queries)

# print(json.dumps(results))

ENVIRONMENT = os.environ.get("PLATFORM_STACK_NAME")

# send data to the Google Sheets document
try:
    result = service.spreadsheets().values().get(
        spreadsheetId=sheet_id, range=f"{ENVIRONMENT}!A1:ZZ1").execute()
    header = result.get('values', [[]])[0]
except Exception:
    # the sheet for this environment has no header row yet
    header = []

values_to_insert = []
update = False

if len(header) > 0:
    # check if all query names are present in header and add a new one if not present
    for query in queries:
        name = query.get("name")
        if not name in header:
            header.append(name)
            update = True
            print(f"Added {name} to header")

    # update header if needed
    if update:
        service.spreadsheets().values().update(
            spreadsheetId=sheet_id,
            range=f"{ENVIRONMENT}!A1:ZZ1",
            body={
                "majorDimension": "ROWS",
                "values": [header]
            },
            valueInputOption="USER_ENTERED"
        ).execute()
else:
    # if there are no cells in the header, create them with the kpis names
    header = ["Date time"] + [i.get("name") for i in queries]
    values_to_insert.append(
        header
    )
    values_to_insert.append(
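        # Spanish-locale Google Sheets formula (INDICE=INDEX, NO=NOT, ESBLANCO=ISBLANK, FILAS=ROWS):
        # builds a "Last value" row showing the last non-blank value of each column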
        ["Last value"] + [f"=INDICE( FILTER( {char}3:{char} , NO( ESBLANCO( {char}3:{char} ) ) ) , FILAS( FILTER( {char}3:{char} , NO( ESBLANCO( {char}3:{char} ) ) ) ) )" for char in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
    )

# set each value of row depending on the index of the name of the kpi in the header
row = [str_date]
for query_name, query_result in results.items():
    index = header.index(query_name)
    set_list(row, index, json.dumps(query_result))

values_to_insert.append(
    row
)


# append (the header if necessary) and the row to the existing sheet, defined by ENVIRONMENT (development, demo, zgz...)
service.spreadsheets().values().append(
    spreadsheetId=sheet_id,
    range=f"{ENVIRONMENT}!A:Z",
    body={
        "majorDimension": "ROWS",
        "values": values_to_insert
    },
    valueInputOption="USER_ENTERED"
).execute()

print(
    f"Document updated. See it at: https://docs.google.com/spreadsheets/d/{sheet_id}")

The script defines a list (queries) that contains a dict for every KPI. These dicts must contain at least two keys: name (the name of the KPI) and sql (the SQL query to run). If the query is a count query (COUNT(*)), you may also want to set the extract_count key to True so that only the number the query returns is stored.

The results are stored in a Google Sheets document located at: https://docs.google.com/spreadsheets/d/1WDA5hatG7NLCzBTpIoVFM_yiRU_Nn6kTsbeYyPXs03A/edit#gid=0

How to add new KPIs

The name attribute is used to locate or create the column headers. Every time an SQL query is executed, its result is placed in the row being built (the values_to_insert list) at the index of the element in the header equal to the name attribute. That way, if a query fails, the results of the remaining queries still end up in the correct columns.

If the name of a query is not found in the header, it is appended, so a new column is created.

In conclusion, the only thing needed to add a new KPI is to add a new entry to the queries list, as in the sketch below.
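
For example, a hypothetical KPI counting the organizations in the coproduction database could be added like this (the KPI code "A9" and the table name are illustrative, not taken from the repository):

# new entry appended to the queries list in cronjobs/jobs/dremio/kpis.py
queries.append({
    "name": "A9: Number of organizations",  # becomes the column header in the spreadsheet
    "sql": "SELECT COUNT(*) FROM coproduction.public.organization",  # hypothetical table
    "extract_count": True,  # keep only the number returned by the COUNT(*) query
})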

:warning: Every time you update the KPIs script file, the crontab service must be restarted for the latest changes to take effect.

Cronjob (crontab)

The script is executed periodically (the interval depends on the environment). There is a service called crontab in every docker-compose file that is responsible for executing the cronjobs.

The configuration is located at https://github.com/interlink-project/interlink-project/blob/master/envs/development/cronjobs/config.json

The configuration of the dev environment is as follows:

[
    {
        "schedule": "@every 10m",
        "command": "cd /opt/jobs/dremio && python3 kpis.py",
        "onstart": true
    },
    ...
]

To restart it, you can access Portainer and restart the service from the GUI:

https://portainer.dev.interlink-project.eu/#!/home
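
Alternatively, you can restart it from a shell on the host with docker compose, e.g. docker compose restart crontab (assuming the compose service is named crontab, as described above).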