
1. Introduction

We love software. It really makes our lives easier ... if it works.

If the worst happens (i.e. the application stops working) then we have to call or email somebody and ... wait. For a company this is definitely bad news: people cannot work, and that can cost a lot of money.

That was the user's side (the people who want to use the application). The other side is the maintenance team that eventually gets the call or mail and has to resolve the problem. Usually there are well-known processes for solving an issue (e.g. in case of an OutOfMemoryError the application has to be restarted, or if "ImportDataException: Not enough space on drive" occurs then the old import files have to be deleted first and the import restarted, etc.).
So what the database / middleware / Unix / ... administrator has to do is find the right process that remedies the problem and execute it step by step. I am not saying that the ladies and gentlemen mentioned in the previous sentence do nothing else, but sometimes they have to perform these boring, uninteresting tasks.

It is interesting to examine what applications do with errors and exceptions: basically nothing. They write the errors to their log files and that is all. There are 3rd-party tools that can monitor the log files and send a mail or text message, but that is where it ends.

So there are tasks here that can be done automatically.

What needs to be done is to define which log files (host, log file location) have to be monitored, to specify the flow of commands (operating system commands executed on the same or on separate hosts) like an activity diagram, and to bind these together, i.e. to define the execution flow that has to run when an error (e.g. OutOfMemoryError) is detected in an application log file.

This is what Reaction was born for.

1.1. What is Reaction for?

In short, Reaction perceives incidents by monitoring the application log files: if a known regular expression pattern (e.g. .*java\.lang\.OutOfMemoryError.*) is detected there, it selects the execution flow (basically a series of operating system commands) that can fix the problem and executes its tasks one by one on one or more hosts.

How can it do all of these?

First of all, the log files that need to be monitored have to be defined. Here you can create a hierarchy, like: we have the 'Hermes' clustered CRM system that runs on 4 server machines, with different log files per host.

Then the execution flow has to be created, which can contain external (OS) command tasks, mail sending tasks, if-else conditions and failure tasks (see 2.2.3.3).

Finally these two (the execution flow and the system, which is basically the log file location) have to be glued together by creating a so-called error detector, where you can set the pattern (that will be searched for in the log file) and select the system (log file) and the flow.
Once these data are set, all you have to do is start the worker as a background process on the machine where the log file resides or where the OS command has to be executed.

Reaction has 3 main components:

The administration web application is the tool where all this information can be specified. You can also find detailed information there about the progress of started / scheduled flows; flows can be scheduled or started manually, etc.

The Reaction engine is responsible for communicating with the workers, starting/approving/scheduling flows, managing the flow execution with the workers, etc.

There are 2 kinds of workers: the reader worker and the executor worker.

These 2 can be started / stopped separately. It is important to note that they operate in the background, so all that has to be done is to start them; only the configuration file has to be set up correctly.
All the information (i.e. the location of the log file, etc.) is synchronised automatically.

1.2. How does it work?

First of all, have a look at the picture below.

The Administration web application manages the reference data (see later) in the database (JDBC) and lets the user maintain the started and scheduled execution flows. It also communicates with the Reaction engine (starting a flow manually, scheduling it, etc. - REST).

The Reaction engine manages the events (started and scheduled execution flows) in the database (JDBC), receives new incidents from the reader worker, waits for the results of the OS commands executed on the executor worker, etc. (REST)

The reader worker examines the log file and reports incidents (REST); the executor worker executes OS commands and sends their results to the Engine (REST).

The Reaction Engine and the Administration web application have to be deployed first.
The Engine is a Java application, tested on Tomcat 8, Wildfly 10 and Weblogic 12. The admin web application is a Python web application, tested on Apache (with mod_wsgi).

The reader worker has to be started on all the hosts where those systems run whose log files have to be monitored. The executor worker has to be started on the host machines which are involved in any of the execution flows (i.e. where external commands have to be executed).

After the deployment of the engine and the admin web application and the start of the workers, only the Administration web application has to be used to manage the reference data. All the data synchronisation, etc. is done in the background, automatically. If new hosts have to be monitored or involved in an execution flow, that can be done at any time.

1.3. Licensing

All the Reaction components are under MIT License.

2. Components

2.1. Worker

2.1.1. Quick introduction

The workers' tasks are to monitor the log files, report incidents, execute operating system commands and send back the results of those commands.

2.1.2. How to install

The installation is easy: just copy the zip to the host machine and extract it. (Please be aware that in the download section a Dockerfile can be found that contains the specific Linux commands to install the worker!)

The configuration file (conf/worker.yml) has to be set up properly and the credentials (security/credential) for the REST authentication have to be added.
Also please set up the .sh / .bat files correctly. The values that have to be altered are at the beginning of the file (in the VALUES TO BE CHANGED section).

The worker needs JDK8 to run.

The worker can be used on Linux and on Windows. Both the reader and the executor have the same options, so only the reader worker is described here.

2.1.2.1. on Linux

The worker has to run as root. Please don't forget to change the owner of the worker directory and the files in it to root (e.g. chown -R root worker-1.0)!
If the worker is executed by a user other than root then this user has to be in the sudoers list.

jsvc (https://commons.apache.org/proper/commons-daemon/jsvc.html) has to be present on the host machine and its path has to be set properly in the manage_executor.sh and manage_reader.sh files.

Executing the worker without parameters will list the available options.

ric_flair@mylaptop:/local/reaction/worker> sudo ./manage_reader.sh

[sudo] password for vikhor:

Usage: sudo ./manage_reader.sh {start|stop|restart|status}

2.1.2.2. on Windows

On Windows the worker has to be installed as a service, so the options are different:

c:\work\reaction\worker>manage_reader.bat

"Usage: manage_reader.bat {install|deinstall|start|stop}"

2.1.3. Communication

2.1.3.1. Retry

There are two types of calls sent from the worker to the engine that are crucial to deliver, or at least every effort must be made to complete them (reporting an incident from the reader and sending back a command result from the executor):

If the engine is not available (e.g. it is being restarted) then these calls have to be retried. A queue is created for the messages to be sent, and all these requests are put into this queue until the back-end is online again.

The messages will be removed from this queue if:

  1. the message is sent (the engine is available again)
  2. the message is too old (expired) -> this can be configured (application.reader.call.validity_interval and application.executor.call.validity_interval)

If the number of events in the queue is higher than the capacity of the queue then the new message won't be added to the queue. It is worth mentioning that the retry mechanism runs in a separate thread, so it won't interfere with the main processing.

Also, the time the retry mechanism waits between two tries can be configured (see application.reader.call.sleeping and application.executor.call.sleeping).
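
The retry behaviour described above can be sketched roughly as follows (the class and member names are illustrative, not the worker's actual implementation; the related configuration properties are referenced in the comments):

    import java.time.Instant;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative sketch only - not the worker's real classes.
    class RetryQueue {

        // A message to be delivered to the engine, plus the moment it expires.
        static class PendingCall {
            final String payload;
            final Instant expiresAt;
            PendingCall(String payload, Instant expiresAt) {
                this.payload = payload;
                this.expiresAt = expiresAt;
            }
        }

        private final LinkedBlockingQueue<PendingCall> queue;
        private final long sleepingMillis;                      // cf. application.*.call.sleeping

        RetryQueue(int capacity, long sleepingMillis) {         // cf. application.*.call.queue_capacity
            this.queue = new LinkedBlockingQueue<>(capacity);
            this.sleepingMillis = sleepingMillis;
        }

        void submit(String payload, long validitySeconds) {    // cf. application.*.call.validity_interval
            PendingCall call = new PendingCall(payload, Instant.now().plusSeconds(validitySeconds));
            if (!queue.offer(call)) {
                // the queue is full -> the new message is not added, as described above
                System.err.println("Record not added to the queue as it hits the limit: " + payload);
            }
        }

        // Runs in a separate thread so it doesn't interfere with the main processing.
        void retryLoop() throws InterruptedException {
            while (true) {
                PendingCall call = queue.take();
                if (call.expiresAt.isBefore(Instant.now())) {
                    continue;                   // too old (expired) -> removed without sending
                }
                if (!trySend(call.payload)) {
                    queue.offer(call);          // engine still offline -> keep it queued
                    Thread.sleep(sleepingMillis);   // wait between two tries
                }
            }
        }

        private boolean trySend(String payload) {
            return false;                       // placeholder for the real REST call to the engine
        }
    }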

2.1.3.2. Security

Authentication

When the worker calls the engine, it first has to authenticate itself. HMAC authentication (https://en.wikipedia.org/wiki/Hash-based_message_authentication_code) is used, i.e. the clear-text password is not sent (so HTTPS is not needed) and it cannot be decrypted, as it is not a static hash.
The user and password can be specified in the 'security/credential' file, where the first line contains the user name and the second the password. The same credential has to be stored in the engine too (see later).
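
As an illustration of the idea (a sketch only: the worker's real message layout, header name and hash algorithm are not documented here, so HmacSHA256 and the X-Signature header are assumptions of this example):

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    // Sketch: sign a request body with the shared password from security/credential.
    // The password itself never travels; only the keyed hash is sent with the request.
    public class HmacExample {
        public static void main(String[] args) throws Exception {
            String password = "f0dedb78-3eb6-4a56-8428-e8e40584a01c"; // 2nd line of security/credential
            String requestBody = "{\"hostName\":\"ADNLT653\",\"logLine\":\"...OutOfMemoryError...\"}";

            Mac mac = Mac.getInstance("HmacSHA256");  // the algorithm is an assumption for this sketch
            mac.init(new SecretKeySpec(password.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            String signature = Base64.getEncoder()
                    .encodeToString(mac.doFinal(requestBody.getBytes(StandardCharsets.UTF_8)));

            // The engine recomputes the HMAC with the same stored credential and
            // accepts the call only if the two signatures match.
            System.out.println("X-Signature: " + signature);
        }
    }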

Message encryption

Message-level encryption can be utilised; its advantage over HTTPS is that the message can pass through any firewall / router without re-establishing the HTTPS connection (and without maintaining the credentials on these stations, etc.).
Two types of encryption can be used: credential-based (symmetric) and certificate-based (asymmetric); see rest_call.encryption.type in the configuration file.

2.1.4. Reader worker

The reader gets from the engine the log file locations that have to be monitored; it constantly monitors the log file(s), examines every line that is appended to the file and, if necessary (i.e. the pattern is found in the line), reports an incident to the engine.

It is important to know that the reader worker queries the active error detectors when getting the log file locations, so a log file will only be monitored if an active error detector has been created with that system (log file).

The reader starts as many threads as there are log files to be monitored.

As mentioned before, the synchronisation (getting the log file locations) is automatic (how often the reader calls the Reaction engine can be configured).
When there is a change (i.e. the log file location is modified in the management web app), the current thread that monitors the log file is stopped and a new one is started with the new location. No line in the log file is lost while stopping the old thread and starting the new one.

2.1.5. Executor worker

The executor executes the OS command (that is defined in the execution flow) and sends back the result to the engine. It is recommended to set the executor to call the engine frequently to check whether there is a command to be executed, so the flow execution will seem smooth.

It is possible that more than one command is sent by the engine for execution.
If the commands are different then they are executed in parallel; if two or more commands are the same then they run sequentially.
For example: the following commands arrive (command-C, user-U, pattern-P): (C1, U1, P1), (C2, U1, P2) and (C1, U1, P1) => two threads are started and one of them executes two commands, as the sketch below illustrates.
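
A sketch of this grouping logic (illustrative names, not the executor's actual classes; the (command, user, pattern) triple is assumed to be the grouping key, as in the example above):

    import java.util.*;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch: identical (command, user, pattern) triples run sequentially on one
    // thread; different triples run in parallel on separate threads.
    public class CommandDispatcher {
        public static void dispatch(List<String[]> commands, int maxThreads) {
            // group the incoming commands by their (command, user, pattern) key
            Map<String, List<String[]>> byKey = new LinkedHashMap<>();
            for (String[] c : commands) {
                String key = String.join("|", c);
                byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(c);
            }
            // cf. application.executor.max_nr_running_commands
            ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
            for (List<String[]> sameCommands : byKey.values()) {
                pool.submit(() -> {
                    for (String[] c : sameCommands) {
                        System.out.println("executing: " + Arrays.toString(c)); // runs sequentially
                    }
                });
            }
            pool.shutdown();
        }

        public static void main(String[] args) {
            dispatch(Arrays.asList(
                new String[]{"C1", "U1", "P1"},
                new String[]{"C2", "U1", "P2"},
                new String[]{"C1", "U1", "P1"}   // same as the first -> same thread
            ), 20);   // => 2 threads, one of them executes 2 commands
        }
    }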

The executor runs the command, and usually every command has an output. The executor can do 2 things with the output:

  1. capture the last 15 lines and send them back to the engine (the engine stores these lines with the event and they can be checked on the web GUI)
  2. if an output pattern is defined for the command then it tries to match this regular expression against every line and sends back the last match as the extracted output value

2.1.6. Configuration file

The location of the configuration file is conf/worker.yml. The content is as follows:

host_name
    the name of the machine
    it is important that it matches the value of the host property of the system specified in the admin GUI (system -> host)
    it can be anything but it is highly recommended to use the real name of the host
    sample value: ADNLT653
rest_call.engine_endpoint
    the endpoint of the Reaction Engine that is used by the REST client
    sample value: http://localhost:7003/reaction-engine
rest_call.url_read_timeout
    the read timeout in milliseconds, i.e. the timeout on waiting to read data
    specifically, if the server fails to send a byte within the timeout after the last byte, a read timeout error is raised
    sample value: 5000
rest_call.url_connection_timeout
    the connection timeout in milliseconds, i.e. the timeout for making the initial connection (completing the TCP handshake)
    sample value: 1000
rest_call.credential.file
    the location of the credential file that contains the username / password for authenticating against the REST service of the Reaction Engine
    sample value: /local/reaction/worker/ADNLT653/security/credential
rest_call.encryption.type
    specifies whether the encryption is symmetric (credential-based) or asymmetric (certificate-based)
    possible values:
      • CREDENTIAL_BASED : symmetric encryption; a single string key is the base of the encryption; the key is taken from rest_call.credential.file, i.e. the same key (password) is used as for authentication, so the encryption keystore and truststore don't have to be defined
      • CERTIFICATE_BASED : asymmetric encryption, i.e. public and private keys are used to encrypt the REST call
      • NONE : no encryption is needed
    sample value: NONE
rest_call.encryption.key_size
    the key size in bits
    the maximum size is 128 bits by default; if AES 256 or AES 512 has to be used then the JDK has to be extended with the Java Cryptography Extension (JCE) (search for 'jdk jce aes 256' in Google); if AES 256 or AES 512 is used then the server must be able to handle it too, i.e. the JDK on the server has to be updated as well
    sample value: 128
rest_call.encryption.transformation
    the name of the transformation
    for more info please see https://docs.oracle.com/javase/7/docs/technotes/guides/security/StandardNames.html#Cipher
    sample value: RSA/ECB/PKCS1Padding
rest_call.encryption.keystore.location
    the location of the keystore holding the private key of the Reaction Worker, used to decrypt the message
    only required if CERTIFICATE_BASED encryption is used
    sample value: /local/reaction/worker/ADNLT653/security/clientkeystore.jck
rest_call.encryption.keystore.password
    the password of the keystore
    only required if CERTIFICATE_BASED encryption is used
    sample value: password
rest_call.encryption.keystore.type
    the type of the keystore
    only required if CERTIFICATE_BASED encryption is used
    sample value: JCEKS
rest_call.encryption.keystore.key_alias
    the alias in the keystore that points to the certificate
    only required if CERTIFICATE_BASED encryption is used
    sample value: client
rest_call.encryption.truststore.location
    the location of the truststore holding the public key of the Reaction Engine, used to encrypt the message
    only required if CERTIFICATE_BASED encryption is used
    sample value: /local/reaction/worker/ADNLT653/security/clienttruststore.jck
rest_call.encryption.truststore.password
    the password of the truststore
    only required if CERTIFICATE_BASED encryption is used
    sample value: password
rest_call.encryption.truststore.type
    the type of the truststore
    only required if CERTIFICATE_BASED encryption is used
    sample value: JCEKS
rest_call.encryption.truststore.key_alias
    the alias in the truststore that points to the certificate
    only required if CERTIFICATE_BASED encryption is used
    sample value: server
application.reader.sleeping
    the reader waits the specified number of seconds before trying to get the system list again
    sample value: 100
application.reader.multiline_error_supported
    it is possible that the header doesn't exist at the beginning of each line, e.g.
       2017-02-14 16:34:00,203 default task-2 blahblah.acme.vasco.VascoWrapper ERROR failed to load native library from classpath: 'aal2sdk' Hint: make sure the library is accessible on the java.library.path.(per default bin/native or set the path LD_LIBRARY_PATH
            java.lang.UnsatisfiedLinkError: no aal2sdk in java.library.path
                  at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
                  ...
       2017-02-14 16:34:01,298 INFO [stdout] (default task-2) 16:34:01,298 ERROR [VascoWrapper] current java.library.path=${LD_LIBRARY_PATH}:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
    possible values:
      • true : the above scenario is supported, i.e. the reader finds the header first and if it doesn't find a header in the next line then it uses the one found previously
        disadvantage: if more than one application uses the same log file with different header patterns then they cannot be differentiated
      • false : not supported, i.e. lines that don't have a header won't be examined
    sample value: false
application.reader.file_system_check_interval
    the amount of time in milliseconds to wait between checks of the file system when monitoring the log file for changes
    sample value: 800
application.reader.log_charset
    the charset of the log file
    it can be checked with the command file -i <logfile location>
    sample value: US-ASCII
application.reader.call.queue_capacity
    the capacity of the queue used when a retry is needed
    when the engine back-end is offline, the events that have to be reported mustn't be lost -> the reportEvent REST call is retried until
      • the back-end is online again
      • the event is too old (expired) -> it can be configured
      • the number of events is higher than the capacity of the queue
    the capacity of the queue mustn't be too low as it can lead to lost events even if the back-end is online!
    an event is removed from the queue when the call succeeds; if too many events arrive then the queue can overflow (e.g. the capacity is 8, the REST call needs 0.5 sec to finish and 10 new events arrive in 0.5 sec, so 2 events will be lost -> the following entry can be seen in the log: "The following record is not added to the queue as it hits the limit...")
    sample value: 10000
application.reader.call.validity_interval
    how long a possible incident (event) remains valid if the engine is offline (in seconds)
    i.e. there is an OutOfMemoryError in the log so a server should be restarted, but the Reaction engine is offline -> it is not a valid scenario for the server to be restarted 2 days later, so the event mustn't be valid forever
    sample value: 1200
application.reader.call.sleeping
    if the Reaction engine is offline the reader tries to resend the possible incident; this sets how much time to wait between two calls (in milliseconds)
    sample value: 3000
application.executor.sleeping
    the executor waits the specified number of seconds before asking the Engine again for commands to be executed
    sample value: 4
application.executor.max_nr_running_commands
    the maximum number of running commands
    the commands to be executed run in parallel in different threads; this setting tells how many threads can be started
    sample value: 20
application.executor.call.queue_capacity
    see application.reader.call.queue_capacity
application.executor.call.validity_interval
    after the execution of a command the result (success flag, output, etc.) has to be sent back to the Reaction engine; if the engine is offline the result is resent, but it needs a validity interval (in seconds), similarly to application.reader.call.validity_interval
    sample value: 3600
application.executor.call.sleeping
    the executor waits the specified number of milliseconds between two attempts to send the result of the executed command
    sample value: 3000
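
For orientation, here is a condensed sketch of what conf/worker.yml might look like, assembled from the keys above (the exact YAML nesting is an assumption made for this example; the worker.yml shipped in the zip is the authoritative reference):

    # Sketch only - the shipped conf/worker.yml is the reference
    host_name: ADNLT653
    rest_call:
      engine_endpoint: http://localhost:7003/reaction-engine
      url_read_timeout: 5000
      url_connection_timeout: 1000
      credential:
        file: /local/reaction/worker/ADNLT653/security/credential
      encryption:
        type: NONE
    application:
      reader:
        sleeping: 100
        multiline_error_supported: false
        file_system_check_interval: 800
        log_charset: US-ASCII
        call:
          queue_capacity: 10000
          validity_interval: 1200
          sleeping: 3000
      executor:
        sleeping: 4
        max_nr_running_commands: 20
        call:
          queue_capacity: 10000
          validity_interval: 3600
          sleeping: 3000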

2.1.7. Folder

Log file

The log file can be found in the log folder. The worker.log file contains all the log entries that the reader and executor workers produce.
The logging can be reconfigured at runtime (without stopping / starting the worker) by altering the conf/logback.xml file; after a few seconds (default 15 sec, configured in the same file -> scanPeriod="15 seconds") the changes are picked up.

There are 2 other log files in the 'log' folder:

(the [worker name] is configured in the manage_reader.sh / manage_reader.bat files -> see the NAME variable). They are maintained by the jsvc (on Linux) / prunsrv.exe (on Windows) program that starts / stops the worker. These files can contain additional information if the worker couldn't be started / stopped.

Security

The following 3 files might be found in the folder:

Dependencies

All the JAR files that the workers need can be found in the lib folder.

Configuration

The worker configuration file and the logging configuration file are in the conf folder.

2.2. Reaction Engine

2.2.1. What does it do?

The engine's tasks are as follows:

2.2.2. State transition

An event of the execution flow can be in different states while running. The state transitions can be seen in the following diagram:

The process starts when a worker reports an incident (i.e. it finds a match for the pattern in the log file -> the pattern is a regular expression like .*OutOfMemoryError [0-9]+.*).
The engine then tries to find a matching error detector (see more information about error detectors in the reference data). It basically means that it looks for a matching regular expression among the error detector records; an error detector is assigned to a system, and the log file location (system) where the pattern was found has to belong to that error detector. Be aware that systems can form a hierarchy, so an error detector might match through a parent system (e.g. there is the Hermes CRM system and it has 2 server instances, Hermes web 0 and Hermes web 1 -> if the Hermes CRM system is assigned to the error detector then both servers will be examined).

If it cannot find a matching error detector then the event won't be processed (saved); the engine's log indicates that it was ignored (no event is created).

If it finds an error detector record then one of the following scenarios can happen:

As per the state diagram, from the WAITING_FOR_OTHERS state the event can go to

The flow can be rejected (-> REJECTED) or approved. If normal approval is chosen on the web application then the event is either scheduled (-> if we are not in the maintenance window) or started. If the flow is forced to start (CONFIRMED_AND_FORCED_START) then it starts immediately, regardless of whether we are in the maintenance window (if one is defined at all).
The CONFIRMED and CONFIRMED_AND_FORCED_START events will be IGNORED if there is already a flow in STARTED status.

A SCHEDULED flow will be started when the specified time arrives. Before starting the flow, it is checked whether another instance of the same flow is already started; if so, this event will be IGNORED.
The event to be scheduled will also be IGNORED if there is another flow being scheduled for the same day. If that flow is scheduled for another day then the current event will be scheduled as well.
It is important to note that the flow is not always scheduled to run at the beginning of the maintenance window. Let's say the window starts at 7PM and a flow whose timeBetweenExecutions is 40 mins ended at 6:45PM. An event (which would trigger the same flow) arriving at 6:55PM will be scheduled to run at 7:25PM (and not at 7PM!), as the snippet below illustrates.
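
The same calculation in code form (a sketch with illustrative names):

    import java.time.LocalTime;

    // Sketch of the scheduling rule from the example above: the next run is the
    // later of the window start and (previous execution end + timeBetweenExecutions).
    public class NextRunExample {
        public static void main(String[] args) {
            LocalTime windowStart   = LocalTime.of(19, 0);          // the window opens at 7PM
            LocalTime previousEnd   = LocalTime.of(18, 45);         // the flow ended at 6:45PM
            LocalTime earliestRerun = previousEnd.plusMinutes(40);  // timeBetweenExecutions = 40 mins
            LocalTime nextRun = earliestRerun.isAfter(windowStart) ? earliestRerun : windowStart;
            System.out.println(nextRun);                            // prints 19:25, not 19:00
        }
    }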

If the flow is started (STARTED) then something can go wrong (FAILED), which might mean the end of the flow. The flow can go back to STARTED from FAILED if the failed task is restarted or skipped on the web application.

If everything went well then the event gets the FINISHED status, which means that the flow ran successfully.

Please be aware that manually started flows (which are started immediately rather than scheduled) won't ever be IGNORED!
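
The transitions that are spelled out in the text can be summarised as follows (a partial sketch assembled from the paragraphs above, not generated from the engine's source; the full diagram may contain more edges, e.g. those leaving WAITING_FOR_OTHERS):

    import java.util.*;

    // Partial map of the state transitions described in this section.
    public class StateTransitions {
        static final Map<String, List<String>> TRANSITIONS = new LinkedHashMap<>();
        static {
            TRANSITIONS.put("CONFIRMED", Arrays.asList("SCHEDULED", "STARTED", "IGNORED"));
            TRANSITIONS.put("CONFIRMED_AND_FORCED_START", Arrays.asList("STARTED", "IGNORED"));
            TRANSITIONS.put("SCHEDULED", Arrays.asList("STARTED", "IGNORED"));
            TRANSITIONS.put("STARTED", Arrays.asList("FAILED", "FINISHED"));
            TRANSITIONS.put("FAILED", Arrays.asList("STARTED"));  // restart or skip the failed task
        }
    }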

2.2.3. Reference data

The reference data is the base information that the Reaction Engine works from; it is managed by the Management Web GUI.
There are 3 base data types: system, execution flow and error detector.
Other information is produced by the Reaction engine:

2.2.3.1 System

Here the log file locations and their related information are stored.
It is important to know that the reader worker won't get a log file location until an active error detector exists that is assigned to the system (i.e. if only a system is defined here, containing the host and the log file location, but no error detector using that system is created, then the reader worker won't get the location, so it won't monitor the log file)!
A hierarchy of systems can be built, so a parent system can be created and the log file locations attached to it.
For example: the Hermes CRM application has 2 web application servers and 2 backend (REST, SOA, etc.) servers, with one log file to be monitored per server. So the following hierarchy can be built:

Hermes
     Hermes web
          Hermes web 0
          Hermes web 1
     Hermes backend
          Hermes backend 0
          Hermes backend 1

The following properties can be set on a higher level (on a parent system) and the children will inherit them:
host, log header pattern enabled, log header pattern, log level, log file location, maintenance window
For example, the log header pattern is the same for Hermes web 0 and Hermes web 1, so it can be added to Hermes web (and has to be left empty at the children). The maintenance window is the same for all servers, so it is defined on Hermes only (and has to be left empty at the children).

Name the name of the system
Description the description of the system
Host the host where the system (log file) resides
It can be any text but it has to be equal to the host name defined in the config file of the worker (2.1.6. Configuration file -> host_name)
Parent system the parent system of the current one
Log file path

the log file location
sample value: /etc/httpd/logs/error_log

Log level Usually every log entry has a log level, like INFO, ERROR, etc. The minimum level that the worker takes into account for reporting can be set here.
For example: we are only interested in warning-and-above logs (i.e. if something is logged at DEBUG level then even a matching entry won't be considered an incident by the worker), so set it to WARN. Even if the pattern in the error detector matches the log entry, the entry won't be reported if its log level is INFO, as the log level of the system is WARN. ERROR log entries, however, will be examined!
Be aware that if the log level has to be examined then the log header pattern has to be specified on this system or on one of its parents!
Possible values: TRACE, DEBUG, INFO, WARN, ERROR, FATAL
Log header pattern Usually every log entry in the Log file path has a header.
For example: <Nov 10, 2017 3:12:34,698 PM CET> <Notice> <WebLogicServer> <BEA-000396>.
It has to be defined if the Log level property is set. (How? -> see later at the web GUI.)
Type Three possible values: Application, Group and Log file.
In the example above 'Hermes' would be an Application, 'Hermes web' a Group and 'Hermes web 0' a Log file.
It is important to mention that only Log file systems are sent back to the reader worker when it gets the list of systems (log file locations). If a GROUP or APPLICATION system is assigned to an error detector, the Engine can handle it: when the reader workers ask for the log file locations to monitor, each related host (reader worker) will be notified.
For example: if the Hermes web system is assigned to an error detector then the workers on both the Hermes web 0 and Hermes web 1 hosts will get the log file locations to monitor.
Maintenance window Time periods can be defined in which the execution flows are allowed to run; if an incident occurs outside the window then the flow will be scheduled for the window.
For example: the maintenance window is 22:30-05:30 and the incident is reported at 12:30, so the flow will be scheduled to run at 22:30 (the flow can be forced to run immediately if the confirmation needed option is set on the error detector and the 'confirmed and forced started' option is chosen on the web GUI during confirmation).
You can enter date ranges like 20:00-23:00, 23:30-06:00. Multiple date ranges can be set, separated by commas. The date ranges can overflow into the following day, e.g. Mon: 23:00-04:00. If no maintenance window is set for a day then no execution flow can run on that day. If you set 00:00-24:00 for a day then the window is open for the whole of that day. 19:00-23:59 means that the maintenance window is open between 7PM and the end of the day (i.e. it is open at 23:59:59,999 too).
WARNING! If you don't specify any date range (for any of the days) then the window is open all day, every day!
If you define the time ranges 23:00-02:00, 04:00-05:30 for Monday, the maintenance window will be open on Monday morning between 04:00 and 05:30, on Monday night between 23:00 and 24:00, and on Tuesday morning between 00:00 and 02:00.

2.2.3.2. Execution flow

It is a series of tasks that are executed one after another. Operating system commands can be executed or emails can be sent; the flow can contain branching (if-else conditions) or can be terminated.

Name the name of the execution flow
When creating a new version of the flow by copying it, it is recommended to add the version number at the end. For example: Restart Hermes servers - v2
Period between executions how many seconds the engine has to wait before running the same flow again
Sometimes it is not advisable to execute the same flow (i.e. restart a server) every minute...
Approximate execution time the approximate execution time of the flow
If it is a long-running flow then it shouldn't be executed just before the end of the maintenance window (e.g. the execution time is 60 min, the maintenance window is between 19:00 and 23:30 and the event occurred at 22:50, so it won't be executed). If it cannot be executed on a given day then it will be rescheduled.
If an event arrives at 02:40 and the maintenance window is open between 01:00 and 03:00, then theoretically the flow could be executed; but its execution time is 30 min, so it wouldn't finish by 03:00, and therefore the flow will be scheduled.
Access group groups of users can be set who can monitor the execution flow, change it, etc.
For example: the execution flow Restart the Weblogic app server won't be interesting for DBAs or UNIX specialists, or they may not be permitted to see it at all! All the groups can be specified in the config file of the web GUI and they can be attached to the users on the user admin page ([host]:[port]/[context root]/admin) of the web GUI.
Users in other groups won't be able to see the flow in monitoring, statistics, etc. in the web GUI.
If an error occurs then send mail to the following email address(es) If an unexpected error occurs then a mail will be sent to these email addresses.
Multiple email addresses can be specified, separated by commas. If the field is left empty then mail will be sent to all the users that share the access groups assigned to the execution flow (e.g. if the DBA access group is assigned to flow X then all the users in the DBA group in the Django authentication will receive mail).
If the flow is started (automatically) by a log event then send mail to the following email address(es) If a flow is started by an incident reported by the reader worker then mail is sent to these addresses.
The same rules apply here as above.

2.2.3.3. Execution flow task

The image below shows a sample execution flow that contains external (OS) commands (green rectangles), mail sending tasks (gray circles), if-else conditions (yellow diamonds) and failure tasks (the red ... something :) ). It is a detailed flow of restarting a Weblogic AdminServer.

Every task must have a name. The names mustn't be the same on the same level.

There are 4 types of tasks:

1. external command execution

This task is for executing an OS (operating system) command on a host where the specific executor worker runs. The command execution was tested on Linux and on Windows.

Command   the command that will be executed on the host
Any command can be specified that the operating system can execute. For example:
. /local/wls12213/user_projects/domains/base_domain/bin/setDomainEnv.sh && echo "nmConnect('weblogic','weblogic','localhost','5556','base_domain','/local/wls12213/user_projects/domains/base_domain','plain')" | java weblogic.WLST  (This command will set the environment variables of the Weblogic domain and will connect to the nodemanager)
Please be aware that on Windows, if the executed command doesn't give back control (e.g. the command starts an application server which keeps waiting for requests, so it never ends) then the command won't finish, i.e. the flow won't fail but won't finish either; it will be stuck. The recommendation is to use a Windows service in this case.
If a Linux shell built-in command has to be used (like [ -d /tmp/reactionflow ] && echo ...) then please always specify the OS user (even if it is not different from the user who runs the worker, e.g. the root user)! If the OS user is set then the command will be executed with bash -c, so the shell built-in will be interpreted correctly. Also, if you want to execute a nested command (with the back-quote character, like echo "Current time:`date +%Y%m%d%H%M%S`") then it is also recommended to use bash -c (i.e. to specify the OS user).
MANDATORY
OS user the operating system (OS) user that will run the command (used on Linux only)
The executor worker runs as root, so by default the root user will execute the command set above. If we want the command to be executed by another OS user then it can be set here.
Please be aware that the root user must be able to log in as the specific user seamlessly (e.g. root -> [other user]). If a user is set then the following OS command will be executed:
    sudo -u [user] bash -c "[command]"
Host the host where the command has to be executed
It can be any text but it has to be equal to the host name defined in the config file of the worker (2.1.6. Configuration file -> host_name)
MANDATORY
Output pattern the pattern that is used to extract a value from the output of the external command; the extracted value is then used to evaluate the subsequent if-else condition
Usually every command has an output, like
    ...
    [sql] Executing resource: /home/build/generated/sql/oracle/gen_create_synonyms.sql
    [sql] Executing resource: /home/build/generated/sql/oracle/gen_create_synonyms_customized_views.sql
    [sql] 257 of 257 SQL statements executed successfully
If we want logic like "if the SQL command execution was successful then execute another command, otherwise send a mail to a user", then
  1. define an output pattern like \[sql\] 257 of 257 SQL statements executed (?<VALUETOBEEXTRACTED>[a-z]*)
  2. add an if-else condition right after this external command, set the condition to equal to successfully, and add the next external command to the TRUE branch and the mail sending to the FALSE branch
A regular expression can be used, but it has to contain the named capturing group (?<VALUETOBEEXTRACTED>[a-zA-Z ]*) -> the desired pattern has to be set between the brackets.
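
A sketch of how such an extraction behaves (illustrative code, not the executor's actual implementation; as described in 2.1.5, the last match wins):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: apply the output pattern to every line of the command output and
    // keep the last match as the extracted value (see 2.1.5).
    public class OutputPatternExample {
        public static void main(String[] args) {
            Pattern p = Pattern.compile(
                "\\[sql\\] 257 of 257 SQL statements executed (?<VALUETOBEEXTRACTED>[a-z]*)");
            String[] outputLines = {
                "[sql] Executing resource: /home/build/generated/sql/oracle/gen_create_synonyms.sql",
                "[sql] 257 of 257 SQL statements executed successfully"
            };
            String extracted = null;
            for (String line : outputLines) {
                Matcher m = p.matcher(line);
                if (m.find()) {
                    extracted = m.group("VALUETOBEEXTRACTED");  // last match wins
                }
            }
            System.out.println(extracted);   // prints: successfully
        }
    }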

2. if-else condition

Evaluating an if-else condition.
The value used by the if-else is provided by a preceding external command, so in front of every if-else there must be an external command with an output pattern defined.

Condition it can be one of the following: =, !=, <, <=, >, >=
MANDATORY
Value the value that the extracted output value of the external command will be compared against
It can be indicated if the value is a number.
MANDATORY

3. mail sending task

Sending a mail.

Recipients the recipients of the mail separated by comma
MANDATORY
Subject the subject of the mail
MANDATORY
Content the content of the mail
HTML content can be sent too
MANDATORY

4. failure task

The flow will be interrupted and marked as FAILED.

2.2.3.4. Error detector

In the error detector you can define where to search (system, i.e. log file location), what to search for (the message pattern, which is a regular expression) and what to do if it is found (execution flow).

For example: there is a memory leak in the Hermes backend application (which is deployed to Hermes backend 0 and Hermes backend 1) which results in the java.lang.OutOfMemoryError: PermGen space error message.
So define an error detector with a pattern like .*java.lang.OutOfMemoryError: PermGen space.*, assign the Hermes backend system to it (note that Hermes backend is assigned, not Hermes backend 0 or Hermes backend 1! This way only one error detector has to be specified, yet the reader workers on the hosts of both Hermes backend 0 and Hermes backend 1 will get the log file location to monitor.) and choose the execution flow to be executed.

Name the name of the error detector
Message pattern regular expression that the reader worker will use to check against the log file line by line
The number of events needed to start the execution Sometimes one event is not enough to start the flow. Let's say there is a NullPointerException in the log: we don't get scared if we see one, but if a 2nd or 3rd arrives within a specific timeframe then we might have a situation.
Here we can define how many events should be reported before doing anything (see the sketch after this list).
Timeframe while the events have to arrive the timeframe that the events (their number is defined above) have to arrive within
For example: to start the execution flow, 3 incidents have to be reported by the reader within 15 mins.
Confirmation needed If manual confirmation (approval) by a user on the web GUI is needed before starting the flow then set it to true.
Activated An error detector is only taken into account for reporting incidents if it is active. If it is not active then the log file location and the message pattern assigned to it won't be sent to the reader worker(s).
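
The threshold logic can be sketched as follows (illustrative; the engine's actual implementation may differ):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch: start the flow only when `threshold` events arrived within `timeframe`.
    public class EventThreshold {
        private final int threshold;              // "number of events needed"
        private final Duration timeframe;         // "timeframe while the events have to arrive"
        private final Deque<Instant> recent = new ArrayDeque<>();

        public EventThreshold(int threshold, Duration timeframe) {
            this.threshold = threshold;
            this.timeframe = timeframe;
        }

        public synchronized boolean report(Instant eventTime) {
            recent.addLast(eventTime);
            // drop events that fell out of the timeframe
            while (!recent.isEmpty() && recent.peekFirst().isBefore(eventTime.minus(timeframe))) {
                recent.removeFirst();
            }
            return recent.size() >= threshold;    // true -> the execution flow may start
        }
    }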

2.2.4. Architecture

The Reaction engine is a Java web application, tested on Tomcat 8, Wildfly 10 and Weblogic 12 (separate WAR files are shipped for each application server).
It has a REST interface to the workers (see above) and to the web GUI (both need HMAC authentication, and secure communication can be switched on for the worker). It also communicates with the database via JDBC.
In case of heavy load the engine can be clustered, or more instances can simply be used behind a load balancer; the logic is prepared to handle multiple instances.

2.2.5. How to install

As mentioned above, the engine is tested on Tomcat 8, Wildfly 10 and Weblogic 12, which means that there are separate WAR files to be deployed for these application servers:

The application server has to support the Servlet 3.0 specification. It needs JDK8.
Please be aware that in the download section a Dockerfile can be found that contains the specific Linux commands to install the engine!

Before deploying the file, the following 3 system properties must exist in the JVM of the application server (spring.profiles.active, spring.config.location and reaction.logback.config; see the sample values below):

In Weblogic these system properties have to be added to the managed server where the engine is deployed ([managed server] -> Configuration -> Server Start -> Arguments).
sample values: -Dspring.profiles.active=commonjWorkmanager -Dspring.config.location=/local/reaction/reaction-engine/reaction-engine-application.yml -Dreaction.logback.config=/local/reaction/reaction-engine/logback-include.xml

For Wildfly please see here

For Tomcat add them to setenv.sh / setenv.bat.
For example: CATALINA_OPTS="-Dspring.profiles.active=threadPool -Dspring.config.location=/local/reaction/reaction-engine/conf/reaction-engine-application.yml -Dreaction.logback.config=/local/reaction/reaction-engine/conf/logback-include.xml"
Don't forget to add execute permission to setenv.sh if it is a new file.

The credentials file (its location has to be specified in the reaction.security.credentials_file property of the engine config file) has to be set up correctly before deploying the application.
Sample content of the file:
localhost=f0dedb78-3eb6-4a56-8428-e8e40584a01c
reaction-management-web-app=e5574bf1-13c5-476a-b1d3-500bc640564d

The default URL for Weblogic and Wildfly is [host]:[port]/reaction-engine. In order to have the same URL on Tomcat, I recommend renaming the WAR file to 'reaction-engine.war' before deploying it.

2.2.6. Database

The database design and building details will be discussed in 2.3.3.

2.2.7. Configuration

The Reaction engine has 2 files that can be configured: the engine configuration file (reaction-engine-application.yml) and the logging configuration file (logback-include.xml).

2.2.8. Status

If the Reaction Engine is deployed to application server and the application server runs then the status can be checked by opening the http://[host]:[port]/reaction-engine/status URL (e.g. http://localhost:7003/reaction-engine/status).
It just displays a simple HTML page.


2.3. Management web GUI

2.3.1. What does it do?

The web GUI provides the following functions:

2.3.2. How to install?

The web GUI is a Python-Django application, which is recommended to be used with Apache.

Generally the following steps have to be done to install the web GUI to Apache:

  1. install Apache or use the existing httpd
  2. install python3 (it is recommended to create a virtual environment)
  3. install Django and the required dependencies (first activate the virtual environment, if one was created)
  4. install the database client
  5. install mod_wsgi to Apache
  6. copy the source of the management app to the host of Apache
  7. configure the web GUI
  8. initialise the database (create the tables and load the initial data)
  9. create a superuser to be able to log in to the user management and create the users there

Please be aware that in the download section a Dockerfile can be found that contains the specific Linux commands to install the management GUI!
I will describe a possible scenario: installing Apache with mod_wsgi and deploying the application on Red Hat 7 with the Oracle client, using the existing httpd; python 2 is the default on this Linux, so python 3 has to be installed and a virtual environment used:

  1. install Apache (http://httpd.apache.org/docs/2.4/install.html) or use the existing httpd (I used the existing httpd on Red Hat 7)
  2. install python 3
    I used the following page to install python 3: https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-centos-7
    1. install python 3.6
      sudo yum -y install python36u
      sudo yum -y install python36u-devel
    2. install pip
      sudo yum -y install python36u-pip
    3. create a virtual environment (I created it in /local/reaction/environments)
      cd /local/reaction/environments
      python3.6 -m venv venv
    4. activate it
      after activation the python command can be used and it will point to python3.6 and not to the default python2; also, all python packages will be installed into the virtual environment and not into the default python2 (or into the installed python3.6!), so these packages won't interfere with the existing python installations and programs
      source /local/reaction/environments/venv/bin/activate
    5. after the activation of the virtual environments install django and dependencies
      pip install -r requirements.txt

      please be aware that requirements.txt contains both the Oracle and the mysql dependency; please remove the one that is not needed (Oracle: cx-Oracle, mysql: mysqlclient)
  3. install mod_wsgi
    more information: http://modwsgi.readthedocs.io/en/develop/user-guides/quick-installation-guide.html
    1. download it : https://github.com/GrahamDumpleton/mod_wsgi/archive/4.5.20.tar.gz
    2. setup the package ready for building
      if virtual environment is used then first activate it (e.g. source /local/reaction/environments/venv/bin/activate) and then execute the following command in the folder where the source was unzipped to
      ./configure --with-python=/local/reaction/environments/venv/bin/python
      if not the default httpd is used then the apache folder can be set too with --with-apxs=...
    3. build and install the package
      make
      make install
  4. install database client (Oracle)
    1. install Oracle client first
      Download from: http://www.oracle.com/technetwork/database/features/instant-client/index.html
      the Basic Light package is enough if only English language is needed
      Installation instructions: https://oracle.github.io/odpi/doc/installation.html#linux
      If the python is 32-bit then it will install a 32-bit cx_Oracle, so the Oracle Client has to be 32-bit too, even if the machine / OS is 64-bit.
      To find out whether the python is 32- or 64-bit, execute the following commands in the python shell (make sure the virtual env is activated, then just type python in the Linux command shell):
          import struct
          print(struct.calcsize("P") * 8)
    2. install the python Oracle package
      see more information: https://oracle.github.io/python-cx_Oracle/
      pip install cx-Oracle
      on Linux please set the LD_LIBRARY_PATH system variable!
      export LD_LIBRARY_PATH=/usr/lib/oracle/12.2/client64/lib:$LD_LIBRARY_PATH
    3. due to an error I had to create a link
      cd /usr/lib/oracle/12.2/client64/lib/
      ln libclntsh.so.12.1 libclntsh.so
    4. set up the settings_XXXX.py config file (the value of XXXX depends on what the value of REACTION_ENVIRONMENT will be, see later)
      For example:
           ...
          DATABASES = {
              'default': {
                  'NAME': 'xe',
                  'ENGINE': 'django.db.backends.oracle',
                  'USER': 'reaction',
                  'PASSWORD': 'reaction',
                  'HOST': 'localhost',
                  'PORT': '1521',
              }
          }
          ...
  5. configure httpd
    1. Create a config file (django.conf) in /etc/httpd/conf.d with the following content
      LoadModule wsgi_module modules/mod_wsgi.so

      Alias /${REACTION_SUBDOMAIN}/static /local/reaction/management-app/static
      <Directory /local/reaction/management-app/static>
        Allow from all
        Require all granted
      </Directory>
      <Directory /local/reaction/management-app/management-app>
        <Files wsgi.py>
          Require all granted
        </Files>
      </Directory>

      WSGIDaemonProcess management-app python-path=/local/reaction/management-app/management-app:/local/reaction/environments/venv/lib/python3.6/site-packages
      WSGIProcessGroup management-app
      WSGIScriptAlias /${REACTION_SUBDOMAIN} /local/reaction/management-app/management-app/wsgi.py process-group=management-app
      PassEnv LD_LIBRARY_PATH
      PassEnv REACTION_SUBDOMAIN
      PassEnv REACTION_ENVIRONMENT
    2. set the LD_LIBRARY_PATH, REACTION_ENVIRONMENT and REACTION_SUBDOMAIN system variables
      I added them to /etc/sysconfig/httpd : 
      LD_LIBRARY_PATH=/usr/lib/oracle/12.2/client64/lib:$LD_LIBRARY_PATH
      REACTION_ENVIRONMENT=local (specify which environment you are on; if it is in production then use the value 'production' but be aware that the settings_production.py file must exist in [APP_ROOT]/management-app; for example /local/reaction/management-app/management-app; possible values: local, development, test, production)
      REACTION_SUBDOMAIN=reaction-management (the application can be accessed on http://localhost/reaction-management)
  6. configure the management app web GUI
    1. download and unzip the source of management app to the host of Apache (I stored it to /local/reaction/management-app)
    2. configure the settings_[$REACTION_ENVIRONMENT].py config file (DATABASES, REACTION_ENGINE_REST_URL, etc.)
    3. set the log location to a folder where there is write permission (management-app/settings.py -> LOGGING / handlers / filename)
    4. after having created the database (see 2.3.3. Database) create a superuser to be able to log in to the user management and create the users in the user management
      1. First run the following command (activate the virtual environment first) to create a superuser that can be used to log in to the user management app (http://localhost/reaction-management/admin)
          python manage.py createsuperuser
      2. Then in the user management application you can create users (with permissions) to use the management app web GUI.
      Be aware: if you log in to the management app web GUI with the superuser and just open the reaction management app then a 'User has no profile.' error will occur => log out from the Django admin first

2.3.3. Database


Python-Django commands can be used to create / alter the database schema. The reason is that Django can handle many different database types, so there is no need to provide different SQL scripts per database type. Also, Django can handle any change in the database schema without needing to recreate the whole database.
However, if needed, I am more than happy to provide an SQL script that builds the schema and loads the initial data; please open an issue at https://bitbucket.org/ric_flair_wcw/reaction/issues

In the following I provide the steps to create the database schema.
For mysql please create the database with the latin1 character set. If you need a character set other than latin1 (e.g. utf8mb4) then the length of the CharField fields in the management_app/.../models.py files mustn't be higher than 191 (otherwise you get the Specified key was too long; max key length is 767 bytes error)! Another workaround (aside from decreasing the length of the CharField fields) is to set innodb_large_prefix to true. And the 3rd workaround is to raise a ticket at https://bitbucket.org/ric_flair_wcw/reaction/issues

  1. create a user in the database and configure the settings (the DATABASES variable in the settings_[$REACTION_ENVIRONMENT].py config file)
  2. go to [APP_ROOT]/management-app
    for example: cd /local/reaction/management-app
  3. set the REACTION_ENVIRONMENT system variable based on the environment that has to be used
    for example: export REACTION_ENVIRONMENT=production
    Hint: the same value of REACTION_ENVIRONMENT has to be used that is set for Apache; in my case it is set in the /etc/sysconfig/httpd file
  4. make migrations
    before doing so the virtual environment has to be activated (e.g. source /local/reaction/environments/venv/bin/activate); then execute the following command in the management_app folder
    python manage.py makemigrations admin_system admin_execution_flow admin_errordetector monitoring scheduler common worker_status
    (only for Oracle: before making the migrations I had to set the LD_LIBRARY_PATH system variable to the Oracle client, like: export LD_LIBRARY_PATH=/usr/lib/oracle/12.2/client64/lib:$LD_LIBRARY_PATH)
  5. migrate the changes to the database
    execute the following command in the management_app folder
    python manage.py migrate
  6. execute reaction.sql

It is important not to remove the 'migrations' folder that is created during the migration! If the database needs to be changed later then these migrations will be used to apply only the changes to the database.

2.3.4. Modules

The management web GUI has the following parts:

There are common characteristics of the web pages like

2.3.4.1. Log in

Before using the application the user has to log in with his / her credentials (user name and password). If the user forgets the password then it can be reset by clicking on the Lost password? link.

The user is redirected to this page if the session has expired. After a successful login he / she will be forwarded to the same page they were on before.

2.3.4.2. Monitoring

The events (started, failed, finished, etc. execution flows) can be monitored here.

List

The drop-down list (next to the Event label) is for filtering the events by start date. The labels are self-explanatory; when choosing Started between, a from and a to date can be defined for the filtering.

Below this drop-down list there is another one, for defining whether future (scheduled) events should be displayed.

By clicking on the Export button the events can be exported to a CSV file. It can be set whether the CSV file should contain the event life records (which are basically the statuses of the tasks of the execution flow).

Auto refresh can be turned on: the page will be refreshed every 6 seconds, so the progress of the flow can be followed.

At the end of every line there is a Details link which navigates the user to the Details page.

Details

At the top, the details of the event can be found.

In the middle are the event life records. Here the output of the external command can be displayed (by clicking on the Show the full output button) or the extracted value (by clicking on Show the full extracted value), which might be used in the next if-else condition.

At the end of the line of the event life records the following icons might be seen:

At the bottom the execution flow is displayed; the tasks that weren't executed (due to the evaluation of the if-else conditions) are pale.

2.3.4.3. Administration - system

List

In the system list page the existing systems can be viewed.

A hierarchy of systems can be built, so not all the systems can be seen here, only those on the highest level (or on whichever level we are currently).
Above the list (in the middle) the current path of the system is displayed, and it can be used to navigate.

Also a tree list is provided where all the systems are displayed in a tree. At the end of every system line there is a Jump to the children link which can be used to jump to the children list of the system. If the system's type is Group or Application then the Jump to the children of the system link appears next to it. By clicking on the Edit the system button the edit page can be reached.

Edit

An existing system can be edited or a new one can be created.

It is important to note that if the current system's type is Log file then the following properties (the name and type properties are mandatory for every system) have to be specified on this system or on one of its parents:

The log header pattern can be built by clicking on the Build button (if the log doesn't contain a header then you don't have to specify the log header pattern).

First copy a part of the log that contains the log header (it can have more lines) and click at the end of the header in the text. The header will be highlighted (selected) automatically from the beginning of the line. Then please click on the Next button.
On the next page the selected text can be seen at the top and the pattern can be defined by clicking on the buttons. First select with the mouse the text where the date / loglevel / unknown value is, then click on the specific button. The text will be replaced with that pattern field.

For example: you selected the following text from the log as the header
2017-11-24 08:49:24,166 [DEBUG] root:
First type the date format in the textfield next to the Date pattern label (for example: yyyy-MM-dd HH:mm:ss,SSS), then select the text 2017-11-24 08:49:24,166 at the top and click on the DATE button. The text will be replaced with [~DATE:yyyy-MM-dd HH:mm:ss,SSS] [DEBUG] root: and the pattern text will be highlighted.
Then select the text DEBUG and click on the LOGLEVEL button.
Finally, type a-zA-Z 0-9 in the 'accepted characters' textfield (next to the UNKNOWN button), select the text root at the top and click on the UNKNOWN button. (The text here is root, so the pattern a-z would be enough, but you have to be careful as the field may contain other values too, with numbers, spaces, etc. Please examine the log file before adding the pattern text!) The end result will be
[~DATE:yyyy-MM-dd HH:mm:ss,SSS] [[~LOGLEVEL]] [~UNKNOWN:a-zA-Z 0-9]:

Note: if you are not sure whether one or more spaces can occur between fields (e.g. in [DEBUG] root: there is one space between '[DEBUG]' and 'root:' now, but could there be more?) then please select that one space and define it as an UNKNOWN field with a ' ' pattern. It means that the field can contain 1 or more spaces.

The created pattern can be checked to see whether it extracts the values from a real log entry correctly.
You have to enter a real log entry (like 2017-11-24 08:49:24,166 [DEBUG] root: get_queryset - admin_system.views| filter :{'parent': '2'}, order: ['name', 'id']) and by clicking on the Check button the extracted values will be displayed if the pattern is good (the output will look like: The following values were found: DATE: 2017-11-24T08:49:24.166   LOG LEVEL: DEBUG   UNKNOWN: root).

2.3.4.4. Administration - execution flow

List

In the execution flow list page the existing execution flows can be viewed.

By clicking the Copy the execution flow button (which is not displayed for invalid flows) the flow can be copied with all its tasks. With this function it is easy to create a new version of an existing flow.

Edit

An existing execution flow can be edited or a new one can be created.

On the top part of the page the fields of the execution flow can be edited / added (see at the reference data / execution flow).

At the bottom the execution flow can be edited / created. It basically means that new tasks can be added to the flow, an existing one can be altered or deleted.
It is important to note that when creating a new flow, the flow has to be saved first before any task can be added to it.

If the execution flow has already been in use then it has history (i.e. saved events that can be monitored), so it cannot be edited (otherwise the history couldn't be reviewed correctly on the monitoring page - e.g. if a task were deleted then the new flow would contain fewer tasks). In this case the execution flow is in read-only mode (FROZEN status).
However a new version of it can be made by clicking on the Copy button in the List view. If the history is not needed and the current flow has to be modified, then click on the Unfreeze it by deleting its history button: the full history of this flow will be deleted and the flow can be edited again (please be careful with this operation).

There is one rule that applies to every flow: before every if-else condition an external task command has to be executed (to get the value that the if-else evaluates). If this rule is broken then the flow becomes invalid (INVALID status): it cannot be copied and it cannot be used in an error detector.

When creating a new flow only the Start task can be seen. It is just the beginning point of the flow; it doesn't do anything. When clicking on the Start button a window will appear where the data of a new external task can be added; the task will be inserted after the Start (i.e. it will be the first task).
All the other actions (like adding a new task, etc.) can be initiated by clicking on an existing task. The window that appears looks as follows:

The top part of the window contains the data of the task that varies depending on the task type (see reference data / execution flow task).

In the bottom part operations can be done with a task:

 

2.3.4.5. Administration - error detector

List

On the list page the existing error detectors can be viewed.

Edit

An existing error detector can be edited or a new one can be created.

2.3.4.6. Scheduler

List

The scheduled execution flows can be viewed on this page.

First the scheduled execution flow (which is basically an execution flow and a crontab expression) has to be created, but by default it won't be scheduled. It can be scheduled by clicking on the Schedule it! button.
If the currently running instance of the scheduled flow ran successfully then it will be rescheduled based on the crontab expression. If it didn't run successfully then the scheduled task will be descheduled (i.e. it won't be rescheduled). It can be scheduled again by clicking on the button again.

The started instance of the scheduled execution flow can be monitored on the Monitoring page by filtering on the Scheduler column.

Edit

An existing scheduled execution flow can be edited or a new one can be created.

After being saved, the crontab expression gets a textual description and the next possible run date is displayed too. The next run date is calculated in the Reaction Engine (i.e. it is a REST call to it), so if the Engine is not running then an error message is shown here (like The server cannot be called! 500(Internal Server Error)...).
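
For illustration, the semantics of the next-run calculation can be sketched in Python with the third-party croniter package (an assumption-based sketch; the Engine's actual implementation is not this code):

from datetime import datetime
from croniter import croniter  # third-party package, illustration only

expression = '0 2 * * *'              # sample crontab expression: every day at 02:00
base = datetime(2017, 11, 24, 8, 49)  # 'now'
# the next run after 2017-11-24 08:49 is 2017-11-25 02:00:00
print(croniter(expression, base).get_next(datetime))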

2.3.4.7. Executor

An execution flow can be started immediately (select the Now option) or at a specific time in the future (select the on the following date option and choose a datetime).
A reason can be added (not mandatory) to explain why it was needed.

Immediate feedback is given about the start, and in case of a successful start the identifier of the started flow is displayed (it is a link).

2.3.4.8. Approval

List

The execution flows that have to be manually approved (by the user) before they are started are listed here (when defining an error detector it can be set whether the flow has to be manually approved before starting it).

Approval

The flow can be approved or rejected by locating the event in the list view and clicking on the Details link.

The top part of the page contains the main data of the event.
At the bottom the flow can be seen.

There are 4 buttons in the top right corner:

Please be aware that the flow will be rescheduled to the next available timeslot if, given the execution time of the flow, it doesn't fit in the current one (e.g. an event arrives at 02:40 and the maintenance window is open between 01:00 and 03:00, so theoretically the flow could be executed, but the execution time of the flow is 30 min so it wouldn't finish by 03:00).
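
The rule can be illustrated with a small hypothetical helper (the function and its names are made up for this example; this is not Reaction's code):

from datetime import datetime, timedelta

def fits_in_window(now, window_start, window_end, execution_time):
    # the flow fits only if it can start now AND finish inside the window
    return window_start <= now <= window_end and now + execution_time <= window_end

now = datetime(2017, 11, 24, 2, 40)
window_start = datetime(2017, 11, 24, 1, 0)
window_end = datetime(2017, 11, 24, 3, 0)
# False: 02:40 + 30 min = 03:10 > 03:00, so the flow is rescheduled
print(fits_in_window(now, window_start, window_end, timedelta(minutes=30)))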

2.3.4.9. Statistics

Incidents by flow

Statistics can be collected per flow.

You can review how many events

The status of the event can be displayed as a separate column on the diagram (optional).
The timeframe can be minute, hour, day or month: the events will be summarised by the chosen timeframe and displayed.

It is possible to zoom in on the diagram by selecting a rectangle with the mouse.

Incidents by system

Statistics can be collected per system.

2.3.4.10. Worker status

Whenever a reader worker refreshes its system list or reports an incident, or an executor worker gets the commands to be executed or sends back the output/result of an executed command, the Engine stores the timestamp of this last activity.

On this page all the workers that have ever communicated with the Engine can be seen, and it can be checked when the last communication happened. Possible problems with the workers can be spotted this way.
The reader's activity is considered too old (i.e. it likely stopped on the host machine or is not able to communicate, so it is highlighted in red) if it last fetched the system list more than 24 hours ago. Similarly the executor is too old if no new REST call was sent to get the commands to be executed in the last 2 hours.
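
The staleness check can be sketched as follows (a hypothetical illustration of the rule above, not the Engine's code):

from datetime import datetime, timedelta

READER_MAX_AGE = timedelta(hours=24)   # reader: last fetch of the system list
EXECUTOR_MAX_AGE = timedelta(hours=2)  # executor: last poll for commands to execute

def is_too_old(last_activity, max_age, now=None):
    # a worker is highlighted in red if its last activity is older than max_age
    return ((now or datetime.now()) - last_activity) > max_age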

2.3.4.11. User management

The users of the web management GUI can be administered at the following link: http://[host]:[port]/[context root]/admin (for example: http://localhost/reaction-management/admin/ if a web server is used on port 80).
It is important to note that this admin application is part of the Django framework, not of the Reaction application. Some minor changes are possible but no big alterations can be made.

The users can be added to groups (not detailed).

Permissions can be added and removed. It is important to make sure that the following permissions exist (these permissions were inserted during 2.3.3. Database - step 5):

If the database was initialised properly then this page doesn't have to be used.

The users can be managed on the Users page. If a new user has to be created then click on the Add User button in the top right corner. To edit an existing user just click on his/her name. In order to delete one or more users, first select them in the table, then select the Delete selected users option and click on the Go button.
On the user detail page the following fields have to be filled:

Make sure the user is active; the management web GUI's users shouldn't be staff users or superusers.

2.3.5. Config file

The configuration file resides in the management_app folder. The application supports a different configuration file per environment, e.g. a separate file for the local environment (when the management app runs locally on your machine), for development (when it is deployed to the development server), etc. The environment can be specified by setting the REACTION_ENVIRONMENT system variable (see 2.3.2.).
The following environments are supported at the moment: local, development, test, production. If more environments have to be supported then please edit the management_app/settings.py file (at the end of the Python file).
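
The environment-specific selection at the end of settings.py presumably looks something like the sketch below (illustrative; the settings_test.py / settings_development.py file names are assumed by analogy with settings_production.py, please check the actual file before editing):

import os

# load the environment-specific settings module on top of the defaults
environment = os.environ.get('REACTION_ENVIRONMENT', 'local')
if environment == 'production':
    from settings_production import *
elif environment == 'test':
    from settings_test import *
elif environment == 'development':
    from settings_development import *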

If you want to deploy the application to production then set REACTION_ENVIRONMENT=production; the settings.py and settings_production.py files must exist in the management_app folder.
The settings_production.py file contains those configurations which are environment-dependent, so they have to be altered:

LOGGING

Specifying the logging configuration
formatters/standard/format - the log formatter
It might be worth specifying different formatters for different environments.
For example:
    for local
    %(asctime)s [%(levelname)s] %(pathname)s/%(name)s.%(funcName)s:%(lineno)s --- %(message)s
    2017-11-28 08:32:44,922 [DEBUG] C:\work\reaction\src\reaction\management_app\executor\views.py/root.index:18 --- Starting a flow / GET is called
or
    for production
    %(asctime)s [%(levelname)s] %(name)s: %(message)s
    2017-11-28 08:33:41,850 [DEBUG] root: Starting a flow / GET is called
See more info: https://docs.python.org/3/library/logging.html#logrecord-attributes

handlers/file - the configuration of the file log handler
you can set the log level, the file handler type, the maximum size of the log file, the filename and the formatter

handlers/console - the configuration of the console log handler

loggers/root - specifying which handler is active and what the default log level is
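
A minimal sketch of such a LOGGING dictionary, following the keys described above (the concrete values are illustrative; check the shipped settings file for the real defaults):

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'standard': {
            # the production-style formatter from the example above
            'format': '%(asctime)s [%(levelname)s] %(name)s: %(message)s',
        },
    },
    'handlers': {
        'file': {
            'level': 'DEBUG',
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': 'reaction-management.log',  # assumed file name
            'maxBytes': 10 * 1024 * 1024,           # rotate at ~10 MB (assumed)
            'backupCount': 5,
            'formatter': 'standard',
        },
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
            'formatter': 'standard',
        },
    },
    'loggers': {
        'root': {
            'handlers': ['file', 'console'],
            'level': 'DEBUG',
        },
    },
}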

DEBUG A boolean that turns on/off debug mode.
Never deploy the management app into production with DEBUG turned on. One of the main features of debug mode is the display of detailed error pages.
Possible values: False / True
ALLOWED_HOSTS A list of strings representing the host/domain names that this Django site can serve. This is a security measure to prevent HTTP Host header attacks, which are possible even under many seemingly-safe web server configurations.
More info: https://docs.djangoproject.com/en/1.11/ref/settings/#allowed-hosts
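Sample values (illustrative):
DEBUG = False
ALLOWED_HOSTS = ['localhost', '10.20.213.149']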
DATABASES

Specifying the database connection details
NAME: the name of the database
ENGINE: the database backend to use
   for example: django.db.backends.mysql or django.db.backends.oracle
USER: the database user to connect with
PASSWORD: the password of the database user
HOST: the host where the database resides
PORT: the database's port
OPTIONS: extra parameters to use when connecting to the database. For example, with MySQL the autocommit parameter can be set.

Sample:
DATABASES = {
    'default': {
        'NAME': 'reactionstore',
        'ENGINE': 'django.db.backends.mysql',
        'USER': 'reaction',
        'PASSWORD': 'reaction',
        'HOST': 'localhost',
        'PORT': '3306',
        'OPTIONS': {
            'autocommit': False,
        },
    }
}

or

DATABASES = {
    'default': {
        'NAME': 'xe',
        'ENGINE': 'django.db.backends.oracle',
        'USER': 'reaction',
        'PASSWORD': 'reaction',
        'HOST': '127.0.0.1',
        'PORT': '1521',
    }
}

EMAIL_HOST Mail server host. More info: https://docs.djangoproject.com/en/1.11/ref/settings/#email-host
EMAIL_HOST_USER Username to use for the SMTP server defined in EMAIL_HOST. If empty, Django won't attempt authentication. More info: https://docs.djangoproject.com/en/1.11/ref/settings/#email-host-user
EMAIL_HOST_PASSWORD Password to use for the SMTP server defined in EMAIL_HOST. This setting is used in conjunction with EMAIL_HOST_USER when authenticating to the SMTP server. If either of these settings is empty, Django won't attempt authentication. More info: https://docs.djangoproject.com/en/1.11/ref/settings/#email-host-password
EMAIL_PORT Port to use for the SMTP server defined in EMAIL_HOST. More info: https://docs.djangoproject.com/en/1.11/ref/settings/#email-port
EMAIL_USE_TLS Whether to use a TLS (secure) connection when talking to the SMTP server. More info: https://docs.djangoproject.com/en/1.11/ref/settings/#email-use-tls
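Sample values (illustrative; substitute your own SMTP server's details):
EMAIL_HOST = 'smtp.example.com'
EMAIL_HOST_USER = 'reaction'
EMAIL_HOST_PASSWORD = 'secret'
EMAIL_PORT = 587
EMAIL_USE_TLS = True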
REACTION_ENGINE_REST_URL The endpoint URL of the Reaction Engine
Sample value: http://10.20.213.149:7003/reaction-engine
ACCESS_GROUPS All the access groups which control which users (depending on the access groups assigned to them) can see which execution flows (depending on the access groups assigned to them) on the Monitoring page or on the Execution flow administration page.
All the access groups that might be used have to be listed here.
WARNING! The name of a group MUSTN'T contain a comma (,) and mustn't start or end with a space! It is recommended to use only letters, numbers and spaces in the name.
Sample value:
ACCESS_GROUPS = [
    'Middleware',
    'DBA',
    'UNIX',
    'Microsoft Technologies',
]
TIME_ZONE A string representing the time zone for datetimes stored in this database.
Sample value:
TIME_ZONE = 'Europe/Budapest'
REACTION_REST_AUTH_PUBLIC_KEY
REACTION_REST_AUTH_PRIVATE_KEY
Public/private key (username/password) for authenticating the requests that are sent to the Reaction Engine REST interface. The same username/password pair has to be present in the credentials file of the Reaction Engine (see 2.2.7 -> reaction.security.credentials_file).
Sample value:
REACTION_REST_AUTH_PUBLIC_KEY = 'reaction-management-web-app'
REACTION_REST_AUTH_PRIVATE_KEY = 'e5574bf1-13c5-476a-b1d3-500bc640564d'

APPENDIX

1. Creating certificates

If encryption between the worker and the engine is needed then there are 2 options:

Here I will describe how to create self-signed certificates for the server and the workers.

creating the server's public / private key to serverkeystore.jck

Execute the following command:
keytool -genkeypair -alias server -keyalg RSA -keysize 1024 -storetype jceks -validity 730 -keypass password -keystore serverkeystore.jck -storepass password

The output is:

What is your first and last name?
[Unknown]: Reaction Engine
What is the name of your organizational unit?
[Unknown]: Unknown
What is the name of your organization?
[Unknown]: Reaction
What is the name of your City or Locality?
[Unknown]: Bournemouth
What is the name of your State or Province?
[Unknown]:
What is the two-letter country code for this unit?
[Unknown]: UK
Is CN=Reaction Engine, OU=Unknown, O=Reaction, L=Bournemouth, ST=Unknown, C=UK correct?
[no]: yes

creating the ADNLT653 worker's (client) public / private key to client_ADNLT653_keystore.jck

Separate certificates have to be created for every worker. It is important that the alias name ('ADNLT653' in this sample) has to be the same as the host name (it is not necessarily the real host name of the machine) of the worker defined in the worker.yml configuration file!

Execute the following command:
keytool -genkeypair -alias ADNLT653 -keyalg RSA -keysize 1024 -storetype jceks -validity 730 -keypass password -keystore client_ADNLT653_keystore.jck -storepass password

The output is:

What is your first and last name?
[Unknown]: Reaction Worker ADNLT653
What is the name of your organizational unit?
[Unknown]:
What is the name of your organization?
[Unknown]: Reaction
What is the name of your City or Locality?
[Unknown]: Bournemouth
What is the name of your State or Province?
[Unknown]:
What is the two-letter country code for this unit?
[Unknown]: UK
Is CN=Reaction Worker ADNLT653, OU=Unknown, O=Reaction, L=Bournemouth, ST=Unknown, C=UK correct?
[no]: yes

exporting the public key of ADNLT653 worker (client) from client_ADNLT653_keystore.jck to .crt file

All the workers' public keys have to be exported.

Execute the following command:
keytool -export -alias ADNLT653 -storetype jceks -keystore client_ADNLT653_keystore.jck -storepass password -file client_ADNLT653.crt

The output is:

Certificate stored in file client_ADNLT653.crt

exporting the public key of server from serverkeystore.jck to .crt file

Execute the following command:
keytool -export -alias server -storetype jceks -keystore serverkeystore.jck -storepass password -file server.crt

The output is:

Certificate stored in file server.crt

importing the server's public key to the client's (ADNLT653 worker) truststore

Execute the following command:
keytool -importcert -alias server -file server.crt -keystore client_ADNLT653_truststore.jck -keypass password -storepass password

The output is:

Owner: CN=Reaction Engine, OU=Unknown, O=Reaction, L=Bournemouth, ST=Unknown, C=UK
Issuer: CN=Reaction Engine, OU=Unknown, O=Reaction, L=Bournemouth, ST=Unknown, C=UK
Serial number: 731601f6
Valid from: Wed Dec 06 20:57:38 CET 2017 until: Fri Dec 06 20:57:38 CET 2019
Certificate fingerprints:
         MD5: 42:A7:B1:AB:C5:C5:15:EE:25:69:17:74:43:AC:31:A7
         SHA1: FA:FF:71:38:1E:17:AE:58:55:7C:1E:D8:B2:53:CE:69:CA:CF:53:45
         SHA256: 0F:2B:EF:2D:21:14:B9:F1:FC:38:4F:83:5D:E7:8F:DB:93:4D:08:17:BC:AB:B2:2A:1F:69:B0:12:6F:CB:38:A0
         Signature algorithm name: SHA256withRSA
         Version: 3

Extensions:

#1: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: 45 51 88 74 ED 62 F1 2B 05 8E E7 6B 21 6F 11 5F EQ.t.b.+...k!o._
0010: 70 93 9D 84 p...
]
]

Trust this certificate? [no]: yes
Certificate was added to keystore

importing the client's (ADNLT653 worker) public key to the server's truststore

The server's truststore must contain the public keys of all those workers where certificate-based encryption should be used.

Execute the following command:
keytool -importcert -alias ADNLT653 -file client_ADNLT653.crt -keystore servertruststore.jck -keypass password -storepass password

The output is:

Owner: CN=Reaction Worker ADNLT653, OU=Unknown, O=Reaction, L=Bournemouth, ST=Unknown, C=UK
Issuer: CN=Reaction Worker ADNLT653, OU=Unknown, O=Reaction, L=Bournemouth, ST=Unknown, C=UK
Serial number: 12f6894
Valid from: Wed Dec 06 21:04:05 CET 2017 until: Fri Dec 06 21:04:05 CET 2019
Certificate fingerprints:
         MD5: B1:7B:78:B9:80:86:3B:26:EA:73:E1:82:7A:4A:81:DD
         SHA1: F8:C5:6C:A5:36:D2:39:DD:39:67:E5:1C:E5:A2:AC:3F:4F:6A:D7:7C
         SHA256: 5D:3A:60:84:D6:B0:CD:E6:88:2B:85:D6:2B:F0:67:12:1E:55:26:B8:0B:30:6B:67:81:A0:67:14:19:A8:9E:3D
         Signature algorithm name: SHA256withRSA
         Version: 3

Extensions:

#1: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: BE 5C 65 FD 46 40 9D 34 C9 F5 D4 59 BC F0 32 94 .\e.F@.4...Y..2.
0010: 3C F0 77 AC <.w.
]
]

Trust this certificate? [no]: yes
Certificate was added to keystore

final result

After these commands the following files should exist in the folder:
- serverkeystore.jck - the server's (engine's) public/private key pair
- servertruststore.jck - the server's truststore containing the workers' public keys
- server.crt - the exported public key of the server
- client_ADNLT653_keystore.jck - the ADNLT653 worker's public/private key pair
- client_ADNLT653_truststore.jck - the ADNLT653 worker's truststore containing the server's public key
- client_ADNLT653.crt - the exported public key of the ADNLT653 worker

2. MySQL database client for the management web GUI

set up the settings_XXXX.py config file (where XXXX is the environment name):
DATABASES = {
    'default': {
        'NAME': 'reactionstore',
        'ENGINE': 'django.db.backends.mysql',
        'USER': 'reaction',
        'PASSWORD': 'reaction',
        'HOST': 'localhost',
        'PORT': '3306',
        'OPTIONS': {
            'autocommit': False,
        },
    }
}

3. Docker image

In order to demonstrate the capabilities of the Reaction Engine a Docker image can be built. In the Download section a ZIP file can be downloaded that contains the Dockerfile (that has the instructions to build the image) and some files that are needed during the build.

During the build the following components will be installed in the Docker image (only the important ones are listed):

Create Docker image

After downloading the Docker ZIP file from the Download section, unzip it. The build of the image can be started with the following command (please notice the dot at the end):
docker build -t reaction/ubuntu:v1 .
If you are behind a proxy then you can define the proxy with the --build-arg option. For example:
docker build --build-arg http_proxy=http://proxy.adnovum.hu:3128 --build-arg https_proxy=http://proxy.adnovum.hu:3128 -t reaction/ubuntu:v1 .

After the build has finished, the image can be started with the following command:
docker run -u reaction -it reaction/ubuntu:v1 /bin/bash
Similarly, if a proxy has to be set then use the following:
docker run -u reaction -e http_proxy=http://proxy.adnovum.hu:3128 -e https_proxy=http://proxy.adnovum.hu:3128 -it reaction/ubuntu:v1 /bin/bash

After starting the image with the command above, the Linux user reaction will be logged in. A script is provided to start all the Reaction services; just execute it when logged in as user reaction:
/local/reaction/start_reaction_services.sh
It will start MariaDB, Apache2, Tomcat 8, the Reaction reader worker and the Reaction executor worker.

Linux users:
root / root
reaction / reaction

Use Reaction in Docker image

First check the IP address of the Docker container by executing the command hostname -i.
The endpoint URLs to be used are as follows:

User to log in to the management app web GUI: vikhor / reactionengine
User to log in to the user management of the management app web GUI: admin / reactionengine

During the build the Reaction components are downloaded from the Download section and stored in /local/reaction. The worker is installed in /local/reaction/worker, the management app can be found in /local/reaction/management_app and the configuration files of the Reaction Engine are in /local/reaction/reaction-engine.
The home directory of Tomcat 8 is /opt/tomcat, the deployed Engine is in the webapps folder.

Use the installed flow(s)

Reaction already contains 2 execution flows (they can be examined after logging in to the management GUI).

One of them (Hermes restart if out of memory error occurs) is the one used in the videos of the presentation. It contains the data of a successful execution too (click on Monitoring and set the Events filter to empty). The flow cannot be executed successfully here as no WebLogic servers are installed.
The flow restarts WebLogic managed servers on two different hosts. It also contains an IGNORED flow execution to demonstrate that the same flow is not started if an existing instance is already running.

The other one (Record current time) is a sample flow that creates a folder if it doesn't exist yet and records the current time to the afternoon.txt file if it's in the afternoon and to the morning.txt file otherwise.
The flow can be executed by the Executor or by the Scheduler.
The automatic incident resolving can be tried out too: for this purpose a log file entry has already been created in the system resource (local Reaction Management App log) and an error detector has been created too (Recording when the user clicked on the worker status in management app).
The log file points to the log file of the management app GUI itself. The error detector checks if the following pattern occurs in the log file: .+Getting the status of the workers.+GET.+. You'll get similar texts in the log file if you click on the Worker status menu in the management app GUI, i.e. any time you open the Worker Status page the flow will be started.
The error detector is not activated by default, so the flow will only be started automatically once the error detector is set to active. First the flow has to be approved (please set the mail settings of the Engine if mail has to be sent), as this is configured in the error detector. After the approval the flow might be scheduled (if the 'approve and force to start' button is not selected) if it is outside the maintenance window (check the details of the local Reaction Management App log system entry).

4. Demonstration videos

The following videos demonstrate the capabilities of Reaction. First the installation of the components is shown, then the basic data is built with the Reaction management web application, and in the end an incident is caused and Reaction remedies it.

Installation

  1. install the reader and executor workers
    First the worker ZIP file has to be downloaded, then it has to be unzipped, the manage_*.sh (on Linux) or manage_*.bat (on Windows) files have to be configured and the worker.yml configuration file has to be set up correctly.
  2. initialise the database and create users
    First the database objects have to be created with Python/Django, then the initial data has to be loaded (by executing reaction.sql) and the superuser and the normal users of the management web application have to be created.
  3. install the Reaction Engine
    The WAR file, the Engine's config file and the logging config file have to be downloaded, the application server needs to be configured and the WAR file has to be deployed.

Creating the reference data

In the demonstration the following incident will be fixed:
There is a memory leak in the Hermes CRM application that causes application crashes. OutOfMemoryError can be seen in the application log file and the solution is to restart the server.
Hermes has 2 running instances on different machines. If the error is found on any of them then both servers have to be restarted. The middleware administrators don't want the whole restart process to be automatic, so human intervention is needed before starting the execution flow.

  1. creating the systems
    Hermes CRM has two running instances on separate host machines:
    - ADNLT653
    - ADNLT654
    As only one log file has to be observed per instance, two LOG_FILE systems have to be created.
    The two LOG_FILE systems differ only in the name of the host, so a parent system can be created where the common properties (e.g. maintenance window) can be defined.
    Hermes CRM [APPLICATION] - specifying log file location and maintenance window
        Hermes ADNLT653 [LOG_FILE] - specifying host name
        Hermes ADNLT654 [LOG_FILE] - specifying host name
  2. building the execution flow
    The execution flow is for restarting the managed WebLogic server on both machines.
    The scenario is as follows:
    1. send a mail to the business users that the application will be restarted
    2. stop the managed server on ADNLT653
    3. if the server didn't stop correctly then send a mail to administrators
    4. otherwise start the managed server on ADNLT653
    5. if the server didn't start correctly then send a mail to administrators
    6-9. otherwise do steps 2-5 on ADNLT654
    10. send a mail to the business users that the application works again
  3. creating the error detector
    The log files (systems) and the execution flow can be assigned to each other by creating an error detector.
    The pattern (java.lang.OutOfMemoryError: Java heap space) that has to be observed in the log files can be specified in the error detector too.
    Although there are two log files that have to be examined, only one error detector has to be created, as the parent system (Hermes CRM) will be used in the error detector.

Causing the error and seeing how it is fixed

1. causing the out-of-memory error in the Hermes application
2. checking that a mail has arrived saying a flow has to be confirmed
3. confirming (and forcing) the flow to start (if it were only confirmed then it would be scheduled as per the maintenance window)
4. observing the progress of the execution flow run on the Monitoring page of the Management GUI
5. skipping the failed task (check the mail about the failed task)
6. causing the error again and checking that it is ignored