Notes on using LabKey as a socket client to initiate analysis on a remote processing PC

Local job execution managed by LabKey

LabKey was modified to initiate pipeline jobs on the local machine using the trigger-script machinery. The process is split between two pieces of software:

  • LABKEY/analysisModule. The analysis module runs on the LabKey server. It is written in JavaScript and combines data from the instruction list (Runs) with the JSON configuration from Analysis/@files/configurationFiles to initiate a Python script that runs as user tomcat8 on the LabKey server. The actual Python scripts are part of the project-specific code and are expected to reside in the standard software directories (e.g. /home/nixUser/software/src). The task of the module is to format a shell command that combines the items from the line, and to execute it.
  • LABKEY/analysisInterface. This is a set of overhead Python routines that delegate execution to the project-specific scripts and manage the program execution flags (PROCESSING, FAILED, DONE) and the log files that are shown alongside job execution.
  • The native pipeline execution was abandoned due to a lack of flexibility.
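The flag and log handling described above might look roughly like this (a minimal sketch; the function name, flag-file layout, and log-file name are illustrative, not the actual analysisInterface API):

```python
import subprocess
from pathlib import Path

# Illustrative status flags matching those listed above.
PROCESSING, DONE, FAILED = "PROCESSING", "DONE", "FAILED"

def run_job(cmd, workdir):
    """Run a project-specific shell command, maintaining a status flag
    and a log file in the job's working directory."""
    workdir = Path(workdir)
    status_file = workdir / "status"
    log_file = workdir / "job.log"

    status_file.write_text(PROCESSING)
    with log_file.open("w") as log:
        result = subprocess.run(cmd, shell=True,
                                stdout=log, stderr=subprocess.STDOUT)
    status_file.write_text(DONE if result.returncode == 0 else FAILED)
    return result.returncode
```

The shell command itself is the one formatted by analysisModule; the wrapper only brackets it with flag updates and captures its output.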

Remote job execution model

Processing PCs are kept distinct from the database PCs. LabKey requires such processing PCs to run an equivalent LabKey software stack, which makes the infrastructure overwhelming. Some thoughts:

  • As a replacement, a Python socket model is suggested. As before, analysisModule formats the call, but sends it to a socket rather than executing it directly. Since combining sockets across multiple programming languages may be cumbersome, it is probably best to still start a shell command, but with a flag that tells analysisInterface to start the job remotely.
  • The remote socket starts a Python job involving analysisInterface; this is identical to the current system call performed by analysisModule, except in Python, which might enable some shortcuts, i.e. starting directly from Python. The nice thing about shells is that they run as new threads. However, the previous item already has analysisInterface running as a socket client, so it might be best for analysisInterface to use sockets directly.
  • analysisInterface runs on the processing PC and manages the (local) logs and execution. The status of the initiating job is updated via the embedded labkey/python interface, so no additional sockets are needed. The log file can be transmitted at the end of the job, although running updates might be of interest; these could be handled by analysisInterface using smart uploading strategies that append rather than transmit full files.
  • Due to the asynchronicity of the submissions, a queue could be implemented at the processing PC site, probably by analysisInterface, to keep the socket itself as transparent as possible. Which begs the question of how processes initiated by different users could be aware of each other. But wait - the user running the socket will be the single user that executes the code, hence a plain JSON database is fine. Speaking of databases - it might as well use the originating database, which will have to be modified to act as a queue anyhow, eliminating the need for a local JSON file or other overhead.
  • This makes the remote pipeline fully transparent: the end-user incurs no additional overhead by using remote as opposed to local hosts.
  • Let's recapitulate: analysisInterface gets a submit-job request via the socket. It checks back with the server whether it has any jobs running. Here we could apply filters that would allow multiple non-interfering jobs to run simultaneously, but prevent interfering jobs from being started. The Python instance waits in a low-budget loop and checks whether its turn has come. To preserve order, all jobs issued previously must reach a conclusive state (DONE/FAILED) and no QUEUED job should be ahead in the queue. Then the loop completes and the shell command is issued; the loop switches to waiting for completion, part of which could be a periodic log update. Once the job is completed, the status must be changed - now a critical step, since further jobs might await that flag.
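The queueing loop recapitulated above could be sketched as follows. The two helpers fetch_queue and set_status are hypothetical stand-ins for the embedded labkey/python interface that reads and writes the queue in the originating database:

```python
import subprocess
import time

QUEUED, PROCESSING, DONE, FAILED = "QUEUED", "PROCESSING", "DONE", "FAILED"

def wait_and_run(job_id, cmd, fetch_queue, set_status, poll_interval=30):
    """Low-budget polling loop: wait until all jobs issued earlier have
    reached a conclusive state (DONE/FAILED) and no QUEUED job is ahead
    of us, then issue the shell command and update the status flag.

    fetch_queue() -> ordered list of (job_id, status) pairs;
    set_status(job_id, status) writes the flag back to the server.
    """
    set_status(job_id, QUEUED)
    while True:
        ahead = []
        for jid, status in fetch_queue():
            if jid == job_id:
                break
            ahead.append(status)
        # Our turn: every earlier job concluded, none still queued/running.
        if all(s in (DONE, FAILED) for s in ahead):
            break
        time.sleep(poll_interval)

    set_status(job_id, PROCESSING)
    result = subprocess.run(cmd, shell=True)
    # Critical update: later jobs in the queue wait on this flag.
    set_status(job_id, DONE if result.returncode == 0 else FAILED)
```

The "filters" idea from the text would slot into the `all(...)` condition, e.g. by ignoring earlier jobs classified as non-interfering.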

Network considerations

  • List ports: ss -tulpn | grep LISTEN
  • Open the port for a specific client, drop everyone else:
    iptables -I INPUT -p tcp -s X.X.X.X/32 --dport 8765 -j ACCEPT
    iptables -A INPUT -p tcp -s 0.0.0.0/0 --dport 8765 -j DROP
  • Remove iptables rule: sudo iptables -D INPUT -m conntrack --ctstate INVALID -j DROP
  • The message should contain the calling server and the jobId. analysisInterface should hold a mapping of server to configuration. Does websockets report the caller's identity? It does, and it can be used: websocket.remote_address[0]

Server setup

Processor side:

  • Clone websocket. Edit serviceScripts/env.sh to change IPSERVER and IPCLIENT
  • Clone analysisInterface
  • Check .labkey/setup.json for proper paths and venv. In particular, check that softwareSrc is set in paths.
  • Start server: $HOME/software/src/websocket/serviceScripts/start.sh
  • Enable port: sudo $HOME/software/src/websocket/serviceScripts/open_port.sh

Client (labkey) side:

  • Check from the labkey PC by cloning websocket (as above), installing websockets (pip3 install websockets) and running: $HOME/software/src/websocket/send.py AA.BB.CC.DD:TEST:X, where AA.BB.CC.DD is the IP address of the server, or its name if set by DNS.
  • Install websockets for tomcat8 user.

Debug

Check iptables!

Discussion