DCImanager 6 Knowledge Base

Basic diagnostics

This article lists commands for identifying why the platform is not operating correctly, as well as commands for restarting containers and certain services to restore their operation.

Some commands may require superuser privileges.

General diagnostics

This section lists commands that you can run as a first diagnostic step. They help rule out basic problems and reduce troubleshooting time.

Operating system (OS) version

If the OS of the master server or of a node is not supported by the platform, the installation or connection will end with an error. To determine the OS version, run the command:

cat /etc/*release
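When this check is scripted, the distribution ID and version can be read as two fields. The sketch below is only an illustration: os_summary is a hypothetical helper name, and it assumes that the system ships an os-release style file that defines ID and VERSION_ID.

```shell
# Print "<ID> <VERSION_ID>" from an os-release style file.
# os_summary is a hypothetical helper, not a platform command.
os_summary() {
    file="${1:-/etc/os-release}"
    # Source the file in a subshell so that its variables do not leak out.
    ( . "$file" && printf '%s %s\n' "${ID:-unknown}" "${VERSION_ID:-unknown}" )
}
```

Calling os_summary without arguments reads /etc/os-release. Compare the result with the list of supported operating systems.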

For a list of supported operating systems, see the DCImanager 6 documentation.

Server date and time

During periodic synchronization with the license server, the date and time are checked. If an incorrect date or time is set on the server with the platform, the platform will be blocked or its operation will be incorrect. To determine the current date and time on the server, run the command:

date -R

Disk space and RAM usage

For the platform to operate correctly, free disk space and RAM must meet the requirements specified in the Server requirements article in the VMmanager documentation. In addition, if there is not enough free disk space or RAM, virtual machines and backups will not be created. To check disk space usage and file system types, run the command:

df -hT

To check the information about RAM, run the command:

free -h
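If the RAM check is part of a script, the amount of available memory can be extracted as a single number. This is a minimal sketch: mem_available_mb is a hypothetical helper, and it assumes that free prints an "available" column as the seventh field of the "Mem:" line, which older versions of the utility do not do.

```shell
# Read `free -m` output on stdin and print the "available" value in MiB.
# mem_available_mb is a hypothetical helper; it assumes the column layout
# total / used / free / shared / buff/cache / available.
mem_available_mb() {
    awk '$1 == "Mem:" { print $7 }'
}
```

Example of use: LC_ALL=C free -m | mem_available_mb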

Inodes

An inode is the data structure in which file metadata is stored. The platform will not work correctly if the server runs out of inodes, even if there is free disk space. Typical symptoms of an inode shortage include reduced performance, inability to create files, and incorrect output of information in the platform interface. To check the number and proportion of inodes used in each file system, run the command:

df -i
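To spot nearly full file systems without reading the whole table, the df output can be filtered by percentage. The sketch below is an illustration under stated assumptions: usage_over is a hypothetical helper, the limit of 90 is arbitrary, and the input must come from df with the -P option so that the percentage is in the fifth column. Mount points that contain spaces are printed only up to the first space.

```shell
# Read `df -P` (disk space) or `df -Pi` (inodes) output on stdin and print
# the mount points whose usage percentage is at or above the given limit.
# usage_over is a hypothetical helper; the default limit of 90 is arbitrary.
usage_over() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%$/, "", pct)
            # Skip entries such as "-" that carry no percentage.
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, $5
        }'
}
```

Example of use: df -Pi | usage_over 90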

Troubleshooting Linux systems

Examine the system logs to troubleshoot your Linux system. Below are tools for troubleshooting and searching for errors in the server logs. For more information about the product logs, see the DCImanager 6 documentation Platform Logs article.

Circular kernel buffer

One way to find out whether the system is working incorrectly is to look at the kernel log using the dmesg utility. The kernel records all events in a circular buffer while the system is booting and running. dmesg lets you examine the kernel messages and identify hardware-related problems. To search for problems, run the command:

dmesg | grep -i -E 'error|failed|critical|bug|panic'

The journalctl utility

You can use the journalctl utility to analyze the logs and detect system problems. The utility displays the logs of the Linux system services. To detect abnormal behavior of a Linux system, run the command:

journalctl | grep -i -E 'error|failed|critical|bug|panic'
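On a busy server the grep output can be long, so a per-keyword count may help decide where to look first. The following sketch is an illustration only: count_keywords is a hypothetical helper that uses the same keywords as the grep pattern above. Like that pattern, it matches substrings, so a line containing "debug" also counts as "bug".

```shell
# Read log text on stdin and print how many lines mention each keyword
# from the grep pattern used in this article (case-insensitive).
# count_keywords is a hypothetical helper.
count_keywords() {
    awk '
        BEGIN { n = split("error failed critical bug panic", kw, " ") }
        {
            line = tolower($0)
            for (i = 1; i <= n; i++) if (index(line, kw[i])) count[kw[i]]++
        }
        END { for (i = 1; i <= n; i++) printf "%s %d\n", kw[i], count[kw[i]] }'
}
```

Example of use: journalctl -b --no-pager | count_keywords. The -b option limits the output to the current boot. The same function accepts dmesg output.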

Restarting the platform

The log files will be deleted after the restart of the platform. Before restarting, we recommend that you use the script to collect logs. For instructions, see the DCImanager 6 documentation Platform Log Files article.

Restarting the platform can solve a number of basic problems, such as freezing, incorrect display of information in the interface, and incorrect behavior of service processing modules. To restart the platform, run the command:

dci stop && dci start

Containers

The docker service

The Docker daemon is a service that manages containers as well as other docker entities: networks, storage and images. If this service is not running, the platform will not work. To check the status of docker, run the command:

systemctl status docker

If the service is stopped, start it with the command:

systemctl start docker
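The status check and the start command can be combined so that the service is started only when it is not already running. This is a minimal sketch: ensure_active is a hypothetical helper, and it must be run with superuser privileges.

```shell
# Start the given systemd unit only if systemd does not report it as active.
# ensure_active is a hypothetical helper; it needs superuser privileges.
ensure_active() {
    unit="${1:-docker}"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is already active"
    else
        echo "$unit is not active, starting it"
        systemctl start "$unit"
    fi
}
```

Example of use: ensure_active docker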

To check the version of docker, run the command:

docker version

Restarting the docker service

If the docker service is not working properly, restarting the service helps fix it. To do this, run the command:

systemctl restart docker.service

Restarting the service helps fix a number of errors that can occur when starting, restarting, or shutting down the platform:

  • error while removing network: network <network_name> has active endpoints

    Error example
    Removing network dci_auth
    error while removing network: network dci_auth id 777yy66lkjh has active endpoints
    exit status 1
  • ERROR: for input Cannot start service input: driver failed programming external connectivity on endpoint dci_input_1

To correct the above errors, restart the docker service with the command above.

If the problem persists, contact technical support through your client area under Support → Support tickets → Add.

Status of containers

To diagnose possible problems, display a list of containers and their statuses. To display a list of all running containers, run the command:

docker ps

To get a list of all containers, including stopped ones, run the command:

docker ps -a

If you want to check the status of a specific container, run the command:

docker ps | grep <container_name>
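To see only the containers that need attention, the list can be filtered by status. The sketch below is an illustration: not_running is a hypothetical helper, and it assumes that the status text of a running container starts with "Up", as in the default docker ps output.

```shell
# Read "name<TAB>status" lines, as printed by
#   docker ps -a --format '{{.Names}}\t{{.Status}}'
# and print only the containers whose status does not start with "Up".
# not_running is a hypothetical helper.
not_running() {
    awk -F '\t' '$2 !~ /^Up/ { print $1 ": " $2 }'
}
```

Example of use: docker ps -a --format '{{.Names}}\t{{.Status}}' | not_running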

Restarting the container

If the container does not work correctly, restarting it may help. To do this, run the command:

docker restart <container_name>

To restart modified docker containers on the location server, use the commands:

/opt/ispsystem/dci/etc/location_compose.sh down
/opt/ispsystem/dci/etc/location_compose.sh up -d

For more information about the modified docker containers, see the article Modifying containers on the location server in the DCImanager 6 documentation.

Task manager

The task manager in DCImanager 6 consists of two parts:

  1. TaskManager worker is an executable file that runs as a service. The file receives the list of tasks from the TaskManager master service and runs files and scripts.
  2. TaskManager master is a service that accepts requests for registration of tasks, controls queues, and interacts with the database and TaskManager worker.

If the task manager does not work correctly, for example, if tasks freeze, restarting the TaskManager worker may help. To do this, run the command:

docker exec -it dci_back supervisorctl restart taskmgr_worker

If this does not help, restart the TaskManager master:

docker restart dci_taskmgr_1

Logging

To analyze a container's events, examine its log. To display the last 100 lines of the container log, run the command:

docker logs --tail 100 <container_name>
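When several containers are involved, it may be convenient to save the tail of every container log into one directory before restarting anything. This is a sketch, not the platform's log collection script: save_logs is a hypothetical helper, and the directory and the number of lines are chosen by the caller.

```shell
# Save the last N log lines of every container into a directory,
# one file per container. save_logs is a hypothetical helper.
save_logs() {
    dir="$1"
    lines="${2:-100}"
    mkdir -p "$dir" || return 1
    docker ps -a --format '{{.Names}}' | while read -r name; do
        # docker logs writes to both stdout and stderr, so capture both.
        docker logs --tail "$lines" "$name" > "$dir/$name.log" 2>&1 < /dev/null
    done
}
```

Example of use: save_logs /tmp/container_logs 100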

Proxying connections to BMC

If you have problems connecting to BMC through the proxy module, manually restart the proxy server for BMC. To do this:

  1. Restart the module on the location server:

    docker-compose -f /opt/ispsystem/ipmi_proxy_service/etc/docker/ipmi_proxy_v2.yml down
    docker-compose -f /opt/ispsystem/ipmi_proxy_service/etc/docker/ipmi_proxy_v2.yml up -d
  2. Restart the drop_sessions service in the dci_ipmi_proxy_v2_client_1 container on the server with the platform:

    docker exec -it dci_ipmi_proxy_v2_client_1 supervisorctl restart drop_sessions

    Container names may differ depending on the version of Docker Compose in use: a hyphen may appear in place of the underscore character. To get the exact names of the containers, run the command:

    docker ps -a

For more information about the BMC proxy module, see the article BMC proxy module operation in the DCImanager 6 documentation.
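Because container names may use either hyphens or underscores, a script can look up the actual name before calling docker exec. The sketch below is an illustration: match_container is a hypothetical helper that treats the two characters as equivalent when comparing names.

```shell
# Read container names on stdin and print the first one that equals the
# wanted name once hyphens and underscores are treated as the same character.
# match_container is a hypothetical helper.
match_container() {
    awk -v want="$1" '
        BEGIN { gsub(/-/, "_", want) }
        {
            name = $0
            gsub(/-/, "_", name)
            if (name == want) { print $0; exit }
        }'
}
```

Example of use: docker ps -a --format '{{.Names}}' | match_container dci_ipmi_proxy_v2_client_1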

Connecting the location server

This section describes how to troubleshoot location connectivity issues.

The sshd daemon program

sshd is the SSH daemon that accepts connection requests. Before connecting the location server, check that sshd is running on it. To do this, run the command:

systemctl status sshd

Check the configuration file /etc/ssh/sshd_config for these parameters:

PermitRootLogin yes
PasswordAuthentication yes

Add them if they are missing. They are needed to connect the location server via SSH.
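The two parameters can be checked from the command line instead of reading the file by eye. This is a minimal sketch under stated assumptions: sshd_option is a hypothetical helper that reads only the given file, expects the "keyword value" form separated by whitespace, and ignores Match blocks and Include files.

```shell
# Print the value of the first uncommented occurrence of a keyword in an
# sshd_config style file (sshd uses the first value it obtains).
# Keywords are compared case-insensitively. sshd_option is a hypothetical helper.
sshd_option() {
    awk -v key="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" '
        /^[ \t]*#/ { next }
        tolower($1) == key { print $2; exit }' "${2:-/etc/ssh/sshd_config}"
}
```

Example of use: sshd_option PermitRootLogin. An empty result means that the parameter is not set explicitly in the file. To see the effective configuration as sshd itself resolves it, run sshd -T as the superuser.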

In addition, check the following:

  • availability of the location server from the platform server using the ping or curl utility;
  • space available on the location server with the command df -h;
  • compliance with other location requirements. Read more in Locations. General information in the DCImanager 6 documentation.

Firewall

If the firewall has no rules for the docker service, the platform and the network may not work correctly. The necessary rules are created automatically when the docker service starts. We do not recommend modifying or deleting them manually.

Service status and configuration

To check the status of the firewall service, run the command depending on the OS:

For Ubuntu, Astra Linux
systemctl status nftables
For CentOS, AlmaLinux
systemctl status firewalld

To display the service configuration, run the command depending on the operating system:

For Ubuntu
nft list ruleset
For CentOS, AlmaLinux
firewall-cmd --list-ports

Restarting the service

Restarting the service is necessary if it does not work correctly, as well as to restore the default rules. To restart the service, run the command:

For CentOS, AlmaLinux
systemctl restart firewalld.service
For Ubuntu
systemctl restart nftables.service

To restore the default rules:

  1. Restart the firewall service with one of the commands presented above.
  2. Restart docker with the command:

    systemctl restart docker.service
  3. Restart the platform with the command from the section Restarting the platform of this article.
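The three steps above can be sketched as a script that prints the commands for review instead of running them. This is an illustration only: both function names are hypothetical, and the OS identifiers (ubuntu, astra, centos, almalinux) are assumptions about the ID value in /etc/os-release.

```shell
# Map an os-release ID to the firewall unit named in this article.
# The ID strings are assumptions about what /etc/os-release reports.
firewall_unit_for() {
    case "$1" in
        ubuntu|astra)     echo nftables ;;
        centos|almalinux) echo firewalld ;;
        *)                return 1 ;;
    esac
}

# Print the restore sequence instead of running it, so it can be reviewed first.
print_restore_steps() {
    unit=$(firewall_unit_for "$1") || { echo "unknown OS: $1" >&2; return 1; }
    echo "systemctl restart $unit.service"
    echo "systemctl restart docker.service"
    echo "dci stop && dci start"
}
```

Example of use: print_restore_steps almalinux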

Searching for information in the database

There are potential risks involved in tampering with the DB. We do not recommend making manual edits to the database, as it can disrupt the correct operation of the platform.

Any actions with the database should be performed only after backing up the platform. 

Using queries to the database, you can see information about the state of VMs, nodes and other platform objects. Below is a list of queries for retrieving data from the database.

To run queries, connect to the MySQL container:

docker exec -it mysql bash -c "mysql auth -p\$MYSQL_ROOT_PASSWORD"

Location information

Information about the location will help you see its complete status.

To get information about a location, run the query:

SELECT userspace,proxy_params,is_main,settings,params_netflow,params_nginx,params_redis,params_dhcp,setup_info,ssh_user,ssh_port,ssh_address,geolocation,status_info,status,id FROM dci_location WHERE id='<location_id>'\G

To check the OS templates on the location server, run the query:

select * from dci_location_os_template where id in (1,2,3) \G

Tasks

Information about the task status will allow you to track it and check if the status of a frozen task has changed.

If the task freezes while it is in the “In progress” status, you can track the process with the query:

select * from taskmgr_task where status IN ('running')\G

To count the tasks that are in the "created" status, run the query:

select count(*) from taskmgr_task where status='created';