The system behaves strangely: it hangs once a day, the disk is slower than before, a program crashes for no visible reason. Before reinstalling everything in sight, try the tools that show where the problem actually is.
System Logs: What Happened and When
dmesg prints the kernel ring buffer: hardware messages, driver output, filesystem events. On some distributions reading it requires sudo. With readable timestamps:
dmesg -T
Errors and warnings only, without informational noise:
dmesg -T --level=err,warn
Watch new messages in real time — useful during diagnostics:
dmesg -Tw
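When the kernel log is long, it can help to narrow it to lines that often accompany hardware faults. The following is a minimal sketch: the function name `kernel_hw_hints` and the keyword list are illustrative assumptions, not a complete catalogue of failure messages.

```shell
# Filter kernel log text on stdin for lines that often accompany hardware faults.
# The keyword list is an illustrative assumption; extend it for your hardware.
kernel_hw_hints() {
    grep -iE 'i/o error|machine check|mce:|edac|ata[0-9]+.*(error|failed)|out of memory|segfault'
}

# Example on a real system:
#   sudo dmesg -T --level=err,warn | kernel_hw_hints
```

An empty result does not prove the hardware is healthy; it only means none of the listed patterns appeared.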
journalctl — the systemd journal. Shows logs for the whole system and individual services.
Errors only from the current boot:
journalctl -p err -b
Errors from the previous boot — useful if the system crashed and rebooted:
journalctl -p err -b -1
Errors from a specific service in the last hour:
journalctl -u nginx --since "1 hour ago" -p err
List all boots — shows when restarts occurred:
journalctl --list-boots
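To see which program produces most of the errors, the journal output can be grouped by the logging program. This is a sketch that assumes journalctl's default "short" output format, where the fifth field looks like `program[pid]:`; the function name `count_by_program` is hypothetical.

```shell
# Count journal lines per logging program, most frequent first.
# Assumes the default "short" output, where field 5 is "program[pid]:".
count_by_program() {
    awk '$1 != "--" && NF >= 5 {
             name = $5
             sub(/\[[0-9]+\]:?$/, "", name)   # strip a "[pid]:" suffix
             sub(/:$/, "", name)              # strip a bare trailing colon
             count[name]++
         }
         END { for (n in count) print count[n], n }' | sort -rn
}

# Example on a real system:
#   journalctl -p err -b --no-pager | count_by_program
```

Lines starting with `--` (boot separators and similar markers) are skipped so they do not distort the counts.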
Hard Drive and SSD: Catching Problems Before Failure
Drives often degrade before they fail outright, and S.M.A.R.T. records the warning signs: error counts, reallocated sectors, temperature. Install the tool:
sudo apt install smartmontools
Full health report for a drive:
sudo smartctl -a /dev/sda
List available drives:
lsblk
Key fields to check in smartctl output:
Reallocated_Sector_Ct: any non-zero raw value signals degradation.
Current_Pending_Sector: unstable sectors not yet reallocated.
Offline_Uncorrectable: sectors that cannot be read.
SMART overall-health self-assessment test result: PASSED or FAILED, the drive's own final verdict.
Run a quick built-in test (takes 1-2 minutes):
sudo smartctl -t short /dev/sda
Check the result a few minutes later:
sudo smartctl -l selftest /dev/sda
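The three sector attributes mentioned above can be checked automatically. This sketch assumes the ATA attribute table printed by `smartctl -A`, where the second column is the attribute name and the tenth is RAW_VALUE; NVMe drives report health in a different layout, so the function (hypothetically named `smart_warnings`) would print nothing for them.

```shell
# Print the three warning attributes whose raw value is non-zero.
# Assumes the ATA attribute table: column 2 = name, column 10 = RAW_VALUE.
smart_warnings() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
             print $2 "=" $10
         }'
}

# Example on a real system:
#   sudo smartctl -A /dev/sda | smart_warnings
```

No output means all three raw values were zero or the attributes were not present in the table.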
Filesystem: Check and Repair
fsck checks the filesystem and fixes errors. Important: run only on an unmounted partition.
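Force a root filesystem check on next boot. Creating this file is the traditional method; on systemd-based distributions it may be ignored, and the documented alternative is booting once with the fsck.mode=force kernel parameter: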
sudo touch /forcefsck
Check a partition manually (unmount first):
sudo umount /dev/sda1
sudo fsck -f /dev/sda1
The -f flag forces the check even if the filesystem is marked clean. -y answers yes to every repair prompt, so use it only when the data is backed up or the risk is acceptable:
sudo fsck -fy /dev/sda1
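Because running fsck on a mounted partition can damage it, a small guard can refuse to proceed when the device is listed as mounted. This is a sketch: `is_mounted` and `safe_fsck` are hypothetical helper names, and the check only compares the device name against the first column of the mounts table.

```shell
# Succeed if the device appears as a mount source in the given mounts table.
# Defaults to /proc/mounts; the second argument exists only to ease testing.
# Caveat: a partition used indirectly (LVM, LUKS, RAID member) does not
# appear under its own name, so this simple comparison will not catch it.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Run fsck only when the device is not listed as mounted.
safe_fsck() {
    if is_mounted "$1"; then
        echo "refusing: $1 is mounted" >&2
        return 1
    fi
    sudo fsck -f "$1"
}

# Example on a real system:
#   safe_fsck /dev/sda1
```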
RAM: Finding Bad Cells
Memory errors cause the strangest bugs: random program crashes, file corruption, kernel panics with no apparent reason.
memtest86+ is a bootable tool that runs before the OS starts. Add it to the GRUB menu:
sudo apt install memtest86+
sudo update-grub
Reboot, select memtest86+ from the GRUB menu. A full test takes several hours — best left overnight.
stress-ng — load testing from within the running system:
sudo apt install stress-ng
sudo stress-ng --cpu 4 --io 2 --vm 1 --vm-bytes 512M --timeout 60s --metrics-brief
If the system hangs, reboots, or crashes during the test, a hardware fault (memory, CPU, power supply, or cooling) is the likely cause. --metrics-brief prints a summary with bogo operations per second for each stressor.
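A fixed 512M may be too small to exercise much memory on a large machine, or too large on a small one. One option is to size --vm-bytes from the memory currently available. The sketch below reads MemAvailable from /proc/meminfo; the function name `half_available_mb` and the one-half fraction are assumptions chosen for illustration.

```shell
# Print half of MemAvailable in megabytes, formatted for --vm-bytes.
# /proc/meminfo reports kilobytes, so dividing by 2048 gives half in MB.
# The optional argument is an alternative meminfo file, used for testing.
half_available_mb() {
    awk '/^MemAvailable:/ { print int($2 / 2048) "M" }' "${1:-/proc/meminfo}"
}

# Example on a real system:
#   sudo stress-ng --vm 1 --vm-bytes "$(half_available_mb)" --timeout 60s --metrics-brief
```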
Network: Drops, Packet Loss, Route
Check basic connectivity with packet loss statistics:
ping -c 20 8.8.8.8
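In the statistics line, 0% packet loss means every probe was answered. Sustained loss points to a problem somewhere on the path, although some hosts rate-limit or drop ICMP, so confirm against a second target.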
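For scripted checks, the loss percentage can be pulled out of the summary line. This sketch assumes the summary format used by common Linux ping implementations, for example `20 packets transmitted, 18 received, 10% packet loss, time 19027ms`; the function name `packet_loss` is hypothetical.

```shell
# Extract the packet loss percentage (without the % sign) from ping output on stdin.
# Assumes a summary line containing "<number>% packet loss".
packet_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example on a real system:
#   ping -c 20 8.8.8.8 | packet_loss
```

The printed number can then be compared against a threshold in a monitoring script.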
Route to host with per-hop latency:
traceroute 8.8.8.8
mtr combines ping and traceroute, updates in real time:
sudo apt install mtr
mtr 8.8.8.8
Errors on network interfaces — corrupted packets, collisions:
ip -s link
If the errors or dropped counters are non-zero and keep growing, suspect the network interface, its driver, or the cable.
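The same counters are also exposed as files under /sys/class/net, which is easier to script than parsing `ip -s link`. The following sketch lists only non-zero error and drop counters; the function name `iface_errors` is hypothetical, and its optional directory argument exists only so it can be pointed at test data.

```shell
# Print error and drop counters for every interface that has a non-zero value.
# Reads the kernel's per-interface statistics files under the base directory.
iface_errors() {
    base="${1:-/sys/class/net}"
    for dir in "$base"/*/statistics; do
        [ -d "$dir" ] || continue
        name=$(basename "$(dirname "$dir")")
        for counter in rx_errors tx_errors rx_dropped tx_dropped; do
            [ -r "$dir/$counter" ] || continue
            value=$(cat "$dir/$counter")
            if [ "$value" -gt 0 ]; then
                echo "$name $counter=$value"
            fi
        done
    done
    return 0
}

# Example on a real system:
#   iface_errors
```

Running it twice a few minutes apart shows whether a counter is still growing or is left over from an old incident.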
Quick Reference
| What to check | Command |
|---|---|
| Kernel errors | dmesg -T --level=err,warn |
| systemd errors this boot | journalctl -p err -b |
| Errors from previous boot | journalctl -p err -b -1 |
| Drive S.M.A.R.T. status | sudo smartctl -a /dev/sda |
| Quick drive self-test | sudo smartctl -t short /dev/sda |
| Filesystem check | sudo fsck -f /dev/sda1 |
| Hardware stress test | sudo stress-ng --cpu 4 --vm 1 --vm-bytes 512M --timeout 60s |
| Packet loss | ping -c 20 8.8.8.8 |
| Route with latency | mtr 8.8.8.8 |
| Network interface errors | ip -s link |