[Linux Kernel panic issue] hung_task_timeout_secs and blocked for more than 120 seconds problem

The following messages appeared in the Linux system log:

 

May 17 04:06:57 sam kernel: INFO: task python2.6:32287 blocked for more than 120 seconds.
May 17 04:06:57 sam kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 04:06:57 sam kernel: python2.6     D ffff8101dbaa20c0     0 32287  12604               32286 (NOTLB)
May 17 04:06:57 sam kernel:  ffff81016bd33ef8 0000000000000082 ffff810bff1a0570 ffffffff8001adcf
May 17 04:06:57 sam kernel:  000000000000680b 000000000000000a ffff8101f6d73820 ffff8101dbaa20c0
May 17 04:06:57 sam kernel:  003f11bdbeb7902a 000000000103012a ffff8101f6d73a08 0000001100001000
May 17 04:06:57 sam kernel: Call Trace:
---- (the same messages repeat from here on) ----
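
The `D` after the task name in the log is the process state: uninterruptible sleep, which usually means the task is waiting on I/O. A minimal sketch for spotting such tasks while the problem is happening, assuming the procps `ps` command is installed. The helper name `filter_blocked` is made up for this example.

```shell
# filter_blocked: read "STATE PID COMMAND" lines on stdin and print the
# PID and command of every task whose state starts with D
# (uninterruptible sleep). Hypothetical helper, not a standard tool.
filter_blocked() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Typical use on a live system (the trailing "=" suppresses the headers):
#   ps -eo state=,pid=,comm= | filter_blocked
```

If the same PID keeps showing up across several runs, that task is a likely candidate for the one the kernel is warning about.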

 

Checking the system's utilization history shows that CPU usage climbed sharply around the time the messages were logged.

 

Timestamp          CPU    MEMORY  DISK   SWAP
2016-05-17 10:19    1.12   0.52   47.88  0     <-- utilization after the reboot
2016-05-17 04:05   96.36  23.24   47.87  0
2016-05-17 04:04   87.69  23.23   47.87  0
2016-05-17 04:03   49.55  23.19   47.87  0

 

The server load climbed so high that it could no longer even be checked.

 

 

 

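As the log message itself suggests, apply the following. Note that this only disables the warning; it does not remove whatever caused the task to block.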

 

As root:

 

echo 0 > /proc/sys/kernel/hung_task_timeout_secs
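
Before overwriting the value, it can help to note what it was, since `0` switches the check off entirely. A small sketch for reading it back; the `HUNG_TASK_FILE` variable and the function name are assumptions introduced here so that the function can also be pointed at a scratch file.

```shell
# Path of the tunable; overridable so the function can be tried on a copy.
HUNG_TASK_FILE="${HUNG_TASK_FILE:-/proc/sys/kernel/hung_task_timeout_secs}"

# describe_hung_task_timeout: print the current setting in words.
describe_hung_task_timeout() {
    if [ ! -r "$HUNG_TASK_FILE" ]; then
        echo "hung task check not available"
        return 0
    fi
    hung_task_secs=$(cat "$HUNG_TASK_FILE")
    if [ "$hung_task_secs" -eq 0 ]; then
        echo "hung task warning disabled"
    else
        echo "warning after $hung_task_secs seconds"
    fi
}
```

Calling `describe_hung_task_timeout` before and after the `echo 0` shows the change, and the earlier value (120 in the log above) can be written back the same way to re-enable the warning.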

 

Next, check what the dirty-page ratios are currently set to, then lower them:

#  sysctl -a | grep vm.dirty | grep ratio
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10

 

#  sysctl -w vm.dirty_ratio=10

#  sysctl -w vm.dirty_background_ratio=5

#  sysctl -p

 

Add the following two lines to /etc/sysctl.conf (for example with vi /etc/sysctl.conf) so the settings survive a reboot:

 

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
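
To double-check that the file really contains the intended values before rebooting, something like the following can be used. This is only a sketch: `conf_value` is a hypothetical helper, and it assumes simple `key = value` lines like the ones above.

```shell
# conf_value FILE KEY: print the value assigned to KEY in a
# sysctl.conf-style file, ignoring comment lines and surrounding
# whitespace. If KEY appears more than once, the last assignment wins.
conf_value() {
    awk -F= -v key="$2" '
        /^[ \t]*[#;]/ { next }                 # skip comment lines
        {
            k = $1; v = $2
            gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)     # trim the value
            if (k == key) found = v
        }
        END { if (found != "") print found }
    ' "$1"
}

# Example:
#   conf_value /etc/sysctl.conf vm.dirty_ratio
```

After a reboot, `sysctl -n vm.dirty_ratio` should report the same number that this prints.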

 

#  reboot

 

[Good Luck]

 

vm.dirty_background_ratio is the percentage of system memory that can be filled with “dirty” pages (memory pages that still need to be written to disk) before the pdflush/flush/kdmflush background processes kick in to write them to disk. At 10%, the value this server had before the change, a machine with 32 GB of memory can have about 3.2 GB of data sitting in RAM before anything is done.
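
The 10% of 32 GB figure can be checked with plain shell arithmetic. This is a simplified sketch: it applies the percentage to total memory, whereas the kernel applies it to available (free plus reclaimable) memory, so real thresholds come out somewhat lower. The function name is made up for the example.

```shell
# dirty_threshold_mb MEM_MB RATIO: RATIO percent of MEM_MB, in whole MB
# (integer division, so fractions of a MB are dropped).
dirty_threshold_mb() {
    echo $(( $1 * $2 / 100 ))
}

dirty_threshold_mb 32768 10   # prints 3276 (about 3.2 GB) for a ratio of 10
dirty_threshold_mb 32768 5    # prints 1638 (about 1.6 GB) for a ratio of 5
```

Halving the ratio halves the amount of unwritten data that can pile up before background writeback starts.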

vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.
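
How much dirty data is actually pending at any moment can be read from /proc/meminfo. A minimal sketch, with a hypothetical function name and an optional file argument so it can be tried on a saved copy:

```shell
# dirty_kb [FILE]: print the Dirty and Writeback counters (in kB) from a
# /proc/meminfo-style file; FILE defaults to /proc/meminfo.
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }' "${1:-/proc/meminfo}"
}

# Example:
#   dirty_kb
```

Running it repeatedly during a heavy write should show Dirty growing and then falling as the data is flushed; a value that stays close to the vm.dirty_ratio limit suggests writers are being throttled.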

vm.dirty_background_bytes and vm.dirty_bytes are another way to specify these parameters. If you set the _bytes version the _ratio version will become 0, and vice-versa.
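
Because setting one form zeroes the other, whichever of the two is non-zero tells you which form is in effect. A small sketch of that check; the function name is an assumption, and vm.dirty_bytes may not exist on older kernels.

```shell
# dirty_limit_mode RATIO BYTES: report which form of the limit is active,
# given the values read from vm.dirty_ratio and vm.dirty_bytes.
dirty_limit_mode() {
    if [ "$2" -ne 0 ]; then
        echo "bytes limit: $2"
    else
        echo "ratio limit: $1%"
    fi
}

# Example on a live system:
#   dirty_limit_mode "$(sysctl -n vm.dirty_ratio)" "$(sysctl -n vm.dirty_bytes)"
```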

vm.dirty_expire_centisecs is how long something can be in cache before it needs to be written. With the usual default of 3000 centiseconds, that is 30 seconds. When the pdflush/flush/kdmflush processes kick in they will check to see how old a dirty page is, and if it’s older than this value it’ll be written asynchronously to disk. Since holding a dirty page in memory is unsafe this is also a safeguard against data loss.

vm.dirty_writeback_centisecs is how often the pdflush/flush/kdmflush processes wake up and check to see if work needs to be done.
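
Both of these settings are in hundredths of a second, which is easy to misread. A one-line conversion, sketched with a hypothetical function name and using the commonly seen defaults of 3000 and 500 as inputs:

```shell
# centisecs_to_seconds CS: convert hundredths of a second to seconds.
centisecs_to_seconds() {
    awk -v cs="$1" 'BEGIN { printf "%g\n", cs / 100 }'
}

centisecs_to_seconds 3000   # prints 30 (a typical vm.dirty_expire_centisecs)
centisecs_to_seconds 500    # prints 5  (a typical vm.dirty_writeback_centisecs)
```

On a live system the input would come from `sysctl -n vm.dirty_expire_centisecs` or `sysctl -n vm.dirty_writeback_centisecs`.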

 

[ Reference sites ]

https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/

https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/