Frozen Job in Devcloud
I have a frozen job in DevCloud. The time quota was 6 hours, but the job has been running for more than 62 hours. When I try to kill it with qdel <job id>, I get: qdel: Server could not connect to MOM <job id>. Any idea what to do?
Replies:
Re: Frozen Job in Devcloud
Let me add (for others having the same problem) that the DevCloud team finally cancelled my pending job. As general advice, always include a deadline in your batch jobs to avoid issues with the queueing system in case something strange happens.
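For anyone landing here later, here is a minimal sketch of what "include a deadline" could look like in a Torque/PBS-style job script. It assumes GNU coreutils `timeout` is available on the compute node. The job name, the 350-minute inner limit, the helper name `run_with_deadline`, and `./my_workload` are all made up for illustration. The `walltime` request asks the scheduler to enforce the limit, and the `timeout` wrapper is a second guard inside the script itself.

```shell
#!/bin/bash
#PBS -N deadline_demo
#PBS -l walltime=06:00:00

# Second line of defence: stop the command ourselves slightly before the
# scheduler's walltime, in case the scheduler cannot enforce it.
# run_with_deadline is a hypothetical helper name, not part of PBS.
run_with_deadline() {
    local limit="$1"
    shift
    # Send SIGTERM at the limit, then SIGKILL 30 seconds later if the
    # command is still alive. GNU timeout exits with 124 on a timeout.
    timeout --signal=TERM --kill-after=30s "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "deadline of $limit exceeded: $*" >&2
    fi
    return "$status"
}

# Hypothetical usage inside the job (uncomment and adapt):
# run_with_deadline 350m ./my_workload
```

The same limit can also be requested at submission time with `qsub -l walltime=06:00:00 job.sh`, where `job.sh` is whatever your script is called.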
Replies:
Re: Frozen Job in Devcloud
Thanks Lawrence, I already sent them 2 emails (last Saturday and yesterday), but I have had no response.
Replies:
Re: Frozen Job in Devcloud
The problem is that the node s005-n005, which was running the job, went down (I don't know why), and the queue system has lost control of the job. I cannot log in to s005-n005 because it is not running. Apparently, with admin privileges, the problem could be solved simply by running qdel -p 18216.v-qsvr-fpga.aidevcloud
Replies:
Re: Frozen Job in Devcloud
Let me add: if you post here and don't see a response, try fpgauniversity@intel.com. We have a fairly small team moderating technical inquiries on the FPGA DevCloud, and we don't check the forum frequently. Thanks, Larry
Replies:
Re: Frozen Job in Devcloud
Do you know which server you launched the job from? If so, log back into that server, run ps -auxw to find the job's process, and kill -9 its process ID. Sometimes that kills the job. Make sure you use the walltime construct in batch mode so you don't time out in the future. Thanks, Larry
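A rough sketch of that ps-and-kill approach, in case it helps others. It only applies when the machine running the process is still reachable, which was not the case for the node that went down in this thread. The function names `list_my_processes` and `kill_stale` are made up for this example. It sends SIGTERM first and falls back to SIGKILL (the `kill -9` mentioned above) only if the process is still there after a few seconds.

```shell
#!/bin/bash

# Show the current user's processes with PID and elapsed time, so a
# long-running leftover is easy to spot.
list_my_processes() {
    ps -u "$(id -un)" -o pid,etime,args
}

# kill_stale PID: ask the process to exit, then force it if needed.
# Hypothetical helper; always returns 0, whether the process was
# already gone, exited on SIGTERM, or had to be killed with SIGKILL.
kill_stale() {
    local pid="$1"
    local i
    # If SIGTERM cannot be delivered, the process no longer exists
    # (or is not ours), so there is nothing left to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    for i in 1 2 3 4 5; do
        # kill -0 sends no signal; it only checks the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
    done
    # Still alive after about 5 seconds: SIGKILL cannot be caught.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

Typical use would be to run `list_my_processes`, pick the PID of the stuck process, and pass it to `kill_stale`.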
Replies:
Re: Frozen Job in Devcloud
Hi, I have forwarded your issue to the owner of this DevCloud platform and am waiting to hear back. I will ask them to reply to your post directly. Please give us a couple of days on this. -Hazlina - 2021-03-01