Changju,
I didn't get the traces -- ran out of time. The problem did occur
again though.
This time, node 3 was hosting the master when it stopped responding. I
shut down node 3 and the master moved to node 2, but it was still
unresponsive after loading there. I had to wait for node 3 to come
back, move resources off node 2, and restart node 2. The master then
moved to node 3 and was fine.
Just before this happened, I had defined a new cluster resource
using node 1. I presented the LUN and forced an FC scan on node 1,
followed by a PowerPath command to get it to recognize the newly
discovered LUN.
I then created the cluster resource using nssmu on node 1.
My procedure has been to follow this with the same set of commands on
the other two nodes so that they see the newly presented LUN and its
partition. It seems that while they see the LUN device, they don't
resolve the partition I put on it. No idea why.
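
The post does not name the exact commands used for the rescan. As a rough sketch only, assuming the generic Linux sysfs scan interface (/sys/class/scsi_host/hostN/scan) and PowerPath's "powermt config", the per-node sequence might look like the following. The function names and the SCSI_HOST_DIR override are hypothetical and exist only for illustration and dry runs.

```shell
#!/bin/sh
# Assumption: the HBAs expose the standard sysfs scan files. The directory
# is overridable so the loop can be exercised against a scratch tree.
SCSI_HOST_DIR="${SCSI_HOST_DIR:-/sys/class/scsi_host}"

# Ask every SCSI/FC host adapter to rescan all channels, targets and LUNs.
rescan_scsi_hosts() {
    for scan in "$SCSI_HOST_DIR"/host*/scan; do
        # Skip the unexpanded glob when no host directories exist.
        [ -e "$scan" ] || continue
        # "- - -" is the wildcard for channel, target and LUN.
        echo "- - -" > "$scan"
    done
}

# Let PowerPath pick up newly discovered paths (assumes powermt is installed).
refresh_powerpath() {
    if ! command -v powermt >/dev/null 2>&1; then
        echo "powermt not found" >&2
        return 1
    fi
    powermt config
}
```

On each node this would be run as root, calling rescan_scsi_hosts and then refresh_powerpath. Whether that is enough for the other nodes to resolve the new partition is exactly the open question here, so treat it as a starting point rather than a fix.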
Somewhere during this process, the master stopped responding to
iManager queries and to local cluster commands (e.g., cluster status).
I'm still not sure of a cause/effect relationship here. I'll see if I
can chase it down when I get back on site.
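
One way to pin down when the master stops answering is to probe it on a schedule with a time limit, so a hung command is reported instead of blocking the shell. A minimal sketch, assuming the coreutils "timeout" utility is available on the node (it may be missing on older distributions); probe_with_timeout is a hypothetical helper name.

```shell
#!/bin/sh
# Run a command with a time limit and report whether it answered.
# Usage: probe_with_timeout SECONDS COMMAND [ARGS...]
probe_with_timeout() {
    limit=$1
    shift
    rc=0
    # timeout exits with status 124 when it had to kill the command.
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "hung"
    elif [ "$rc" -eq 0 ]; then
        echo "ok"
    else
        echo "failed"
    fi
}
```

For example, running "probe_with_timeout 10 cluster status" on the node holding the master before and after each rescan step, and logging the result with a timestamp, should show which step (if any) precedes the hang.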
If this is the source of the problem, it's a real pain. Does adding or
expanding a resource really require a reboot of each node in the
cluster? Not a big problem with three nodes, but a major headache with
two 21-node clusters.
Thanks and, as usual, all observations welcome.
Regards,
Don
changju <changju@no-mx.forums.novell.com> wrote:
I wish I could get my hands on your cluster to diagnose the problems.
If possible, please turn on NCS tracing
(echo -n "TRACE ON" > /proc/ncs/cluster) and adminfs debugging
(echo -n "debug" > /admin/adminfs.cmd), then check /var/log/messages
for clues about what might cause the problems.
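
The two switches above can be wrapped in a small function so they are easy to run on whichever node currently holds the master. This is only a sketch: the strings and default paths are the ones quoted in this post, while the function name and the NCS_CTL / ADMINFS_CTL overrides are hypothetical additions that allow a dry run against scratch files.

```shell
#!/bin/sh
# Default control files as given in the post; override for a dry run.
NCS_CTL="${NCS_CTL:-/proc/ncs/cluster}"
ADMINFS_CTL="${ADMINFS_CTL:-/admin/adminfs.cmd}"

# Turn on NCS tracing and adminfs debugging (run as root on the node).
enable_cluster_debug() {
    # printf '%s' writes the string without a trailing newline, like echo -n.
    printf '%s' "TRACE ON" > "$NCS_CTL" || return 1
    printf '%s' "debug" > "$ADMINFS_CTL" || return 1
}
```

After calling enable_cluster_debug, something like "tail -f /var/log/messages" in a second terminal while reproducing the hang should capture whatever the cluster logs at that moment.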
Regards,
Changju
Don Horsfall;1847221 Wrote:
> Hi everyone,
>
> I just built a new 3-node OES2 SP1 Linux cluster. I'm beating it up
> pretty badly building new cluster resources and testing them.
>
> In the process, the master IP service just stops responding. I usually
> see it when iManager is unable to read the cluster status.
>
> It also shows if I do a cluster status (or any other cluster command)
> on the node hosting the master. The command just hangs.
>
> Cluster commands issued from other nodes work fine.
>
> The only fix I've found is to restart the node holding the master IP.
> This, of course, forces the master IP to move to another server and
> everything works fine again.
>
> This is a relatively small cluster. I am also responsible for a dual
> 21-node BCCed cluster currently running beautifully on NetWare. I'm
> going to have to move this to Linux (obviously). I'd prefer not to
> have this kind of problem in that environment.
>
> Has anyone seen this? Any clue what's going on?
>
> Thanks,
>
> Don