CodeVerge.Net Beta


Zone: NEWSGROUP > Novell Forums > novell.support.cluster-services
crivera <criver
NewsGroup User
Randomly problem in 12 nodes cluster.8/5/2008 3:56:02 PM


Hi!

We have 5 clusters:
.- DNS/DHCP Cluster (2 nodes)
.- BorderManager Cluster (3 nodes)
.- Zenworks 7 suite Cluster (3 nodes)
.- GroupWise Cluster (8 nodes)
.- File & Print Cluster (12 nodes)

All nodes have 10 GB of RAM, 4 Intel Xeon MP processors @ 3.00 GHz, dual
HBAs (QLogic) and dual NICs (load balancing and fault tolerance).

All nodes run NetWare 6.5 SP6 with all updates and Novell Cluster
Services 1.8.0.

All clusters are attached to an HP EVA 4000 SAN.

From time to time, a random node of the File & Print Cluster reports a
comatose state without further details.

The rest of the clusters work perfectly.

I suspect the size of the cluster is a factor, but I cannot find
specific information about how to tune large clusters.

These are its current values:

Quorum Triggers          Value
Membership (#nodes)      12
Timeout (secs)           60

Protocol Settings        Value
Heartbeat (secs)         2
Tolerance (secs)         32
Master Watchdog (secs)   2
Slave Watchdog (secs)    32
Max Retransmits          30

Any recommendation or suggestion will be welcome.

Thanks


--
crivera
------------------------------------------------------------------------
crivera's Profile: http://forums.novell.com/member.php?userid=16009
View this thread: http://forums.novell.com/showthread.php?t=338836

Andrew C Taubma
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/5/2008 11:44:42 PM

On 06/08/08 crivera wrote:
> Tolerance (secs) 32
> Slave Watchdog (secs) 32

These are too high; Client32 times out after 30 seconds, so users will
disconnect and have to manually map drives. Drop to 16. Why were the
others changed from 1 to 2 seconds?
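
As a rough sketch of that check, the snippet below compares the protocol settings posted at the top of the thread against the 30-second Client32 timeout mentioned here. The function name `settings_exceeding_client_timeout` and the dictionary layout are made up for illustration; they are not part of any Novell tool.

```python
# Hypothetical helper; the values are the ones posted in this thread.
CLIENT32_TIMEOUT_SECS = 30  # Client32 timeout as stated in the reply above

posted_settings = {
    "Heartbeat": 2,
    "Tolerance": 32,
    "Master Watchdog": 2,
    "Slave Watchdog": 32,
}


def settings_exceeding_client_timeout(settings, client_timeout=CLIENT32_TIMEOUT_SECS):
    """Return the names of settings (in seconds) at or above the client timeout."""
    return sorted(name for name, secs in settings.items() if secs >= client_timeout)


print(settings_exceeding_client_timeout(posted_settings))
```

With Tolerance and Slave Watchdog dropped to 16 as suggested, the function would return an empty list.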

But that has nothing to do with your problem. What LAN driver, date and
version, do the nodes have?

You say "Eventually, some node of File& Print Cluster (randomly)
reports comatose state without more details", but nodes don't get
comatose states, clustered resources do. And it is almost always due to
faulty load or unload scripts. Can we see those scripts for the problem
resources, please?
--
Andrew C Taubman
Novell Support Forums Volunteer SysOp
http://forums.novell.com/
(Sorry, support is not provided via e-mail)

Opinions expressed above are not
necessarily those of Novell Inc.
crivera <criver
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/6/2008 3:46:01 PM


Hi!
Thanks Andrew for your reply.

We changed the values on Novell's recommendation for the old File & Print
Cluster, which had 8 nodes running NetWare 6.0 SP5, but I will try dropping
them to 16 and test it.

The servers have two dual-port NICs:

HP NC360T PCIe DP Gigabit Server Adapter.
HP NC371i Multifunction Gigabit Server Adapter

Lan drivers and versions:
N1000E.LAN
Loaded from [C:\NWSERVER\DRIVERS\] on Aug 1, 2008 3:47:35 pm
(Address Space = OS)
HP NC-Series Intel N1E Ethernet driver
Version 10.38 February 4, 2007
Copyright 1998, 2007 Hewlett-Packard Development Company, L.P.
BX2.LAN
Loaded from [C:\NWSERVER\DRIVERS\] on Aug 1, 2008 3:47:35 pm
(Address Space = OS)
Broadcom NetXtreme II Gigabit Ethernet Driver
Version 3.41 April 30, 2007
Copyright (c) 2002 Broadcom Corporation. All rights reserved.
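
For anyone collecting this information across many nodes, here is a small sketch that pulls the driver name and version out of a listing like the one above. It assumes the text has been saved from the console in exactly this layout; the regular expressions and the function name `parse_lan_modules` are illustrative only.

```python
import re

# Sample text copied from the listing above (backslashes escaped for Python).
listing = """\
N1000E.LAN
Loaded from [C:\\NWSERVER\\DRIVERS\\] on Aug 1, 2008 3:47:35 pm
(Address Space = OS)
HP NC-Series Intel N1E Ethernet driver
Version 10.38 February 4, 2007
Copyright 1998, 2007 Hewlett-Packard Development Company, L.P.
BX2.LAN
Loaded from [C:\\NWSERVER\\DRIVERS\\] on Aug 1, 2008 3:47:35 pm
(Address Space = OS)
Broadcom NetXtreme II Gigabit Ethernet Driver
Version 3.41 April 30, 2007
Copyright (c) 2002 Broadcom Corporation. All rights reserved.
"""


def parse_lan_modules(text):
    """Map each *.LAN module name to a (version, date) tuple."""
    modules = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if re.fullmatch(r"[\w-]+\.LAN", line):
            current = line  # start of a new module entry
            continue
        match = re.fullmatch(r"Version (\d+(?:\.\d+)*) (.+)", line)
        if match and current:
            modules[current] = (match.group(1), match.group(2))
    return modules


print(parse_lan_modules(listing))
```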

You are right: it is the resources that report the comatose state, and
the node hosting them eats the poison pill and reports an abend. But the
comatose state occurs randomly, with no logical pattern.

Sorry for the mistake in my post.

I checked the load and unload scripts and there are apparently no
problems with the load or unload timeouts, with the IP addresses,
or with protected memory.

Thanks.


--
crivera

Andrew C Taubma
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/6/2008 11:01:55 PM

OK, the poison pill triggering abends confirms you have a bad LAN
problem here. Which of those two drivers does the heartbeat go over? If
it's the BX2 one, that's probably your problem, along with the excessive
timeouts already mentioned.

Is the heartbeat on a separate LAN to the user connections? If so, that's
an issue too. But it is very concerning that when a server abends, the
resources do not fail over but go comatose; they should happily mount on
the next node.

In ConsoleOne go to the Cluster object and export the HTML report, and
attach that here please - put [FILE] in the title of that post to allow
the attachment.
--
Andrew C Taubman
crivera <criver
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/7/2008 3:26:02 PM


> Which of those two drivers does the heartbeat go over?
The primary NIC uses BX2.LAN. The heartbeat goes over this NIC.

> Is the heartbeat on a separate LAN to the user connections?
No, it isn't. The heartbeat runs on the same LAN as the users.

Each NIC is attached to one of two different Cisco Catalyst 6509E switches.

I am sending you the HTML cluster report (in PDF format).
I hope this helps.

Thanks again.


+-------------------------------------------------------------------+
|Filename: C_ARCHIVO.pdf |
|Download: http://forums.novell.com/attachment.php?attachmentid=1433|
+-------------------------------------------------------------------+

--
crivera

Andrew C Taubma
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/7/2008 10:56:27 PM

Hmm, well there are a few oddities in there but nothing that should cause
what you see ... did you say the BX2 is a dual card set up to do load
balancing/fault tolerance? If so, try disabling that; it's actually not
necessary on a cluster. And I see you haven't altered the timeouts yet,
so please do that too.

You mention two Ciscos, how does that work? Are 6 nodes attached to one
and 6 to the other? Does the problem happen on resources mounted on nodes
attached to a particular one of those?
--
Andrew C Taubman
crivera <criver
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/8/2008 6:56:02 PM


Thanks Andrew for your help.

We are making changes one at a time and monitoring the cluster after
each one.

The first change was to remove the NIC balancing, and we are now
monitoring the cluster.

If the problem recurs, we will modify the timeouts.

Each node of the cluster has 2 dual-port NICs:
.- N1000E
.- BX2

One port of the N1000E NIC is connected to one Cisco. One port of the other
NIC (BX2) is connected to the other Cisco.
In other words, all 12 nodes are connected to both Ciscos.

The problem happens on different nodes and with different resources.

I agree with you that the problem may be in the LAN, but I can't
prove it for the moment. Also, I have 4 more clusters that work
without problems; only this 12-node cluster has this problem.


--
crivera

Tim Heywood NSC
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/11/2008 4:47:15 PM

On Fri, 08 Aug 2008 18:56:02 +0000, crivera wrote:

> Thanks Andrew for you help.
>
> We are made changes one at time (one by one) and we are monitoring the
> cluster.
>
> The first change has been to eliminate the NIC balancing and we are
> monitoring the cluster.
>
> If the problem is repeated we will modify the timeouts.
>
> Each node of the cluster have 2 dual NICs. .- N1000E
> .- BX2
>
> A port of the N1000E NIC is connected to the Cisco. A port of the other
> NIC (BX2) is connected to the other Cisco. In other words, the 12 nodes
> are connected to the 2 Cisco.
>
> The problem happen in different nodes and with different resources.
>
> I agree with you, may be the problem is in the LAN, but I can´t probe...
> by the moments. Also, I have 4 cluster more and it works without
> problems, only this cluster with 12 nodes have this problem.


The BX2.LAN driver is a known problem - 3.41 isn't the worst (do not
use 3.70), but Broadcom has released a new version, 4.41, with some help
from Novell. You can get the new driver from:
http://www.broadcom.com/support/ethernet_nic/netxtremeii.php


One of the problems with the BX2 is that it does not reset properly (or
the driver does not reset the card), so the larger the cluster, the
more likely the issue.

--
Tim
___________________
Tim Heywood (SYSOP)
NDS8
Scotland
(God's Country)
www.nds8.co.uk
___________________

In theory, practice and theory are the same
In Practice, they are different
crivera <criver
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/11/2008 8:46:01 PM


Thanks Tim

We are deploying a lab to test the BX2.LAN version 4.41.

Three weeks ago, we tested BX2.LAN version 3.70, but the problems
continued. We decided to go back to the previous version, 3.41.

Regards.


--
crivera

Andrew C Taubma
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/12/2008 11:56:39 PM

.... or you could just ditch them and use two Intels instead ...
--
Andrew C Taubman
Tim Heywood NSC
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/13/2008 1:16:30 PM

On Mon, 11 Aug 2008 20:46:01 +0000, crivera wrote:

> Thanks Tim
>
> We are deploying a lab to test the BX2.LAN version 4.41.
>
> Tree weeks ago, we tested the BX2.LAN version 3.70, but the problems
> continued. We decide to put again the previous the 3.41 version.
>
> Regards.

3.70 was nasty - caused lots of Ethertsm.nlm abends...

4.41 has proven good so far at 5 sites that I have on it, the latest
iteration of that driver being what is on the web.

HTH

T



--
Tim
crivera <criver
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/14/2008 2:16:02 PM


Andrew said ".. or you could just ditch them and use two Intels instead
..."

It is a good idea ;-) ... We have 56 servers with the same
configuration.

Tim,
we are testing version 4.41; I hope this new driver will be more
stable.

Since we disabled the balancing, the cluster has been more stable, but
we are still running labs and testing drivers, cluster parameters, switch
configuration and more.

Thanks Andrew and thanks Tim for your help and support.


--
crivera

crivera <criver
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/25/2008 8:46:03 PM


Hi.
Several weeks ago...

Driver version 4.41 reports a lot of abends in the Ethertsm
module.

When the cluster is under heavy load, the nodes report abends.

We decided to go back to the previous version, 3.41, still without
the NIC balancing.

:-(


--
crivera

wiegandb <burkh
NewsGroup User
Re: Randomly problem in 12 nodes cluster.8/26/2008 10:39:45 PM

Hello,

On all four of our two-node NW6.5 SP6 clusters we use version 2.86 of
BX2.LAN, and it works fine for us. One of the clusters (the NFS and FTP
cluster) was recently updated to SP7 with version 4.41a of BX2.LAN and has
been stable in testing.
We also had abends in ethertsm.nlm with versions 3.70 and 4.41,
especially when copying from one NFS mount to another; the abends were
reproducible with those versions but not with 2.86 or 4.41a.
We also found that version 3.41 of BX2.LAN sends the poison pill to the
"wrong" server if the network cables are pulled on the server holding
the master IP address resource.
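
To keep the BX2.LAN reports from this thread in one place, here is a small lookup sketch. The entries only restate what posters said above; they are anecdotal, not an official compatibility list, and the names `BX2_REPORTS` and `report_for` are made up for illustration.

```python
# Anecdotal reports collected from this thread only; not vendor guidance.
BX2_REPORTS = {
    "2.86": "works fine on four two-node NW6.5 SP6 clusters (wiegandb)",
    "3.41": "poison pill sent to the 'wrong' server when cables are pulled "
            "on the node holding the master IP address resource (wiegandb)",
    "3.70": "ethertsm.nlm abends (Tim Heywood, wiegandb)",
    "4.41": "good at 5 sites (Tim Heywood); ethertsm.nlm abends (crivera, wiegandb)",
    "4.41a": "stable in testing on SP7 (wiegandb)",
}


def report_for(version):
    """Return what the thread says about a BX2.LAN version, if anything."""
    return BX2_REPORTS.get(version, "no report in this thread")


for version in ("3.41", "4.41a", "5.00"):
    print(version, "->", report_for(version))
```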

Regards

Burkhard Wiegand
Netware-Admin
Debeka-Versicherungen
56068 Koblenz
All Times Are GMT