Cassandra node can’t complete joining operation

I am trying to add a new node to an existing C* 2.1.11 cluster. The node appears to have completed the streaming phase of the bootstrap, but it has not moved out of the JOINING state and I cannot find an explanation why. The Cassandra logs on all nodes show no errors for the entire streaming process.
nodetool status, run from any node, reports the new node as UJ, and its load is greater than that of the other nodes:
# nodetool status
Datacenter: us-east-vpc
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns  Host ID            Rack
UN  xx.xx.xx.78   564.96 GB  256     ?     xxxx-f3c7d9d40e92  1d
UN  xx.xx.xx.110  534.63 GB  256     ?     xxxx-9419faa478ca  1a
UN  xx.xx.xx.171  557.13 GB  256     ?     xxxx-7a5b2723e438  1a
UN  xx.xx.xx.203  406.98 GB  256     ?     xxxx-1331d9c44992  1a
UN  xx.xx.xx.26   579.55 GB  256     ?     xxxx-88b202a8cedc  1c
UN  xx.xx.xx.122  603.39 GB  256     ?     xxxx-b0b81ebabeb2  1d
UN  xx.xx.xx.233  565.3 GB   256     ?     xxxx-a2fa9ad67741  1c
UJ  xx.xx.xx.56   881.91 GB  256     ?     xxxx-9863c7799fad  1d

nodetool netstats shows no activity on the other nodes. On the new node it shows the bootstrap session, but with an empty list of files to transfer:
# nodetool netstats
Mode: JOINING
Bootstrap xxxx-8d0c340f238b
/xx.xx.xx.233
/xx.xx.xx.122
/xx.xx.xx.171
/xx.xx.xx.78
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed
Commands n/a 0 50
Responses n/a 0 64941

nodetool info is throwing an error while trying to retrieve the token range information:
# nodetool info
ID : xxxx-9863c7799fad
Gossip active : true
Thrift active : false
Native Transport active: false
Load : 881.91 GB
Generation No : 1475450119
Uptime (seconds) : 12081
Heap Memory (MB) : 1480.71 / 1996.00
Off Heap Memory (MB) : 204.47
Data Center : us-east-vpc
Rack : 1d
Exceptions : 2
Key Cache : entries 3262, size 788.43 KB, capacity 99 MB, 43 hits, 3249 requests, 0.013 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 49 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
error: null
-- StackTrace --
java.lang.AssertionError
at org.apache.cassandra.locator.TokenMetadata.getTokens(TokenMetadata.java:474)
at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2263)
at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2252)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:83)
at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:206)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:647)
at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1445)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:639)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:324)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Any help will be greatly appreciated.
EDIT Oct 3
We found that the instance was running out of disk space; eventually we got an error saying there was not enough space to complete compactions. We expanded the partition and cleared the /data folder to restart the bootstrap from scratch. With the expanded disk the streaming completed, but the node still cannot move from UJ to UN. There are no errors in the logs, nodetool tpstats shows no pending tasks, and nodetool netstats reports no pending activity, with the same bootstrap UUID:
# nodetool netstats
Mode: JOINING
Bootstrap xxxx-8d0c340f238b
/xx.xx.xx.233
/xx.xx.xx.122
/xx.xx.xx.171
/xx.xx.xx.78
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed
Commands n/a 0 130
Responses n/a 0 256088

The question of why the load on that node increased also remains open.

Solutions/Answers:

Solution 1:

As there were no errors reported, and the streaming process was done, we assumed that the node was ready to join the cluster.
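
For anyone who wants to script the check for a node stuck in UJ, here is a minimal sketch. It assumes the column layout of the nodetool status output shown in the question (state flags in the first column, address in the second). joining_nodes is a hypothetical helper name, not part of nodetool.

```shell
# Hypothetical helper: read `nodetool status` output on stdin and print the
# address of every node whose state flags are "UJ" (Up/Joining).
# Assumes flags in column 1 and the address in column 2, as in the question.
joining_nodes() {
    awk '$1 == "UJ" { print $2 }'
}
```

Used as `nodetool status | joining_nodes`, it should print nothing once every node has reached UN.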

We added the auto_bootstrap: false directive to the cassandra.yaml file and restarted the service on the node, after which it joined the cluster.
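
The config change can be sketched as a small shell helper. This is an illustration under assumptions, not the exact commands we ran: set_auto_bootstrap_false is a hypothetical name, and the path to cassandra.yaml depends on the install, so it is passed as an argument. Keep in mind that with auto_bootstrap disabled the node skips streaming; it was acceptable here only because the data had already been streamed.

```shell
# Hypothetical helper: make sure "auto_bootstrap: false" is set in the given
# cassandra.yaml. The file path is an argument because it varies by install.
set_auto_bootstrap_false() {
    yaml="$1"
    if grep -q '^auto_bootstrap:' "$yaml"; then
        # The directive exists: rewrite it (temp file instead of sed -i,
        # which is not portable across sed implementations).
        sed 's/^auto_bootstrap:.*/auto_bootstrap: false/' "$yaml" > "$yaml.tmp" \
            && mv "$yaml.tmp" "$yaml"
    else
        # The directive is absent: append it. The leading newline guards
        # against a file that does not end with one.
        printf '\nauto_bootstrap: false\n' >> "$yaml"
    fi
}
```

After running it against the node's cassandra.yaml, restart Cassandra in whatever way the service is managed on that host.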

After the node joined the cluster, a full repair and a cleanup were executed.
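
A rough sketch of those two steps follows. The function name post_join_maintenance is hypothetical, and the answer does not record which nodes each command was run on, so treat the placement as an assumption. On 2.1, to my understanding, a plain nodetool repair is already a full (non-incremental) repair.

```shell
# Hypothetical wrapper for the two maintenance steps mentioned above.
# cleanup is only started if repair exits successfully.
post_join_maintenance() {
    # Assumption: on 2.1, repair without flags is a full repair.
    nodetool repair || return 1
    # Remove data for token ranges the node does not own.
    nodetool cleanup
}
```

Both commands can take a long time on nodes holding several hundred GB, so it is probably worth running them inside screen or tmux.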
