We ran into a situation at work where our Cassandra cluster's file system was filling up fast. In the end, there was not enough room for Cassandra to complete any compactions, as there was simply no space left in the /data file system. If you find yourself in a situation like this, you can manually run User Defined Compactions (UDCs) on specific data files, rather than letting the system attempt a major compaction and run out of space part-way through.
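
Before going further, it is worth checking how much headroom the data file system actually has left. A quick sanity check, assuming your data directory lives under /data as it does on this cluster:

df -h /data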

I will show how it’s done on my Google Cloud cluster. First things first: make sure your cluster is up and running and you have the movielens schema installed. If you are not at this step yet, visit my “Getting Started” section and follow those blog posts.

Step 1

In order to access Cassandra over JMX, we need to get the jmxterm client, which will allow us to run a UDC. Download it as follows; note that by the time you read this, the jmxterm version may be different.

[cassandra@cass-node-1 .ssh]$ cd /opt
[cassandra@cass-node-1 opt]$ ll
total 0
drwxr-xr-x. 3 cassandra cassandra 37 Aug 16 16:34 cass
[cassandra@cass-node-1 opt]$ mkdir jmxterm
[cassandra@cass-node-1 opt]$ ll
total 0
drwxr-xr-x. 3 cassandra cassandra 37 Aug 16 16:34 cass
drwxrwxr-x. 2 cassandra cassandra  6 Sep 18 13:36 jmxterm
[cassandra@cass-node-1 opt]$ cd jmxterm
[cassandra@cass-node-1 jmxterm]$ wget https://superb-sea2.dl.sourceforge.net/project/cyclops-group/jmxterm/1.0-alpha-4/jmxterm-1.0-alpha-4-uber.jar

Personally, I like to add an alias to my .bash_profile so I can run the jmxterm client from anywhere.

alias JMXTERM='java -jar /opt/jmxterm/jmxterm-1.0-alpha-4-uber.jar'
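
To make the alias survive new shells, append it to your profile and reload it. A minimal sketch, assuming the default ~/.bash_profile location:

echo "alias JMXTERM='java -jar /opt/jmxterm/jmxterm-1.0-alpha-4-uber.jar'" >> ~/.bash_profile
source ~/.bash_profile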

Step 2

Navigate to your data_file_directories location and check the size of your keyspaces on disk. We are looking for large files that we can compact to free up space.


cd /data/cassandra/data

[cassandra@cass-node-1 data]$ du -sh *
17M     movielens
1.1M    reaper_db
1.4M    system
52K     system_auth
108K    system_distributed
856K    system_schema
8.0K    system_traces

As you can see, my movielens keyspace is the largest at 17M (huge!!), so this is where I will perform my compactions.

Step 3

Navigate into the data directory for movielens and check the size of each column family (table), sorted by size.

[cassandra@cass-node-1 data]$ cd movielens/
[cassandra@cass-node-1 movielens]$ du -sh * | sort -h
8.0K    movies-90ffd110a17511e8bd7b0de73f1fad1f
8.0K    original_movie_map-408fceb0a20611e890116d9c24d5c4be
8.0K    original_movie_map-48b62740a45611e890116d9c24d5c4be
8.0K    original_movie_map-55da4d30a16c11e8b341c1915b5c3cd5
8.0K    original_movie_map-92753300a17511e8bd7b0de73f1fad1f
8.0K    original_movie_map-feec3a60a17511e8bd7b0de73f1fad1f
8.0K    ratings_by_movie-57248070a16c11e8b341c1915b5c3cd5
8.0K    ratings_by_movie-93fb84e0a17511e8bd7b0de73f1fad1f
8.0K    ratings_by_user-548eba60a16c11e8b341c1915b5c3cd5
8.0K    ratings_by_user-91d87dd0a17511e8bd7b0de73f1fad1f
8.0K    users-533fa520a16c11e8b341c1915b5c3cd5
8.0K    users-91580330a17511e8bd7b0de73f1fad1f
44K     movies-51edd0c0a16c11e8b341c1915b5c3cd5
132K    users-3fe8e050a20611e890116d9c24d5c4be
132K    users-480eeac0a45611e890116d9c24d5c4be
132K    users-fdb3ba60a17511e8bd7b0de73f1fad1f
200K    movies-477dce50a45611e890116d9c24d5c4be
208K    movies-3f89d060a20611e890116d9c24d5c4be
208K    movies-fd5a76d0a17511e8bd7b0de73f1fad1f
1.9M    ratings_by_movie-00ac8800a17611e8bd7b0de73f1fad1f
1.9M    ratings_by_movie-40d3b580a20611e890116d9c24d5c4be
1.9M    ratings_by_movie-49445d80a45611e890116d9c24d5c4be
3.3M    ratings_by_user-402dd890a20611e890116d9c24d5c4be
3.3M    ratings_by_user-483fbec0a45611e890116d9c24d5c4be
3.4M    ratings_by_user-fe615f80a17511e8bd7b0de73f1fad1f

Here are all the sstables on disk. The ratings_by_user table is the largest, so let’s compact that one.

Step 4

Navigate to the ratings_by_user directory with the data files:

[cassandra@cass-node-1 movielens]$ cd ratings_by_user-483fbec0a45611e890116d9c24d5c4be
[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ ll
drwxrwxr-x. 2 cassandra cassandra       6 Aug 20 08:51 backups
-rw-rw-r--. 1 cassandra cassandra     763 Sep 19 09:52 mc-5-big-CompressionInfo.db
-rw-rw-r--. 1 cassandra cassandra 3373648 Sep 19 09:52 mc-5-big-Data.db
-rw-rw-r--. 1 cassandra cassandra      10 Sep 19 09:52 mc-5-big-Digest.crc32
-rw-rw-r--. 1 cassandra cassandra    1192 Sep 19 09:52 mc-5-big-Filter.db
-rw-rw-r--. 1 cassandra cassandra   21328 Sep 19 09:52 mc-5-big-Index.db
-rw-rw-r--. 1 cassandra cassandra    7483 Sep 19 09:52 mc-5-big-Statistics.db
-rw-rw-r--. 1 cassandra cassandra     288 Sep 19 09:52 mc-5-big-Summary.db
-rw-rw-r--. 1 cassandra cassandra      92 Sep 19 09:52 mc-5-big-TOC.txt

The file we are interested in is “mc-5-big-Data.db”. Imagine a scenario where this table holds lots of tombstones; a compaction on this file would remove any droppable ones and free up some much-needed space.
Note down the full path to this file:

/data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be/mc-5-big-Data.db
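
If you want evidence that the file really contains droppable tombstones before compacting it, the sstablemetadata tool that ships with Cassandra can report an estimate. A sketch, assuming the tool is on your PATH (in a tarball install it lives under tools/bin):

sstablemetadata mc-5-big-Data.db | grep -i tombstone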

Step 5

On a large cluster it’s a good idea to set the compaction throughput to 0, which removes the throttle and lets our compaction run as fast as possible.

[cassandra@cass-node-1 ]$ nodetool getcompactionthroughput
Current compaction throughput: 16 MB/s
[cassandra@cass-node-1 ]$ nodetool setcompactionthroughput 0
[cassandra@cass-node-1 ]$ nodetool getcompactionthroughput
Current compaction throughput: 0 MB/s
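
Remember to put the throttle back once the compaction has finished, otherwise background compactions can hog disk I/O; 16 MB/s is the default we saw above:

nodetool setcompactionthroughput 16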

Step 6

Next, we start up jmxterm and use it to access the MBeans that let us compact this file.
You can read more about jmxterm on its SourceForge project page.

If you have added the alias to your bash_profile, then all you need to do is type “JMXTERM”.
If not, then run the following:

[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ java -jar /opt/jmxterm/jmxterm-1.0-alpha-4-uber.jar
Welcome to JMX terminal. Type "help" for available commands.
$>

Step 7

We now need to connect to the Cassandra JVM. To do this we run the following, with 7199 being the JMX port as set in the cassandra-env.sh file:

$>open localhost:7199
#Connection to localhost:7199 is opened
$>
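
If you are not sure which port your node uses, the setting lives in cassandra-env.sh. A quick check, assuming a package-style install path (adjust for your own layout):

grep JMX_PORT /etc/cassandra/cassandra-env.sh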

Step 8

We need to set the domain we want to use. The domain we need for User Defined Compaction (UDC) is org.apache.cassandra.db.

$>domain org.apache.cassandra.db
#domain is set to org.apache.cassandra.db
$>

Step 9

Within the org.apache.cassandra.db domain we need to select the correct bean. We are going to set “bean org.apache.cassandra.db:type=CompactionManager”.

$>bean org.apache.cassandra.db:type=CompactionManager
#bean is set to org.apache.cassandra.db:type=CompactionManager
$>
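
If you want to confirm the operation exists before calling it, jmxterm’s built-in “info” command lists the attributes and operations of the currently selected bean; forceUserDefinedCompaction should appear among the operations:

$>info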

Step 10

Now we are set up to run the UDC on the sstable file we found earlier.
We run “run forceUserDefinedCompaction” with the full path as the argument:

$>run forceUserDefinedCompaction /data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be/mc-5-big-Data.db
#calling operation forceUserDefinedCompaction of mbean org.apache.cassandra.db:type=CompactionManager
#operation returns: 
null
$>

jmxterm does not give us much feedback. To confirm that the compaction ran successfully, we check the data directory.

[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ pwd
/data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be
[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ ls -l *Data*
-rw-rw-r--. 1 cassandra cassandra 3373648 Sep 19 10:30 mc-6-big-Data.db

Notice that the sstable filename has changed from mc-5-big-Data.db to mc-6-big-Data.db. On a larger system with tombstones to drop, you would also see that the file size has reduced.
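
On a big table the compaction will take a while; you can watch its progress from another terminal with nodetool:

nodetool compactionstats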

That’s it: it’s easy to perform a User Defined Compaction on an sstable. You can also compact a number of sstable files together, so let’s do that quickly.

Compact Multiple Files

First we need to create another sstable on disk for our ratings_by_user column family. We do this by starting cqlsh and deleting some data, which, once flushed, will generate another sstable for us.

[cassandra@cass-node-1 ~]$ cqlsh cass-node-1
Connected to Phils-Cool-Cluster at cass-node-1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> use movielens ;
cqlsh:movielens> select * from ratings_by_user limit 10;

 user_id                              | movie_id                             | name                       | rating | ts
--------------------------------------+--------------------------------------+----------------------------+--------+-----------
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 003e4d1c-48d7-4952-8ebd-2da3d6ca2e1a |  This Is Spinal Tap (1984) |      5 | 877488546
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 04aedc1b-40f4-473e-ac24-87b0b3c83521 |                Babe (1995) |      3 | 881288574
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 0867adae-b16e-4b93-9a18-e84dccdfff9f |          Unforgiven (1992) |      4 | 881288620
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 0c33ab47-28db-43bd-b325-75a07398ce7d |     Full Monty, The (1997) |      4 | 877487931
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 0fe516e4-1b97-4594-b040-6baf9db5202a | Usual Suspects, The (1995) |      5 | 881288233
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 10ea8a1a-f84d-4e7e-8cc3-92bfe5aef03e | Blues Brothers, The (1980) |      3 | 877488366
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 1def3464-44a7-4a38-8c66-c259a25653fb |       Red Rock West (1992) |      4 | 881288234
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 1e5e1c29-5de7-43c3-b487-8b1fb6817afd |             Ed Wood (1994) |      4 | 877488492
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 21b98309-f0b9-4d8b-861c-3057a5fa6450 |       Shallow Grave (1994) |      5 | 881288265
 b52fcdfc-0eaf-4432-9896-aa22db56edb2 | 2caec11d-80b6-447a-8ac0-fec9ab9bccdd |          Highlander (1986) |      3 | 881288386

(10 rows)
cqlsh:movielens> delete from ratings_by_user where user_id = b52fcdfc-0eaf-4432-9896-aa22db56edb2;
cqlsh:movielens> 
cqlsh:movielens> select * from ratings_by_user where user_id = b52fcdfc-0eaf-4432-9896-aa22db56edb2;

 user_id | movie_id | name | rating | ts
---------+----------+------+--------+----

(0 rows)
cqlsh:movielens> 
cqlsh:movielens> exit
[cassandra@cass-node-1 ~]$ 
[cassandra@cass-node-1 ~]$ nodetool flush

We run “nodetool flush” so that the data in the memtables is written to disk as sstables.
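
Note that nodetool flush also accepts keyspace and table arguments, so you can flush just the one memtable rather than everything on the node:

nodetool flush movielens ratings_by_user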

Now go back to the data directory where the sstables are located.

[cassandra@cass-node-1 ~]$ cd /data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be
[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ 
[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ ls -l *Data*
-rw-rw-r--. 1 cassandra cassandra 3373648 Sep 19 10:30 mc-6-big-Data.db
-rw-rw-r--. 1 cassandra cassandra      41 Sep 19 11:19 mc-7-big-Data.db
[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ 

We now have two sstables on disk. We can run jmxterm again and compact these sstables together manually.
The file names are:

/data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be/mc-6-big-Data.db
/data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be/mc-7-big-Data.db

Once compacted, we should see a single sstable called mc-8-big-Data.db.
To compact more than one file, list the files separated by commas.

[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ JMXTERM
Welcome to JMX terminal. Type "help" for available commands.
$>open localhost:7199
#Connection to localhost:7199 is opened
$>domain org.apache.cassandra.db
#domain is set to org.apache.cassandra.db
$>bean org.apache.cassandra.db:type=CompactionManager
#bean is set to org.apache.cassandra.db:type=CompactionManager
$>run forceUserDefinedCompaction /data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be/mc-6-big-Data.db,/data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be/mc-7-big-Data.db
#calling operation forceUserDefinedCompaction of mbean org.apache.cassandra.db:type=CompactionManager
#operation returns: 
null
$>
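
Typing those long paths interactively is error-prone, and jmxterm can also run non-interactively, so the whole thing can be scripted. A hypothetical sketch of the same compaction as a one-shot command, using jmxterm’s -l (connection URL) and -n (non-interactive) options, with the run command’s -b flag selecting the bean directly:

DIR=/data/cassandra/data/movielens/ratings_by_user-483fbec0a45611e890116d9c24d5c4be
echo "run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction $DIR/mc-6-big-Data.db,$DIR/mc-7-big-Data.db" | \
  java -jar /opt/jmxterm/jmxterm-1.0-alpha-4-uber.jar -l localhost:7199 -n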

Now we check the data directory again to see which sstables are on disk.

[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ ls -l *Data*
-rw-rw-r--. 1 cassandra cassandra 3375666 Sep 19 11:23 mc-8-big-Data.db
[cassandra@cass-node-1 ratings_by_user-483fbec0a45611e890116d9c24d5c4be]$ 

As expected, we now have one sstable where there were two before.

Conclusion

User Defined Compaction can be a very useful tool in your arsenal when you are quickly running out of space and cannot afford a major compaction.
