Discussion Closed. This discussion was created more than 6 months ago and has been closed.

Comsol 4.1 Cluster Start Errors


Hi,

I'm trying to get COMSOL 4.1 to work on our Linux cluster (openSUSE 11.1) in order to solve much larger models than I can currently handle on one computer.

I VNC to one of the computers on campus (I'm a grad student) and can successfully boot the mpd daemons. For now I boot only 2 nodes.

I can see that they're active with the "comsol mpd trace" command.
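
For reference, the sequence I'm using to bring up the ring is roughly the following (the exact boot options may differ on other installations, so take this as a sketch rather than exact syntax):

comsol mpd boot -f ~/mpd.hosts   # boot the mpd daemons on the nodes listed in ~/mpd.hosts
comsol mpd trace                 # confirm that the daemons are running on both nodes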

However, when I try to start the COMSOL server with the command "comsol -nn 2 server", I get a whole slew of warnings and errors. Below is what I'm seeing.

Note: if I do "comsol -nn 1 server", it does in fact start on the node my VNC session is on. I thought maybe that would be enough, but I get the same errors in the log window of the COMSOL GUI when I connect to the server running on that one node.

Talking to the computing guys at school (they're stumped too), it's probably an X11 forwarding issue. Specifically, they noted that COMSOL calls ssh with the -x flag rather than -X (lowercase -x actually disables X11 forwarding). I'm looking for a way to get this going.

Any help is much appreciated!!

Thanks! --Matt

Output from "comsol -nn 2 server" command:

----------------------------------------
No Protocol Specified

(Comsollauncher:15553): GLib-GObject-WARNING **: invalid (NULL) pointer instance

(Comsollauncher:15553): GLib-GObject-CRITICAL **: g_signal_connect_data: assertion `G_TYPE_CHECK_INSTANCE (instance)' failed

(Comsollauncher:15553): Gtk-CRITICAL **: gtk_settings_get_for_screen: assertion `GDK_IS_SCREEN (screen)' failed

(Comsollauncher:15553): GLib-GObject-CRITICAL **: g_object_get: assertion `G_IS_OBJECT (object)' failed

(Comsollauncher:15553): GLib-GObject-WARNING **: value "TRUE" of type `gboolean' is invalid or out of range for property `visible' of type `gboolean'

(Comsollauncher:15553): Gtk-CRITICAL **: gtk_settings_get_for_screen: assertion `GDK_IS_SCREEN (screen)' failed

(Comsollauncher:15553): GLib-GObject-CRITICAL **: g_object_get: assertion `G_IS_OBJECT (object)' failed

(Comsollauncher:15553): Gtk-WARNING **: Screen for GtkWindow not set; you must always set
a screen for a GtkWindow before using the window

(Comsollauncher:15553): Gdk-CRITICAL **: gdk_pango_context_get_for_screen: assertion `GDK_IS_SCREEN (screen)' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_context_set_font_description: assertion `context != NULL' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_context_set_base_dir: assertion `context != NULL' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_context_set_language: assertion `context != NULL' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_layout_new: assertion `context != NULL' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_layout_set_text: assertion `layout != NULL' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_layout_set_attributes: assertion `layout != NULL' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_layout_set_alignment: assertion `layout != NULL' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_layout_set_ellipsize: assertion `PANGO_IS_LAYOUT (layout)' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_layout_set_single_paragraph_mode: assertion `PANGO_IS_LAYOUT (layout)' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_layout_set_width: assertion `layout != NULL' failed

(Comsollauncher:15553): Pango-CRITICAL **: pango_layout_get_extents: assertion `layout != NULL' failed

(Comsollauncher:15553): Gtk-CRITICAL **: gtk_icon_theme_get_for_screen: assertion `GDK_IS_SCREEN (screen)' failed

(Comsollauncher:15553): Gtk-CRITICAL **: gtk_settings_get_for_screen: assertion `GDK_IS_SCREEN (screen)' failed

(Comsollauncher:15553): Gtk-CRITICAL **: gtk_icon_size_lookup_for_settings: assertion `GTK_IS_SETTINGS (settings)' failed

(Comsollauncher:15553): Gtk-WARNING **: Invalid icon size 6


(Comsollauncher:15553): Gtk-CRITICAL **: gtk_icon_theme_load_icon: assertion `GTK_IS_ICON_THEME (icon_theme)' failed
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f5107e70bf1, pid=15553, tid=139987637663472
#
# JRE version: 6.0_20-b02
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode linux-amd64 )
# Problematic frame:
# C [libgtk-x11-2.0.so.0+0xffbf1] gtk_icon_set_render_icon+0x721
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid15553.log
#
# If you would like to submit a bug report, please visit:
# java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

3 Replies Last Post 05.05.2011, 17:10 GMT-4
Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 28.04.2011, 22:19 GMT-4
Be sure to be logged into the server node when you issue the server command with COMSOL. Also, your mpd nodes and server nodes must be the same. For example, let's say my cluster is named jim, and I have 5 compute nodes and one head (or control) node. The cluster looks like this:

jim
jim01
jim02
jim03
jim04
jim05

Now, let's say I want to run COMSOL on the compute nodes jim05 and jim04. I first ssh into the head node

ssh jim

I could have logged into the compute node first, but be patient with me.

In my home directory, the mpd.hosts file lists only jim05 and jim04, to make sure I use only the two compute nodes I want. (I could boot more and use fewer.) After booting, the command "comsol mpd trace -l" shows jim05 and jim04 running the MPI daemon.

Next, I ssh into jim05

ssh jim05

From here, I issue "comsol -nn 2 server"

I should then get a response back from comsol server indicating jim05 and jim04 are ready to connect.

Now I start up COMSOL in client mode on the head node jim. I could also do this from another client machine (like my desktop), but the head node has the most reliable network connection to the compute nodes, by definition. Right?

Then, from the COMSOL client, I can click the Connect to Server button either before or after I open my model. I usually do it before, because it is faster that way. Now you are ready to run in parallel.

You can also just run in batch mode from jim05 with a -nn 2 switch once you have a good model that will run.

I would next like to learn how to do all this from the GUI. I just have not taken the time to learn how.
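
Condensed into commands, the whole sequence above looks roughly like this (the hostnames are just from my example, and the exact mpd boot flags may differ on your installation):

# on the head node jim
cat ~/mpd.hosts                  # should list only jim04 and jim05
comsol mpd boot -f ~/mpd.hosts   # start the mpd ring on those nodes
comsol mpd trace -l              # verify jim04 and jim05 are running

# on the first compute node
ssh jim05
comsol -nn 2 server              # prints the port for the client to connect to

# back on the head node (or your desktop): start the client GUI and use Connect to Server
comsol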


Posted: 05.05.2011, 14:28 GMT-4
Hi Jim,

Thanks for your help! It took a while for me to sort everything out, but at least for models in the 1-2 million DoF range, the cluster I've set up works great! Doing what you suggested in the above post pointed me to the subtle error I had in initializing the cluster (I only stumbled on it once I tried the steps you outlined, since I wasn't using the head node in COMSOL client mode).

For anybody else who runs into the errors I posted at the top of this thread:

You're probably working on a cluster where, as a regular user, you have only limited access to system and COMSOL files. When COMSOL initializes in cluster mode, it contacts the other nodes and wants to write temporary and configuration files there.

On the node that you're logged into when you boot the mpd daemons, this initialization is no trouble. However, when it reaches the rest of the compute nodes, I believe you still look like an "other" user as far as Linux permissions are concerned. (At least on the cluster at school, I think this is compounded by several authentication protocols that I'm rather unfamiliar with; password-based ssh is disabled there.)

In this case, COMSOL can't write those configuration files, and the failure shows up as the errors in my first post.

The solution:

Simply start "comsol -nn <nodes> server" with the switch -configuration /somewhere/everybody/has/allpermissions

For example: -configuration /scratch or -configuration /tmp

Also, if you use the GUI to start the cluster run, put the same "-configuration /tmp/orwhatever" in the "post-append" command box in the batch settings.
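
Putting that together, the server start looks something like this on my setup (using /tmp from the example above; any directory that is writable on every node should do):

comsol -nn 2 server -configuration /tmp

When launching from the GUI instead, the same "-configuration /tmp" string goes in the post-append command box under the batch settings.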

This got me up and running...to some extent...

When solving larger models (4-10 million DoF), I'm running into MPI errors such as the following:

[cli_4]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(1175)...: MPI_Allreduce(sbuf=0x7f946f0b5f78, rbuf=0x7f946f0b5f58, count=1, MPI_LONG_LONG_INT, MPI_MAX, comm=0x84000004) failed
MPIR_Allreduce(487)...:
MPIC_Send(39).........:
MPID_Send(180)........: failure occurred while attempting to send an eager message
MPIDU_Sock_writev(585): connection closed by peer (set=0,sock=2,errno=32:(strerror() not found))
...
(it goes on for many more lines like the above)
...

and then finally the log ends with:

...
rank 20 in job 1 ece005.ece.cmu.edu_40638 caused collective abort of all ranks
exit status of rank 20: return code 13
rank 19 in job 1 ece005.ece.cmu.edu_40638 caused collective abort of all ranks
exit status of rank 19: return code 13
rank 18 in job 1 ece005.ece.cmu.edu_40638 caused collective abort of all ranks
exit status of rank 18: return code 13
rank 17 in job 1 ece005.ece.cmu.edu_40638 caused collective abort of all ranks
...
(etc. for the remaining ranks)
...
The job hangs; even after 4 hours it was still sitting at 91% "external process" in the GUI progress window.

I don't suppose anybody has ever seen something like the errors above?

COMSOL support said I might be running out of resources on a node, with MUMPS trying to allocate more memory than the node can handle.

So, if the above doesn't ring a bell: with a maximum of 32 machines running SuSE Linux 11.1 and 16 GB of RAM in total, what upper limit (in DoFs, at least for the RF module) should I expect to be able to solve?

Any more help on this is appreciated. Thanks!!!

--Matt

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist


Posted: 05.05.2011, 17:10 GMT-4
Our RHEL cluster here is set up so that a single set of filesystems serves all the nodes, so we don't run into this problem of having to worry about where the files are relative to each compute node. This seems like something that should be corrected on your cluster.
