Comsol Cluster
Posted 08.05.2011, 01:02 GMT-4 Studies & Solvers Version 4.1 4 Replies
Hello,
I've been trying to get COMSOL to run on our Linux cluster to solve some larger models (ideally 6-12 MDoFs). I'm at a point, however, where I could use some advice.
Any model I can solve on 1 computer I can currently also solve using more than 1 node (so far I've tried up to 20 nodes out of the 30 available). The trouble starts when I try the larger problems I'm really shooting for.
I'll list the things that have me puzzled in order of how straightforward they are (I think ;) ...). I'm using COMSOL 4.1 on a Linux cluster. The machines have identical hardware (specifically, 16 GB RAM each).
1. In my choice of solvers, is PARDISO usable for a shared memory implementation? Reading other posts here, it comes up in a cluster context, but I'm not sure if that's OK for shared memory. As far as I know, COMSOL says only MUMPS and SPOOLES work for this.
2. I once solved a problem with 1.7 MDoFs on one machine. It took about 48 hours, but it completed nonetheless (I'm fine with waiting that long). I tried a similar model with 3.7 MDoFs on 20 nodes, but to no avail. The solvers (I tried both MUMPS and SPOOLES) get close to finishing, at ~90% completion, and then the log file starts printing some confusing MPI errors (exit status of rank 18: return code 13....rank 17 in job 1 ece005.ece.cmu.edu_40638 caused collective abort of all ranks).
That ece005.ece.cmu.edu is the node I launch the server on. It always seems to be more active than any of the other compute nodes, and it seems to be the troublemaker of the group, since the log names it in what I assume is the root cause of the job quitting.
Also, monitoring it with the top command in Linux, it's the only compute node showing more than 100% processor usage ALL the time (the others jump up to >1000% only during matrix assembly and other portions of the solution process where more cores can be utilized).
I'm just kinda having a hard time swallowing the thought that, while one computer can solve 1.7 MDoFs, 20 nodes cannot complete a problem with only a bit more than double the DoFs (3.7 / 1.7 is about 2.2). I know the relation between memory requirements and DoFs might not be linear, but still...
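To put a rough number on that nonlinearity, here is a minimal Python sketch. It assumes that the memory needed by a sparse direct factorization grows like DoFs^(4/3), which is the textbook fill estimate for nested-dissection ordering on a regular 3D grid. Whether that exponent applies to this particular model, or to MUMPS and SPOOLES as COMSOL runs them, is an assumption, and the function name is made up for illustration.

```python
def memory_growth_factor(dofs_small, dofs_large, exponent=4.0 / 3.0):
    """Estimate how much more factorization memory the larger model needs.

    ASSUMPTION: memory ~ dofs**exponent. The default 4/3 is the classic
    nested-dissection fill estimate for regular 3D grids; it is not a
    measured value for COMSOL, MUMPS, or SPOOLES.
    """
    if dofs_small <= 0 or dofs_large <= 0:
        raise ValueError("DoF counts must be positive")
    return (dofs_large / dofs_small) ** exponent


# DoF counts taken from the two runs described above.
dof_ratio = 3.7e6 / 1.7e6
growth = memory_growth_factor(1.7e6, 3.7e6)

print(f"DoF ratio:               {dof_ratio:.2f}x")
print(f"Estimated memory growth: {growth:.2f}x")
```

Under that assumption the larger model needs somewhat less than three times the memory of the smaller one, which is more than the DoF ratio alone suggests but still far below a 20-fold increase in nodes.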
3. In general, with, let's say, 8 GB available to me per node (because the cluster is a shared resource, more often than not I might not have the full 16 GB to play with) and 30 nodes running (for a grand total of 240 GB of usable RAM), what is a reasonable estimate of how large a model I could solve (in DoFs, anyway)?
If I'm going to keep debugging, I'd like to have a feel for who the actual culprit is here. If someone can tell me "... at 240 GB shared memory, you'll never get past ...#dofs....", that'd be a big help too. Then I'd know whether horsepower is or is not an inherent limitation, and whether to look elsewhere (COMSOL pointed me to stack size, which I'm also quite lost on, and there may be firewall issues I haven't sorted out yet).
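For reference, the back-of-the-envelope arithmetic behind the 240 GB figure, plus a very crude extrapolation, can be sketched in Python as below. Two things here are assumptions rather than facts from my runs: that the 1.7 MDoF model used the full 16 GB of the single machine, and that memory scales like DoFs^(4/3) (the nested-dissection estimate for regular 3D grids). The sketch also ignores per-node overhead and imperfect distribution of the factorization, so treat the result as an optimistic ceiling at best. The function names are hypothetical.

```python
def aggregate_memory_gb(nodes, gb_per_node):
    """Total RAM across the cluster, ignoring per-node overhead."""
    return nodes * gb_per_node


def scaled_dof_estimate(baseline_dofs, baseline_gb, available_gb,
                        exponent=4.0 / 3.0):
    """Extrapolate a solvable DoF count from one known data point.

    ASSUMPTION: memory ~ dofs**exponent, so dofs ~ memory**(1/exponent).
    ASSUMPTION: the baseline run used all of baseline_gb.
    """
    return baseline_dofs * (available_gb / baseline_gb) ** (1.0 / exponent)


total_gb = aggregate_memory_gb(nodes=30, gb_per_node=8)
estimate = scaled_dof_estimate(baseline_dofs=1.7e6, baseline_gb=16,
                               available_gb=total_gb)

print(f"Aggregate memory:      {total_gb} GB")
print(f"Optimistic DoF ceiling: {estimate / 1e6:.1f} MDoF")
```

If the real ceiling turns out to be far below what this gives, that would point away from raw memory and toward something like the stack size or the network setup.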
Ok, in any event, any input, suggestions, etc that anyone can give me would be a HUGE help! Thanks a million!!
--Matt
4 Replies Last Post 11.03.2012, 22:52 GMT-4