Using Platform LSF® with FLUENT

November 2003
Platform Computing
Comments to: doc@platform.com

Platform LSF® software (“LSF”) is integrated with products from Fluent Inc., allowing FLUENT jobs to take advantage of the checkpointing and migration features provided by LSF. Because an interrupted or migrated job can resume from its last checkpoint instead of starting over, less work is repeated and results arrive sooner.

This document provides instructions for installing, configuring, and using LSF with FLUENT. It is assumed you are already familiar with using FLUENT software and checkpointing jobs in LSF.

Contents
◆ How LSF Works with FLUENT
◆ Obtaining Distribution Files
◆ Configuring LSF for FLUENT
◆ Submitting a FLUENT Job
◆ Checkpointing and Restarting FLUENT jobs

How LSF Works with FLUENT

To checkpoint jobs, LSF uses two executable files called echkpnt and erestart. LSF supplies special versions of echkpnt and erestart that allow checkpointing with FLUENT.

Checkpoint directories
When you submit a checkpointing job, you specify a checkpoint directory. Before the job starts running, LSF sets the environment variable LSB_CHKPNT_DIR. The value of LSB_CHKPNT_DIR is a subdirectory of the checkpoint directory specified in the command line. This subdirectory is identified by the job ID and only contains files related to the submitted job.

Checkpoint trigger files
When you checkpoint a FLUENT job, LSF creates a checkpoint trigger file (.check) in the job subdirectory, which causes FLUENT to checkpoint and continue running. A special option is used to create a different trigger file (.exit), which causes FLUENT to checkpoint and exit the job.

FLUENT uses the LSB_CHKPNT_DIR environment variable to determine the location of checkpoint trigger files. It checks the job subdirectory periodically while running the job. FLUENT does not perform any checkpointing unless it finds the LSF trigger file in the job subdirectory.
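The trigger-file handshake described in this section can be imitated without LSF or FLUENT installed. The sketch below is only a simulation: a temporary directory stands in for the checkpoint directory, the job ID 1234 is made up, and the function poll_triggers is a hypothetical stand-in for one of FLUENT's periodic checks. Checking for .exit before .check is an assumption of this sketch, not documented behavior.

```shell
#!/bin/sh
# Simulation only: no LSF or FLUENT is involved.

CHKPNT_DIR=$(mktemp -d)               # stand-in for the directory given to bsub -k
JOB_ID=1234                           # hypothetical job ID
LSB_CHKPNT_DIR="$CHKPNT_DIR/$JOB_ID"  # job subdirectory, named after the job ID
mkdir -p "$LSB_CHKPNT_DIR"

# Hypothetical stand-in for one periodic check of the job subdirectory.
# Prints what the solver would do on this pass and removes the trigger
# file it acted on.
poll_triggers() {
    if [ -f "$LSB_CHKPNT_DIR/.exit" ]; then
        rm -f "$LSB_CHKPNT_DIR/.exit"
        echo "checkpoint and exit"
    elif [ -f "$LSB_CHKPNT_DIR/.check" ]; then
        rm -f "$LSB_CHKPNT_DIR/.check"
        echo "checkpoint and continue"
    else
        echo "no checkpoint"
    fi
}

r1=$(poll_triggers)                   # no trigger file yet
touch "$LSB_CHKPNT_DIR/.check"        # a plain checkpoint request
r2=$(poll_triggers)
touch "$LSB_CHKPNT_DIR/.exit"         # a checkpoint-and-exit request
r3=$(poll_triggers)
r4=$(poll_triggers)                   # both triggers have been removed

printf '%s\n' "$r1" "$r2" "$r3" "$r4"
rm -rf "$CHKPNT_DIR"                  # clean up the temporary directory
```

The point of the design is that LSF and FLUENT never talk to each other directly; the only shared state is a file in the job subdirectory.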
FLUENT removes the trigger file after checkpointing the job.

Restart jobs
If a job is restarted, LSF attempts to restart the job with the -r option appended to the original FLUENT command. FLUENT uses the checkpointed data and case files to restart the process from that checkpoint, rather than repeating the entire process.

Each time a job is restarted, it is assigned a new job ID, and a new job subdirectory is created in the checkpoint directory. Files in the checkpoint directory are never deleted by LSF, but you may choose to remove old files once the FLUENT job is finished and the job history is no longer required.

Obtaining Distribution Files

Distribution files for LSF to be used with FLUENT are available from Platform Computing. Installation instructions are included.

The files are available from your LSF vendor, and from Platform’s Web site (www.platform.com) and FTP site (ftp.platform.com). Access to the download area of the Platform Web site and the Platform FTP site is controlled by login name and password. If you are unable to access the distribution files, send email to support@platform.com.

Configuring LSF for FLUENT

LSF provides special versions of echkpnt and erestart to allow checkpointing with FLUENT. You must make sure LSF uses these files instead of the standard versions.

Configure LSF for FLUENT
◆ Overwrite the standard versions of echkpnt and erestart with the special FLUENT versions.
OR
◆ Complete the following steps:
  a Leave the standard LSF files in the default location and install the FLUENT versions in a different directory.
  b In lsf.conf, modify the LSF_ECHKPNTDIR environment variable to point to the FLUENT versions.
    The LSF_ECHKPNTDIR environment variable specifies the location of the echkpnt and erestart files that LSF will use.
    If this variable is not defined, LSF uses the files in the default location, identified by the environment variable LSF_SERVERDIR.
  c Save the changes to lsf.conf.
  d Reconfigure the cluster with the commands lsadmin reconfig and badmin reconfig.
    LSF checks for any configuration errors. If no fatal errors are found, you are asked to confirm reconfiguration. If fatal errors are found, reconfiguration is aborted.

Submitting a FLUENT Job

Submit the job as usual, but include the parameters required for checkpointing.

Syntax
The syntax for the bsub command to submit a FLUENT job is:

    bsub [-k checkpoint_dir | -k "checkpoint_dir [checkpoint_period]"] [bsub options] FLUENT command [FLUENT options] -lsf

The checkpointing feature for FLUENT jobs requires all of the following parameters:

-k checkpoint_dir
    Regular option to bsub that specifies the name of the checkpoint directory.
FLUENT command
    Regular command used with FLUENT software.
-lsf
    Special option to the FLUENT command. Specifies that FLUENT is running under LSF, and causes FLUENT to check for trigger files in the checkpoint directory if the environment variable LSB_CHKPNT_DIR is set.

Checkpointing and Restarting FLUENT jobs

Checkpointing syntax
The syntax for the bchkpnt command is:

    bchkpnt [bchkpnt options] [-k] [job_ID]

FLUENT parameters
The following parameters are used with FLUENT:
◆ -k
  Regular option to the bchkpnt command that specifies checkpoint and exit. The job is killed immediately after being checkpointed. When the job is restarted, it does not have to repeat any operations.
◆ job_ID
  Job ID of the FLUENT job. Used to specify which job to checkpoint.
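As an illustration of the bsub and bchkpnt syntax above, the following dry-run sketch prints the command lines instead of executing them, so it can be tried on a machine without LSF. The checkpoint directory, checkpoint period, job ID, and the FLUENT arguments (3d -i run.jou) are placeholders chosen for this example, and the three helper functions are hypothetical names, not part of LSF.

```shell
#!/bin/sh
# Dry run: the commands are printed, not executed.

CHKPNT_DIR="/scratch/chkpnt"   # hypothetical checkpoint directory
CHKPNT_PERIOD=60               # hypothetical checkpoint period
JOB_ID=1234                    # hypothetical job ID reported by bsub

# Submit: -k names the checkpoint directory (with an optional period),
# and -lsf tells FLUENT that it is running under LSF.
submit_cmd() {
    printf 'bsub -k "%s %s" fluent 3d -i run.jou -lsf\n' "$CHKPNT_DIR" "$CHKPNT_PERIOD"
}

# Checkpoint the job and let it continue running.
checkpoint_cmd() {
    printf 'bchkpnt %s\n' "$JOB_ID"
}

# Checkpoint the job and exit (-k kills the job after the checkpoint).
checkpoint_exit_cmd() {
    printf 'bchkpnt -k %s\n' "$JOB_ID"
}

submit_cmd
checkpoint_cmd
checkpoint_exit_cmd
```

To run the commands for real, replace the placeholders with values from your own cluster and paste the printed lines into a shell where LSF is configured.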
Restarting syntax
The syntax for the brestart command is:

    brestart [brestart options] checkpoint_directory [job_ID]

FLUENT parameters
The following parameters are used with FLUENT:
◆ checkpoint_directory
  Specifies the checkpoint directory, where the job subdirectory is located.
◆ job_ID
  Job ID of the FLUENT job, specifies which job to restart. At this point, the restarted job is assigned a new job ID, and the new job ID is used for checkpointing. The job ID changes each time the job is restarted.

Technical Support

Contacting Platform
Contact Platform Computing or your LSF vendor for technical support. Use one of the following to contact Platform support:

Email: support@platform.com
World Wide Web: www.platform.com
Phone:
◆ North America: +1 905 948 4297
◆ Europe: +44 1256 370 530
◆ Asia: +86 10 6238 1125
Toll-free phone: 1-877-444-4LSF (+1 877 444 4573)
Mail:
    Platform Support
    Platform Computing
    3760 14th Avenue
    Markham, Ontario
    Canada L3R 3T7

When contacting Platform Computing, please include the full name of your company.

We’d like to hear from you
If you find an error in any Platform documentation, or you have a suggestion for improving it, please let us know:

Email: doc@platform.com
Mail:
    Information Development
    Platform Computing
    3760 14th Avenue
    Markham, Ontario
    Canada L3R 3T7

Be sure to tell us:
◆ The title of the manual you are commenting on
◆ The version of the product you are using
◆ The format of the manual (HTML or PDF)

Copyright

Copyright © 1994-2003 Platform Computing Corporation
All rights reserved.

Although the information in this document has been carefully reviewed, Platform Computing Corporation (“Platform”) does not warrant it to be free of errors or omissions. Platform reserves the right to make corrections, updates, revisions or changes to the information in this document.
UNLESS OTHERWISE EXPRESSLY STATED BY PLATFORM, THE PROGRAM DESCRIBED IN THIS DOCUMENT IS PROVIDED “AS IS” AND WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL PLATFORM COMPUTING BE LIABLE TO ANYONE FOR SPECIAL, COLLATERAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION ANY LOST PROFITS, DATA, OR SAVINGS, ARISING OUT OF THE USE OF OR INABILITY TO USE THIS PROGRAM.

Document redistribution policy
This document is protected by copyright and you may not redistribute or translate it into another language, in part or in whole.

Internal redistribution
You may only redistribute this document internally within your organization (for example, on an intranet) provided that you continue to check the Platform Web site for updates and update your version of the documentation. You may not make it available to your organization over the Internet.

Trademarks
LSF® is a registered trademark of Platform Computing Corporation in the United States and in other jurisdictions.

ACCELERATING INTELLIGENCE™, THE BOTTOM LINE IN DISTRIBUTED COMPUTING™, PLATFORM COMPUTING™, and the PLATFORM and LSF logos are trademarks of Platform Computing Corporation in the United States and in other jurisdictions.

Rational and ClearCase are trademarks or registered trademarks of Rational Software Corporation in the United States and/or in other countries.

UNIX is a registered trademark of The Open Group in the United States and in other jurisdictions.

Other products or services mentioned in this document are identified by the trademarks or service marks of their respective owners.

Last update: November 13 2003
Latest version: www.platform.com/services/support/docs_home.asp