Jenkins CI server scalability has its limits, but for the vast majority of applications its performance is good enough. There are installations with hundreds of slaves running about 10k builds daily. While Jenkins configuration is relatively simple, setting up and maintaining a busy server takes some skill. Below are suggestions for keeping it fast, divided into Master configuration, Slave configuration and Job design, plus a few notes on multi-master Jenkins.
Jenkins Master configuration
Number of plugins. Plugins cause performance issues for builds (because of hooks) and for the UI (because they add content to it). Do not install too many plugins, and evaluate each one thoroughly [ref].
Number of jobs. Jenkins gets slow (at least in the UI) with 1000+ jobs [ref]. Moving jobs to several masters (manual static sharding) helps, e.g. one master for builds and another for tests. Functional segregation simplifies the Jenkins configuration and reduces the number of plugins on each master, whereas splitting a big master into two similar ones leaves a complex configuration on each.
Keep the number of active jobs reasonable and remove unused ones.
Use the Git and Gerrit Trigger plugins to serve multiple branches with one set of jobs.
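As a rough aid for keeping the number of active jobs reasonable, the sketch below lists candidates for removal from a job listing in the JSON shape returned by the Jenkins remote API (for example `/api/json?tree=jobs[name,color]`). The function name and the sample data are hypothetical, and treating the "disabled" and "notbuilt" colors as "unused" is an assumption; a real cleanup would also look at the date of the last build.

```python
import json
from typing import List

def find_inactive_jobs(api_json: str) -> List[str]:
    """Return names of jobs that are disabled or were never built."""
    data = json.loads(api_json)
    inactive = []
    for job in data.get("jobs", []):
        # Assumption: these two status colors mark jobs worth reviewing.
        if job.get("color") in ("disabled", "notbuilt"):
            inactive.append(job["name"])
    return inactive

# Hypothetical sample of what the job listing might look like.
sample = json.dumps({"jobs": [
    {"name": "app-build", "color": "blue"},
    {"name": "old-experiment", "color": "disabled"},
    {"name": "never-run", "color": "notbuilt"},
]})

print(find_inactive_jobs(sample))
```

The result is only a review list; whether a job is really unused is a decision for its owner.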
Jobs on Master. There should be none, apart from light internal tasks crucial for Jenkins housekeeping. Definitely no application jobs.
SCM polling on Master. SCM polling for Git or Perforce requires executing the CLI program for each check of each job. For reliable polling it should be configured to run on the master. Polling on slaves (the default for both VCS) is bad because slaves are disposable.
Use push hooks instead of polling. For Git, use the Gerrit Trigger plugin: its “Ref update” event can replace SCM polling in most cases. For Perforce … set the polling period to something large; use the “H” symbol or “@hourly” in the Cron expression of the polling configuration.
Subversion uses SVNKit instead of the CLI, so it is not affected.
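To illustrate why the “H” symbol helps when polling cannot be avoided: it replaces a fixed minute with one derived from a hash of the job name, so jobs polling hourly do not all hit the VCS at the same moment. The sketch below imitates that idea only; the function name is hypothetical and MD5 is an assumption chosen for the example, not the hash Jenkins actually uses.

```python
import hashlib

def spread_hourly(job_name: str) -> str:
    """Return an hourly cron line whose minute depends on the job name."""
    # Assumption: any stable hash works for the illustration.
    digest = hashlib.md5(job_name.encode("utf-8")).hexdigest()
    minute = int(digest, 16) % 60
    return f"{minute} * * * *"

for name in ("app-build", "app-test", "lib-build"):
    print(name, "->", spread_hourly(name))
```

The same job always gets the same minute, while different jobs tend to land on different minutes of the hour.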
Builds lazy-loading. When the JVM minimum and maximum heap sizes differ, WeakReferences (which lazy loading relies on) are garbage-collected before the JVM tries to expand the heap [ref]. This causes extra load from re-loading builds and may sometimes lead to disappearing build records.
The JVM configuration for servers should set the minimum and maximum heap sizes to the same value.
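A minimal sketch of how this rule could be checked automatically, assuming the JVM options of the master are available as a single string. `-Xms` and `-Xmx` are the standard JVM heap flags; the helper names and the sample option strings are hypothetical.

```python
import re
from typing import Optional

_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def heap_bytes(jvm_args: str, flag: str) -> Optional[int]:
    """Return the size in bytes set by -Xms or -Xmx, or None if absent."""
    pattern = r"(?:^|\s)-" + flag + r"(\d+)([kKmMgG]?)(?=\s|$)"
    match = re.search(pattern, jvm_args)
    if match is None:
        return None
    return int(match.group(1)) * _UNITS[match.group(2).lower()]

def heap_is_fixed(jvm_args: str) -> bool:
    """True when both heap flags are present and equal."""
    xms = heap_bytes(jvm_args, "Xms")
    xmx = heap_bytes(jvm_args, "Xmx")
    return xms is not None and xms == xmx

print(heap_is_fixed("-Xms4g -Xmx4g -Djava.awt.headless=true"))
print(heap_is_fixed("-Xms512m -Xmx4g"))
```

Sizes are compared in bytes, so `-Xms4096m -Xmx4g` counts as a fixed heap as well.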
Access control. Authenticated users should be allowed to do anything except system administration [ref]. “Trust users not to be malicious. Don’t trust users not to do daft things - or read documentation, or to have well behaved unit tests.” [ref] Trust encourages people, and it also saves on authorization: complex authorization (e.g. the Role Strategy plugin) kills UI performance, and API performance suffers too.
Disk IO performance. Use fast disks for configuration (startup time) and build records (build lazy loading) [ref]. An SSD on the master helps a lot [ref]. Separate configuration, build records and artifact storage. It is worth looking at pluggable artifact transfer and storage (JENKINS-17236).
Use an external API/UI frontend for Jenkins. Jenkins is not very good at UI performance, and UI plugins make it worse. Workarounds are external UI dashboards or frontend systems [ref]. Examples of problematic plugins:
- Dashboard View plugin has a real problem with lazy loading (it is thought to be fixed, though) [ref].
- Nested Views plugin causes permission re-evaluation for each job on the server several times. Using regexps to filter jobs makes it worse. It is worth trying explicit lists of jobs instead of regexps. Replacing it with the Cloudbees Folders Plugin might help but needs evaluation.
HTTP cache. A fast HTTP proxy in front of Jenkins that caches static data [ref] might help, but this requires further evaluation.
Servlet container. Embedded Winstone (before 1.535) or Jetty 8 (1.535+, but not in 1.532.1 LTS) vs Tomcat. Jetty used to be better than Tomcat on consistent throughput and resource consumption, but for recent Jetty 8-9 and Tomcat 7 there is no clear evidence of that.
Jenkins Slave configuration
Number of slaves. The “X1K initiative” is a goal for Jenkins developers to ensure smooth operation of a master with 1000 executors across all slaves [ref]. It is still a challenge. Somewhere around 250 slaves with lots of builds, slave connections start breaking in the middle of a build [ref], and there is evidence of Jenkins tending to lose connections to slaves at about a hundred slaves [ref]. Since the thread usage improvements in Jenkins remoting in Jenkins core 1.521 and SSH Slaves plugin 0.27 it should not be an issue [ref, ref], but this is not proven yet.
Number of executors per slave. Increasing the number of executors beyond the slave's capabilities decreases overall throughput due to clashes, IO congestion or RAM swapping. Size it by RAM, CPU cores and build type. RAM should be enough for the maximum number of builds at their maximum memory setting, plus file cache. CPU should be enough to work below 100% utilization, taking IO into account, since waiting on IO releases some CPU time. Have less than 1 executor per CPU core for single-thread builds. Consider the IOPS limit to avoid disk IO becoming a bottleneck. Generally, if the 15-minute load average is higher than the number of cores, the number of executors should be decreased. There is a suggestion to use 1 executor per slave for isolation [ref]. That is reasonable in the cloud, but on dedicated hardware the same isolation can be achieved with lightweight containers.
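The sizing rules above can be turned into a back-of-the-envelope calculation. The sketch below is only an illustration: the function names and sample numbers are hypothetical, reading "less than 1 executor per core" as "cores minus one" is an assumption, and IOPS limits are not modelled at all.

```python
def max_executors(ram_mb: int, file_cache_mb: int,
                  build_max_mb: int, cores: int) -> int:
    """Suggest an executor count from RAM and CPU, never below 1."""
    # RAM must hold every concurrent build at its maximum memory
    # setting, with room left for the file cache.
    by_ram = (ram_mb - file_cache_mb) // build_max_mb
    # Assumption: "less than 1 executor per core" means cores - 1.
    by_cpu = cores - 1
    return max(1, min(by_ram, by_cpu))

def should_reduce_executors(load_avg_15min: float, cores: int) -> bool:
    """Apply the rule of thumb: 15-minute load average above core count."""
    return load_avg_15min > cores

# Hypothetical slave: 32 GB RAM, 8 GB file cache, 4 GB per build, 8 cores.
print(max_executors(32768, 8192, 4096, 8))
print(should_reduce_executors(9.5, 8))
```

On Unix-like slaves the 15-minute load average can be read with `os.getloadavg()[2]` and the core count with `os.cpu_count()`.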
Job design
Workspace cleanup means removing the job workspace before the build starts to get a clean build, or after it to save disk space. It adds time for a fresh checkout and even more for Maven to download dependencies. As a result the build may run several times longer.
Address it in the build system: have a reliable “clean” target in the build script, do not create files outside of temporary build directories, and never touch files under version control. Clean up workspaces periodically to be sure. Always do it for “release” builds, where build speed is not as important as build sanity.
Artifact fingerprinting. A large fingerprint database may kill Jenkins master performance. The Copy Artifact plugin always checks fingerprints. Maven builds record artifact fingerprints unconditionally.
So prevent code review (Gerrit) builds from recording fingerprints for Maven 2/3 builds [ref], perhaps by disabling Maven artifact archiving. This applies to freestyle builds too, but it is controllable there.
Post-build actions. Limit post-build steps; they serialise parallel builds (JENKINS-9913). Move the work to build steps, e.g. use a custom artifact archiver (as a build step) such as “mvn deploy”.
Maven jobs vs Freestyle jobs. Use Freestyle: Maven jobs are notably slower and have their own set of bugs. The Maven job type itself is even considered bad by a core Jenkins contributor [ref].
Large build log. The build log is loaded into master memory, causing an OoM error if the log is too big. Use the Log File Size Checker plugin to fail the job if the console log reaches a limit.
Sonar analysis. Running Sonar analysis on each build makes it 2-3 times longer while adding little value: Sonar is a monitoring and code inspection tool, not a gatekeeper. Run it nightly, not in each build.
Reference repository for Git SCM. A Git repository on the local file system can be used as a reference: only the update is downloaded, and the rest is borrowed from the reference repository instead of being fetched again.
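Outside of Jenkins, the same effect can be reproduced with the `--reference` option of `git clone`, where objects already present in the reference repository are reused rather than downloaded. The sketch below only builds and runs that command; the helper names, URL and paths are hypothetical.

```python
import subprocess
from typing import List

def reference_clone_cmd(url: str, reference_repo: str, dest: str) -> List[str]:
    """Build a git clone command that reuses objects from a local repo."""
    return ["git", "clone", "--reference", reference_repo, url, dest]

def reference_clone(url: str, reference_repo: str, dest: str) -> None:
    """Run the clone; raises CalledProcessError if git fails."""
    subprocess.run(reference_clone_cmd(url, reference_repo, dest), check=True)

# Hypothetical URL and paths, shown without running git.
print(" ".join(reference_clone_cmd(
    "ssh://git.example.org/app.git", "/var/cache/git/app.git", "workspace")))
```

The clone depends on the reference repository staying in place, so the reference should live outside any workspace that gets cleaned up.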
Multi master?
There are no multi-master Jenkins clusters, and none are expected in the foreseeable future [ref, ref]. The only way to share load between masters without custom software is to set up two masters, each with its own set of jobs.
- Jenkins Enterprise by Cloudbees provides fault tolerance only [ref]. It is an “active - spare” cluster with no load balancing.
- Jenkins Operations Center by Cloudbees simplifies management of multiple masters and slaves. It does not provide a multi-master instance with a single point of entry [ref].
- The Openstack/HP multi-master setup uses custom software (Zuul + Gearman) and a specific standardized workflow on top of it [ref]. It does not use the Jenkins UI and only provides direct links to builds in Zuul or Gerrit. Build history, analytics and trends are collected via an external search engine [ref].
General & Cultural tips
Follow the “Keep it simple, stupid” and “You aren't gonna need it” principles.
References
- “Keynotes”. Kohsuke Kawaguchi, Cloudbees. Jenkins User Conference 2013 - Palo Alto.
Slides: http://www.cloudbees.com/sites/default/files/juc/juc2013/2013-1023-JUC-PaloAlto-Kohsuke-Keynote.pptx
Video: http://www.youtube.com/watch?v=FaMoiVpKUvQ
- “Multiple Jenkins Master Support”. Khai Do, Hewlett Packard. Jenkins User Conference 2013 - Palo Alto.
Slides: http://docs.openstack.org/infra/publications/gearman-plugin/
Video: http://www.youtube.com/watch?v=pLQddm85fPQ
- “Maintaining Huge Jenkins Clusters - Have We Reached the Limit of Jenkins?”. Robert Sandell, Sony Mobile Communications. Jenkins User Conference 2013 - Palo Alto.
Slides: http://www.cloudbees.com/sites/default/files/juc/juc2013/2013-1023-Palo-Alto-Robert_Sandell-Maintaining-Huge-Jenkins-Clusters.pdf
Video: http://www.youtube.com/watch?v=LRonDiXUx1U
- “To Infinity & Beyond the Small Team”. James Nord, Cisco.
Slides: http://www.cloudbees.com/sites/default/files/JUC_Palo_Alto_2013_TIaBTST.pdf
Video: http://www.youtube.com/watch?v=CGjgS16dVUc
- “Scaling Jenkins Horizontally with Jenkins Operations Center by Cloudbees”. Cloudbees blog: http://blog.cloudbees.com/2013/12/scaling-jenkins-horizontally-with.html
- “Jenkins at Three Years: Becomes Literate, Does Mobile in the Cloud and Handles Multi-Branch”. Harpreet Singh & Kohsuke Kawaguchi, CloudBees. Jenkins User Conference 2013 - Palo Alto.
Slides: http://www.slideshare.net/kohsuke/jenkins-user-conference-2013-literate-multibranch-mobile-and-more
Video: http://www.youtube.com/watch?v=AKcQuOROFlI
- “Jenkins Scalability Summit notes”. Jenkins Scalability Summit, Oct 2013 - Los Altos. https://docs.google.com/document/d/1GqkWPnp-bvuObGlSe7t3k76ZOD2a8Z2M1avggWoYKEs/edit#
- “Kohsuke with OSS hat / Core improvements”. Jenkins Scalability Summit, Oct 2013 - Los Altos.
Slides: https://wiki.jenkins-ci.org/download/attachments/68747344/Kohsuke.pptx
- “Sony Mobile list to Santa Claus”. Robert Sandell, Sony Mobile. Jenkins Scalability Summit, Oct 2013 - Los Altos.
Slides: https://wiki.jenkins-ci.org/download/attachments/68747344/Sony+Mobile.pptx
- “Reducing the # of threads in Jenkins: SSH slaves”. Kohsuke Kawaguchi, Cloudbees. Jenkins CI blog: http://jenkins-ci.org/content/reducing-threads-jenkins-ssh-slaves
- “High availability”. Jenkins Enterprise: http://www.cloudbees.com/jenkins-enterprise-cloudbees-features-high-availability-plugin.cb
- “Jenkins' Maven job type considered evil”. Stephen Connolly. Stephen's Java Adventures. http://javaadventure.blogspot.ru/2013/11/jenkins-maven-job-type-considered-evil.html