







Attribution-NonCommercial-ShareAlike 2.0 France

http://creativecommons.org/licenses/by-nc-sa/2.0/fr/



The Box. Copyright © 2008 Paris JUG. Licence CC - Creative Commons 2.0 France - Attribution - NonCommercial - ShareAlike





### Any performance tuning advice provided in this presentation... will be wrong!



The Box www.kodewerk.com

#### Work as independent (a.k.a. freelancer)

- performance tuning services
- benchmarking
- Java performance tuning course and seminars
- Co-author: www.javaperformancetuning.com
- Contributing editor: www.theserverside.com
- Nominated Sun Java Champion
- Blah blah blah

Change the way you think about performance tuning





# Changes in hardware are now redefining the rules of coding, design, and architecture










- Cray supercomputers
  - Fortran, C, CAL, special purpose languages
- Special Purpose Devices (VHDL)
- Smalltalk Systems
- Java Platform (97)



### How did we get better performance?



08/04/2008


### Historical Improvements

- Sometimes better algorithms
- Mostly faster hardware
  - Clock speeds (read CPU)
  - Bus
  - Memory
  - Networks
- Exotic hardware



## Needed to study existing or create new hardware




### Cray CPU Block Diagram



#### **Developers Adapted to Hardware**

- Code needed to utilize key features altering coding style
  - Short loops with no branching
  - regular memory strides
    - always increment loop counters by 1
  - statistically acceptable errors
- Align short loops and functions on instruction buffer boundaries






National Center for Atmospheric Research








### **Parallel Computing**






### Posix Threading Support (Early 90s)

- UNIX kernels single threaded (80)
- SunOS is made SMP safe (91)
  - entire kernel is protected with a single lock
  - threaded in 93
- AIX pthread support 93?
- Windows NT released 93
  - simplified alternative to pthreads
- HP-UX POSIX suffers setback (95)



### Languages Play Catch-up

- Java Platform explodes onto the scene (96)
  - support for distributed and parallel computing
  - strong play to virtualize hardware
  - cross-platform threading model




### Java Thread Support

- synchronized statement and modifier
  - map to OS level locks
- volatile keyword
  - no one knows what it does
- java.lang.Thread
- java.lang.Object.wait()
- java.lang.Object.notify()
- java.lang.Object.notifyAll()
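As a minimal sketch of how these primitives fit together, the class below builds a one-slot mailbox from `synchronized`, `wait()` and `notifyAll()`. The `Mailbox` class and its `demo()` helper are hypothetical names chosen for illustration; they do not come from the talk.

```java
// Hypothetical one-slot mailbox; class and method names are illustrative only.
class Mailbox {
    private Object item; // guarded by the monitor of 'this'

    // Blocks while the slot is occupied, then stores the item.
    public synchronized void put(Object newItem) throws InterruptedException {
        while (item != null) {
            wait(); // releases the monitor until another thread notifies
        }
        item = newItem;
        notifyAll(); // wake any thread waiting in take()
    }

    // Blocks while the slot is empty, then removes and returns the item.
    public synchronized Object take() throws InterruptedException {
        while (item == null) {
            wait();
        }
        Object result = item;
        item = null;
        notifyAll(); // wake any thread waiting in put()
        return result;
    }

    // Small demo: a producer thread hands one message to the calling thread.
    static String demo() {
        final Mailbox box = new Mailbox();
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    box.put("hello");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();
        try {
            String message = (String) box.take();
            producer.join();
            return message;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The `while` loops around `wait()` matter: a thread can wake up without its condition being true, so the condition is always re-checked.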



### **Java Threading 1.0**

- Single threaded model
  - green threads used by JVM
  - eventually mapped onto a single OS thread
- Java Memory Model hiding concurrency bugs
- CPU Memory Model hiding concurrency bugs


### Memory Models

- Formal specification of how memory operations will function
  - ensures consistency in our view of variables
  - enforces strict ordering of memory operations
  - allows or disallows compiler optimizations
- Java Memory Model
- Chip level Memory Model
  - Intel
  - AMD
  - Sparc
  - PowerPC
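One visible consequence of a memory model is whether a write made by one thread is guaranteed to be seen by another. The sketch below, with hypothetical names (`StopFlag`, `runFor`), uses a `volatile` flag to stop a worker. Without `volatile`, the Java Memory Model would allow the worker to keep reading a stale `true` indefinitely.

```java
// Hypothetical example; names are illustrative, not from the talk.
class StopFlag {
    // volatile guarantees the worker sees the write made by the other thread.
    private volatile boolean running = true;
    private long iterations; // written only by the worker, read after join()

    // Lets a worker spin for roughly 'millis' ms, then stops it via the flag.
    long runFor(long millis) {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                long local = 0;
                while (running) { // volatile read on every pass
                    local++;
                }
                iterations = local;
            }
        });
        worker.start();
        try {
            Thread.sleep(millis);
            running = false; // volatile write, visible to the worker
            worker.join();   // join() makes 'iterations' safely readable here
        } catch (InterruptedException e) {
            running = false;
            Thread.currentThread().interrupt();
        }
        return iterations;
    }
}
```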


### **Hints of the Future**



Beginning with J2SE 1.4.1, the Java HotSpot Server VM does not support operations on chips with Sparc V8 architecture





### Hardware Plays Catch-up

Sparc V9 contains pseudo instructions to sync the L1 and L2 caches with main memory on multi-CPU machines






### **Hardware Acceleration Slows**

- Intel announces that focus will shift from clock speed to multi-core/hyperthreading
  - multi-core Xeon processors ship late 2005
- 2007: C|Net reports that Intel and Microsoft say software needs to heed Moore's law




### Kabutz: Law of Sudden Riches

We no longer have uni-processor systems to hide behind



- Applications suddenly have more CPU
  - bigger problem for older 3rd party libraries




#### Dangers

- All existing threading bugs start exposing themselves
- We have to worry about
  - deadlock
  - livelock
  - thread stalls
  - race conditions
- Lock contention
  - serialized execution
- Strange behavior in clusters
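To make one of these dangers concrete, here is a sketch of a common way to avoid deadlock: always acquire locks in a fixed global order. The `Account` class, its `id` field and the `demo()` helper are assumptions made for illustration; the talk does not prescribe this technique.

```java
// Hypothetical example of lock ordering; all names are illustrative.
class Account {
    final int id;         // unique id, used to order lock acquisition
    private long balance; // guarded by this account's monitor

    Account(int id, long balance) {
        this.id = id;
        this.balance = balance;
    }

    long balance() {
        synchronized (this) {
            return balance;
        }
    }

    // Always lock the account with the lower id first, so two opposite
    // transfers can never each hold one lock while waiting for the other.
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    // Two threads transfer in opposite directions; returns the combined balance.
    static long demo() {
        final Account a = new Account(1, 1000);
        final Account b = new Account(2, 1000);
        Thread t1 = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 10000; i++) transfer(a, b, 1);
            }
        });
        Thread t2 = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 10000; i++) transfer(b, a, 1);
            }
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return a.balance() + b.balance();
    }
}
```

If each thread instead locked `from` before `to`, the two opposite transfers could each grab one monitor and wait forever for the other.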



### Database Vendors React






### Late 2006, ~50% of Java performance course attendees show up with multi-core laptops




### Multi-core is a fact of life

- Developers must deal with concurrency
  - truly threaded applications are more the norm
- Multi-core puts more pressure on
  - memory
  - I/O resources
  - shared variables
  - databases?




- Sharing is a big performance issue
  - points of serialization now hurt more than ever














## Are You Awake?

#### L1/L2 caches can thrash

```
for (int i = 0; i < matrix.length; i++) {
    for (int j = 0; j < matrix[i].length; j++) {
        matrix[i][j] *= 2;
    }
}
```

#### benches in 430ms

```
for (int i = 0; i < matrix.length; i++)
    for (int j = 0; j < matrix[i].length; j++)
        matrix[j][i] *= 2;
```
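For anyone who wants to try the comparison, here is a self-contained sketch of both traversals with a crude timer. The `Traversal` class, the `ones` helper and the matrix size are assumptions; timings depend heavily on the machine and JVM, so none are quoted here.

```java
// Hypothetical harness for the two loops above; names and sizes are illustrative.
class Traversal {
    // Row-major: the inner loop walks one row array front to back,
    // touching adjacent memory and making good use of each cache line.
    static void doubleRowMajor(int[][] matrix) {
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                matrix[i][j] *= 2;
            }
        }
    }

    // Column-major: the inner loop hops to a different row array on every
    // step. Assumes a square matrix, as the slide's loop bounds implicitly do.
    static void doubleColumnMajor(int[][] matrix) {
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix.length; j++) {
                matrix[j][i] *= 2;
            }
        }
    }

    // Builds an n x n matrix filled with ones.
    static int[][] ones(int n) {
        int[][] m = new int[n][n];
        for (int[] row : m) {
            java.util.Arrays.fill(row, 1);
        }
        return m;
    }

    // Crude single-shot timing in milliseconds; no warm-up, so treat as indicative only.
    static long timeMillis(boolean rowMajor, int n) {
        int[][] m = ones(n);
        long start = System.nanoTime();
        if (rowMajor) {
            doubleRowMajor(m);
        } else {
            doubleColumnMajor(m);
        }
        return (System.nanoTime() - start) / 1000000L;
    }
}
```

Calling `Traversal.timeMillis(true, 4000)` and `Traversal.timeMillis(false, 4000)` a few times each gives a rough feel for the gap. Both traversals compute the same result; only the memory access pattern differs.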



# Locking is Pessimistic



#### The glass is half full




## **Reducing Contention**

- Share-nothing designs
- Pipelined designs
  - messaging and mail boxes
- Minimize transactions
  - duration
  - numbers
- Minimize locking
  - Concurrency package
- Garbage collection
- Hotspot/JIT




# Automated Memory Management

#### GC is "stop-the-world"

- GC needs exclusive access to Java heap
- all application threads must be paused
- point of serialization in your application
- GC is CPU intensive
  - application pause time tied to clock speed
- An improperly configured Java heap hinders performance
  - Too small => too frequent, risk OOME
  - Too large => long pause times


# **Keeping Friends Close**

- Large page support now on all platforms
  - keeps related objects on the same page
  - helps avoid TLB misses (expensive to resolve)
  - lock pages into RAM
- Solaris supports pages up to 256 MB (depending on class of machine)
- Linux/Windows support pages up to 4 MB


# **Garbage Collection**

- 1.5 parallel becomes default
  - consider using concurrent
- 1.6 supports escape analysis
  - references that remain local can be dealt with more efficiently



# More to Come?

- Dominant chip architecture is cache-coherent non-uniform memory access (NUMA)
  - local access is very quick
  - remote access is much slower
- encourages thread/core affinity
  - mitigates L1/L2 cache coherency issues
  - reduces contention on bus and remote memory

# **Garbage Collection Improvements**

## GC/JVM allocations aware of NUMA

- localized allocations GC'd faster
- localized allocations remain in CPU cache
- enabled using -XX:+UseNUMA (1.6 Update 2)
  - Solaris is simple
  - Windows and Linux require more complex configuration
- http://java.sun.com/javase/technologies/hotspot/largememory.jsp





## Acquiring a lock is expensive

- maybe
- Vast majority of locks are not contended
- RDB vendors have known for more than 20 years that locking kills performance
  - what can we learn from RDBs?




## Optimizations

- Use observations to guide optimizations
- Relax constraints
- Throughput vs. fairness
- Cache to avoid using expensive resources




# Hardware to Reduce Contention

#### Transactional Memory

- looks more like an optimistic transaction
- lock defines "transactional region"
- allows all threads simultaneous access
- hardware watches for write-write conflict
- thread rollback and memory repair








# **Software Improvements**

- JSE 5.0 provides a laundry list of improvements aimed at reducing contention
  - atomic variables
  - improved volatile
  - java.util.concurrent (JSR 166)
    - semantically richer concurrency
    - Collections with copy on write semantics
    - ConcurrentHashMap
    - ReentrantLock
    - ReadWriteLock
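A small sketch of what these additions make possible: a shared hit counter built from `ConcurrentHashMap` and `AtomicInteger`, with no `synchronized` block in application code. The `HitCounter` class and its methods are hypothetical names used only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example; only the java.util.concurrent types are real API.
class HitCounter {
    private final ConcurrentMap<String, AtomicInteger> hits =
            new ConcurrentHashMap<String, AtomicInteger>();

    // Records one hit for the key without taking an explicit lock.
    void record(String key) {
        AtomicInteger count = hits.get(key);
        if (count == null) {
            AtomicInteger fresh = new AtomicInteger();
            // putIfAbsent returns the existing value if another thread won the race.
            count = hits.putIfAbsent(key, fresh);
            if (count == null) {
                count = fresh;
            }
        }
        count.incrementAndGet();
    }

    int count(String key) {
        AtomicInteger count = hits.get(key);
        return count == null ? 0 : count.get();
    }

    // Two threads record 5000 hits each on the same key; returns the final count.
    static int demo() {
        final HitCounter counter = new HitCounter();
        Runnable task = new Runnable() {
            public void run() {
                for (int i = 0; i < 5000; i++) counter.record("page");
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter.count("page");
    }
}
```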




# Monitoring

```
public void run() {
    boolean detected = false;
    while (running) {
        // Report the first time the counter is seen outside its expected range.
        if ((counter < 0) || (counter > 2)) {
            if (!detected) {
                System.out.println("Corrupted " + counter);
                detected = true;
            }
        }
    }
}
```

## Not Thread Safe

```
private int counter = 0;

Runnable mutator = new Runnable() {
    public void run() {
        long localCount = 0;
        while (running) {
            counter++;
            counter--;
            localCount++;
        }
        addToTotalCount(localCount);
    }
};
```



## Volatile

```
// volatile gives visibility, but counter++ is still a non-atomic read-modify-write
private volatile int counter = 0;

Runnable mutator = new Runnable() {
    public void run() {
        long localCount = 0;
        while (running) {
            counter++;
            counter--;
            localCount++;
        }
        addToTotalCount(localCount);
    }
};
```

## **Doubly Synchronized**

```
// Instance based counter
private int counter = 0;

// Runnable block
Runnable mutator = new Runnable() {
    public void run() {
        long localCount = 0;
        while (running) {
            synchronized (this) { counter++; }
            synchronized (this) { counter--; }
            localCount++;
        }
        addToTotalCount(localCount);
    }
};
```



## Synchronized

```
// Instance based counter
private int counter = 0;

// Runnable block
Runnable mutator = new Runnable() {
    public void run() {
        long localCount = 0;
        while (running) {
            synchronized (this) {
                counter++;
                counter--;
            }
            localCount++;
        }
        addToTotalCount(localCount);
    }
};
```

## **Doubly Reentrant Lock**

```
// Instance based counter
private int counter = 0;
private final ReentrantLock lock = new ReentrantLock();

// Runnable block
lock.lock();   // acquire outside try so unlock() only runs if lock() succeeded
try {
    counter++;
} finally { lock.unlock(); }
lock.lock();
try {
    counter--;
} finally { lock.unlock(); }
```

## Reentrant Lock

```
// Instance based counter
private int counter = 0;
private final ReentrantLock lock = new ReentrantLock();

// Runnable block
lock.lock();
try {
    counter++;
    counter--;
} finally {
    lock.unlock();
}
```

## AtomicInteger

```
// Instance based counter
private final AtomicInteger counter = new AtomicInteger();

// Runnable block
while (running) {
    counter.incrementAndGet();
    counter.decrementAndGet();
    localCount++;
}
```

# Results

| Bench               | Counter   |
|---------------------|-----------|
| Not Thread Safe     | 750526139 |
| Volatile            | 333765152 |
| Double Synchronized | 28829033  |
| Synchronized        | 28799357  |
| Double Locked       | 28966764  |
| Locked              | 28830148  |
| AtomicInteger       | 203393689 |

JDK 1.5.0\_10, Intel 3.4 GHz Hyper-threaded, Windows XP
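The talk does not show the harness behind these numbers. The sketch below is one plausible way to gather such a count for the synchronized variant. `CounterBench`, `runSynchronized` and the thread count are assumptions; the `running` flag and `addToTotalCount` follow the names used in the slides.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical harness; a guess at the setup, not the author's actual benchmark.
class CounterBench {
    private volatile boolean running = true;
    private final AtomicLong totalCount = new AtomicLong();
    private int counter = 0; // guarded by the monitor of this CounterBench

    private void addToTotalCount(long n) {
        totalCount.addAndGet(n);
    }

    // Runs 'threads' mutators for roughly 'millis' ms and returns the
    // total number of loop iterations completed across all threads.
    long runSynchronized(int threads, long millis) {
        Runnable mutator = new Runnable() {
            public void run() {
                long localCount = 0;
                while (running) {
                    // Lock on the enclosing bench so every thread shares one monitor.
                    synchronized (CounterBench.this) {
                        counter++;
                        counter--;
                    }
                    localCount++;
                }
                addToTotalCount(localCount);
            }
        };
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(mutator);
            workers[i].start();
        }
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        running = false; // volatile write stops all mutators
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return totalCount.get();
    }

    int counter() {
        synchronized (this) {
            return counter;
        }
    }
}
```

Swapping the body of the loop for the other variants (volatile, `ReentrantLock`, `AtomicInteger`) would give comparable counts on a given machine. Only the relative ordering is meaningful, not the absolute values.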





## **Compare and Set**

- Atomic primitive wrappers rely on CAS
  - unsynchronized thread safe type
  - good for atomic operations
- CAS is used to support thread safe lock-free algorithms
  - needs support from the hardware
  - cas mem\_addr, old\_value, new\_value
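In Java the `cas mem_addr, old_value, new_value` instruction surfaces as `compareAndSet(expected, update)` on the atomic classes. A minimal sketch of the usual retry loop follows; `CasCounter` and its `demo` method are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example of a CAS retry loop; names are illustrative.
class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    // Read the current value, compute the new one, and try to swap it in.
    // If another thread changed the value in between, the CAS fails and we retry.
    int increment() {
        while (true) {
            int old = value.get();
            int next = old + 1;
            if (value.compareAndSet(old, next)) {
                return next;
            }
        }
    }

    int get() {
        return value.get();
    }

    // Starts 'threads' workers that each increment 'perThread' times.
    static int demo(int threads, final int perThread) {
        final CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        counter.increment();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.get();
    }
}
```

No thread ever blocks here: a thread that loses the race simply loops and tries again with the fresh value, which is the optimistic counterpart to taking a lock.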



# **Coming Soon?**

- Cliff Click's lock-less concurrent HashTable
  - still a research project
  - extremely complex implementation
  - allows race conditions to determine state in the supporting state-machine
  - relies on CAS
- FIFO, LIFO?


# **JVM Improvements**

- JSE 6.0 adds to the list of features that can reduce contention
  - spin-waits (adaptive spinning)
  - lock coarsening
  - lock elision (with escape analysis)
  - biased locking
  - altered notify semantics (less lock jamming)



## Does any of this stuff actually work?


## How Good is Escape Analysis

#### Bench devised by Jeroen Borgers (Xebia)

```
// StringBuffer methods are synchronized; sb never escapes this method
public String concatBuffer(String s1, String s2, String s3) {
  StringBuffer sb = new StringBuffer();
  sb.append(s1);
  sb.append(s2);
  sb.append(s3);
  return sb.toString();
}

// StringBuilder is the unsynchronized equivalent
public String concatBuilder(String s1, String s2, String s3) {
  StringBuilder sb = new StringBuilder();
  sb.append(s1);
  sb.append(s2);
  sb.append(s3);
  return sb.toString();
}
```


# Lock Overhead

| Benchmark                   | StringBuffer | StringBuilder | %Overhead |
|-----------------------------|--------------|---------------|-----------|
| Baseline                    | 7896         | 2760          | 186%      |
| Escape Analysis             | 7875         | 2756          | 185%      |
| Elimination                 | 4068         | 2739          | 48%       |
| Biased                      | 5489         | 2843          | 93%       |
| Escape, Elimination         | 4078         | 2813          | 45%       |
| Escape, Biased              | 5500         | 2849          | 93%       |
| Elimination, Biased         | 4718         | 2812          | 68%       |
| Escape, Elimination, Biased | 4740         | 2828          | 68%       |
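The %Overhead column appears to be the extra time `StringBuffer` takes relative to `StringBuilder`. That reading is an assumption, but it reproduces the Baseline row exactly and every other row to within a percentage point (the small differences presumably come from rounding in the displayed timings, whose units the slide does not state). A short sketch of the arithmetic, with a hypothetical `Overhead` helper:

```java
// Hypothetical helper; the formula is inferred from the table, not stated in the talk.
class Overhead {
    // Relative cost of the synchronized StringBuffer over StringBuilder, in percent.
    static double percent(double bufferTime, double builderTime) {
        return (bufferTime - builderTime) / builderTime * 100.0;
    }
}
```

For the Baseline row, `Overhead.percent(7896, 2760)` is about 186.1, matching the 186% shown.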



## The Future is Clear

- Processors will contain many more cores
- Memory will be segmented
  - local segments as part of a global space
- Applications will continue to need to be hardware aware
- Language improvements
  - could closures offer better expression of parallelism?
  - totally new language?


# The Future is Clear

- Operating systems and hardware are being optimized to better support virtual machines












