REF:
====
https://www.bfilipek.com/2019/11/perfguidecpu.html
https://easyperf.net/notes/
https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux

PERF 
====
# Is a sampling profiler. Not 100% precies. Artifacts of measurement will have precision loss.
  perf stat <exe>  # course grain
  
# By default perf shows calle data. Expanding the tree will show the callers.
# But des not show proper call graph
  perf record <exe>
  perf report
  
# For callgraph use -g,  this requires, frame pointers
# Hence make sure to compile/link with -fno-omit-frame-ponter 
# This tells the compiler to stop deleting the frame pointer,allowing us to walk up/down the call stack
# At the cost of one register. 
  perf record -g <exe>
  perf report -g
 
# Can have a flags file for compiler options:
-O3
-std=c++14
-lc++abi
-fno-exceptions
-fno-rtti
-pedantic
-fno-omit-frame-ponter

  clang++ $(< flags) -o <test-name> <src file> -lbenchmark && ./<test-name>

# For proper callgraphs (top down) use:
# # 0.5 if for filtering function ?, caller means caller on top, then callee
  perf record -g <exe>
  oerf report -g "graph,0.5,caller"  

==================================================================
- http://www.brendangregg.com/USEmethod/use-linux.html                - Good collection of performance debug
- https://easyperf.net/notes/
- https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux
- https://easyperf.net/blog/2018/08/26/Basics-of-profiling-with-perf  - Basic perf understanding

Use of perf:  http://www.brendangregg.com/perf.html
===============================================================
  cd $WK ; cp /var/tmp/ma0/DEFS/metabuilder.def .
  Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  
  # Show system call overhead,as a summary.
  strace -c Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  strace -e trace=open Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def  # show what files opened
  
  # track library calls. It intercepts and records the dynamic library calls which are called 
  # by the executed process and the signals which are received by that process.  
  # It can also intercept and print the system calls executed by the program
  ltrace -c Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  ltrace -S -c Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  
  # Show page faults, data and instruction cache misses use:
  perf stat -d Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  
  # perf stat to run the same test workload multiple times and get for each count, the standard deviation from the mean.
  perf stat -r 5 -d Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  
  # For more detail and higher level overview, compile debug, -g means record stack traces
  perf record -g Base/bin/gcc-7.3.0/debug/perf_job_gen ./metabuilder.def
  perf report --sort comm,dso  # high level overview
  
  # show report to stdout, adjust sampling, avoid high number
  perf record -F 99 -g Base/bin/gcc-7.3.0/debug/perf_job_gen ./metabuilder.def
  perf report -n --stdio

  # for a graphical display, you can use flame graphs. This *ONLY* works properly with debug builds
  # and you must use -g (collect stack traces)
  # http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
  # assume you have down loaded Flamegraph, git clone https://github.com/brendangregg/FlameGraph
  # The wider the graph the more time is spent.
  perf record -g Base/bin/gcc-7.3.0/debug/perf_job_gen ./metabuilder.def
  perf script | $HOME/FlameGraph/stackcollapse-perf.pl > out.perf-folded
  cat out.perf-folded | $HOME/FlameGraph/flamegraph.pl > my_perf.svg
  display my_perf.svg
  
  *alternatively* view from the browser, as this will also show:
    - *full* function names
    - no of samples
    - percentage cpu times


The metabuilder.def below needs preparation for perf test below: 
o/ module load ecflow/5new
o/ mb; git checkout ci; ./generate -a  # switch to metabuilder ci, and generate all, so job generate can find all includes and scripts
o/ ecflow_client --port 3142 --host ecflow-metab --get > metabuilder.def
o/ Edit metabuilder.def and replace 
      edit ECF_HOME '/home/ma/deploy/servers/ecflow-metab.5062/metabuilder/...'
   with
      edit ECF_HOME '/var/tmp/ma0/workspace/metabuilder/...'
o/ Edit metabuilder.def and replace:
      edit REMOTE_HOST 'ecflow-metab' | edit WSHOST 'ecflow-metab'
   with 
      edit REMOTE_HOST 'polonius' | edit WSHOST 'polonius'


Use of perf:  http://www.brendangregg.com/perf.html
===============================================================
  cd $WK ; cp /var/tmp/ma0/DEFS/metabuilder.def .
  time Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  
  # Show system call overhead,as a summary.
  strace -c Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  
  # Show page faults, data and instruction cache misses use:
  perf stat -d Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  
  # perf stat to run the same test workload multiple times and get for each count, the standard deviation from the mean.
  perf stat -r 10 -d Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def
  
  # For more detail and higher level overview, compile debug, -g means record stack traces
  perf record -g Base/bin/gcc-7.3.0/debug/perf_job_gen ./metabuilder.def
  perf report --sort comm,dso  # high level overview
  
  # for a graphical display, you can use flame graphs. This *ONLY* works properly with debug builds
  # and you must use -g (collect stack traces)
  # assume you have down loaded Flamegraph, git clone https://github.com/brendangregg/FlameGraph
  # The wider the graph the more time is spent.
  perf record -g Base/bin/gcc-7.3.0/debug/perf_job_gen ./metabuilder.def
  perf script | $HOME/FlameGraph/stackcollapse-perf.pl > out.perf-folded
  cat out.perf-folded | $HOME/FlameGraph/flamegraph.pl > my_perf.svg
  display my_perf.svg
  
  
Using valgrind
==========================================================================
# To see the callgraph make sure you use valgrind --tool=callgrind
# and *NOT* valgrind --tool=cachegrind
valgrind --tool=callgrind Base/bin/gcc-7.3.0/debug/perf_job_gen ./metabuilder.def
kcachegrind callgrind.out.18473


test job creation
==========================================================================
 - Note writing to scratch can be slow, this can be overridden by user specfiying
 - thier own directory:
 
export PYTHONPATH=/var/tmp/ma0/workspace/ecflow/Pyext/ecflow
cat > tmp.py << EOF
import shutil
from ecflow import *

defs = Defs("metabuilder.def")

job_ctrl = JobCreationCtrl()
job_ctrl.set_dir_for_job_creation("/var/tmp/ma0/tmp/ecflow")  # generate jobs file under this directory
#job_ctrl.set_verbose(True)
defs.check_job_creation(job_ctrl)
print(job_ctrl.get_error_msg())

#print("removing job generation directory tree " + job_ctrl.get_dir_for_job_creation())
#shutil.rmtree(job_ctrl.get_dir_for_job_creation())     
EOF

strace -c python tmp.py


strace
===========================================================================

# strace with table of system calls and percentages ******
cd $WK ; cp /var/tmp/ma0/DEFS/metabuilder.def .
strace -c Base/bin/gcc-7.3.0/release/perf_job_gen ./metabuilder.def

