On-StackReplacement
StephenJ.Fink
IBMT.J.WatsonResearchCenterFengQian
SchoolofComputerScience,McGillUniversity
YorktownHeights,NY10598sjfink@us.ibm.com
Abstract
Modernvirtualmachinesoftenmaintainmultiplecom-piledversionsofamethod.Anon-stackreplacement(OSR)mechanismenablesavirtualmachinetotransferexecutionbetweencompiledversions,evenwhileamethodruns.Re-lyingonthismechanism,thesystemcanexploitpowerfultechniquestoreducecompiletimeandcodespace,dynami-callyde-optimizecode,andinvalidatespeculativeoptimiza-tions.
Thispaperpresentsanew,simple,mostlycompiler-independentmechanismtotransferexecutionintocompiledcode.Additionally,wepresentenhancementstoananalyticmodelforrecompilationtoexploitOSRformoreaggres-siveoptimization.WehaveimplementedthesetechniquesinJikesRVMandpresentacomprehensiveevaluation,in-cludingastudyoffullyautomatic,online,profile-drivende-ferredcompilation.
1Introduction
Modernvirtualmachines(VMs)oftenmaintainmultiplecompiledversionsofamethod.Forexample,state-of-the-artJavaVMsemployadaptivestrategiestoselectivelycom-pileandrecompilehotmethods[2,12,14,10].AVMcaneasilytransitiontoanewcompiledversionofamethodbymodifyingdispatchstructures,sothatfuturemethodinvo-cationsbranchtothenewcompiledversion.However,thetransitionforamethodthatiscurrentlyexecutingonsomethread’sstackpresentsaharderengineeringchallenge.Toperformthistransition,theSelfprogramminglan-guageimplementations[5,10]pioneeredon-stackreplace-ment(OSR)technology.OSRtechnologyenablesfunda-mentalVMenhancements,includingdebuggingoptimizedcodeviade-optimization[9],deferredcompilationtoim-
Montreal,Canada
fqian@sable.mcgill.ca
Instead,theprogramtrapsuponreachinganuncommonblock,andinvokesOSRtorecompileandtransitiontonewcode.Deferredcompilationbringstwopotentialbenefits.First,theoptimizercanbenefitfromimproveddataflowbyeliminatingcontrolflowmergesdownstreamfromuncom-monblocks.Second,thetechniquereducescompile-time.Recentresearchsuggestspotentialgainsusingprofile-directeddeferredcompilationforJava.Thepreviousstud-iesreliedonoffline[16,3]profiledataandofflinewhole-programbytecode-leveltransformations[16].Buildingonthiswork,thispaperpresentsthefirstpublishedevalu-ationofcompletelyautomatic,online,profile-drivende-ferredcompilation.Weevaluatecompile-time,codesize,andcodequalityimprovementsatmultipleoptimizationlevels,consideringbothanideallimitcaseandacompleteimplementation.Wecomparedeferredcompilationbasedon“perfect”offlineandonlineprofiledata,andadditionallyevaluatetheimpactofdeferredcompilationonadaptivere-compilationactivitydrivenbyananalyticmodel.Themaincontributionsofthispaperare
anew,simple,relativelycompiler-independentmecha-nismtotransferanactivation’sexecutiontospecializedoptimizedcode,
enhancementstoananalyticmodeltodriverecompila-tiontoaccountfordeferredcompilationandon-the-flyreoptimization,and
thefirst(toourknowledge)comprehensiveexperimen-talevaluationofautomatic,online,profile-drivende-ferredcompilation.
Theexperimentalresultsshowthatintheidealcase,profile-drivendeferredcompilationimprovescompile-timeby3-13%,dependingontheoptimizationlevel.Codesizereductionsrangefrom7-16%.ConstrainingoptimizationtoallowOSRinvalidationreducesthesebenefitsslightly.Overall,theperformanceimpactofeliminatingdataflowmergesandconstrainingoptimizationappearssmall,rang-ingintheidealcasefrom0.5%to2.2%,dependingonopti-mizationlevel.UsingOSRtoenableclass-hierarchybasedunguardedinlininghasanegligiblebenefitcomparedtothesystem’sdefaultstrategy.
Afullyonlineimplementationofdeferredcompila-tion,asexploitedbyamodel-drivenadaptiverecompila-tionsystem,improvesperformanceonthefirstrunoftheSPECjvm98benchmarksbyanaverageof2.6%,withim-provementonindividualbenchmarksrangingfrom+8%to-4%.Duringthefirstrun,theindividualbenchmarkstriggerbetween0and10OSRevents.
Theremainderofthispaperproceedsasfollows.Sec-tion2reviewssomeJikesRVMbackgroundrelevanttothisstudy.Section3presentsanewOSRtransitionmecha-nismanddiscussesimplementationdetailsinoursystem.
Section4presentsapolicytoallowtheadaptiveoptimiza-tionanalyticmodeltoexploittheOSRmechanisminitsdecision-making.Section5presentstheexperimentaleval-uation.Finally,Section6reviewsrelatedworkandSec-tion7concludes.
2Background
WenowreviewsometechnicaldetailsoftheJikesRe-searchVirtualMachine(RVM)whichpertaintothisstudy.TheRVMprovidesanexecutionengineforJavabyte-codes,althoughnotacompleteJavaplatform.TheRVMhasacompile-onlyapproach;thesystemcompilesallbyte-codetonativecodebeforeexecution.Thesystemhastwocompilers,thebaselinecompilerandtheoptimizingcom-piler.Thebaselinecompilergeneratescodequickly,imple-mentingastraightforwardinterpretationoftheJVMstackarchitecture[11].Theresultantcodequality,naturally,isrelativelypoor.Theimplementationofthebaselinecom-pilerismostlyarchitecture-dependent,andthesystemcur-rentlysupportstwoarchitectures(IA32andPowerPC).Theoptimizingcompilerprovidesafullsuiteofop-timizations.Theimplementationismostlyarchitecture-independent.Forthisstudy,theoptimizingcompilerpro-videsthreelevelsofoptimization:
Level0consistsmainlyofasetofoptimizationsper-formedon-the-flyduringthetranslationfrombyte-codestotheintermediaterepresentation.Currently,thecompilerperformsthefollowingoptimizationsdur-ingIRgeneration:constant,type,non-null,andcopypropagation;constantfoldingandarithmeticsimplifi-cation;unreachablecodeelimination;eliminationofredundantnullchecks,checkcasts,arraystorechecks,andinliningoftinymethodsatmonomorphiccallsites.Level1augmentslevel0withadditionallocalopti-mizationssuchascommonsubexpressionelimination,arrayboundscheckelimination,andredundantloadelimination.Italsoaddsinliningbasedonstatic-sizeheuristics,1globalflow-insensitivecopyandconstantpropagation,globalflow-insensitivedeadassignmentelimination,synchronizationoptimizations,andscalarreplacementofaggregatesandshortarrays.
Level2augmentslevel1withSSA-basedflow-sensitiveoptimizations.InadditiontotraditionalSSAoptimizationsonscalarvariables[6],thesystemalsousesHeapArraySSAform[8]toperformredundant
loadelimination,globalcodemotion,andcommonsubexpressionelimination.
Theadaptiveoptimizationsystemdrivesrecompilationdecisionsbasedontheestimatedcostandbenefitofcom-pilingeachmethod.Thesysteminitiallycompileseachmethodwiththebaselinecompiler,andastheprogramruns,mayrecompileeachmethodwiththeoptimizingcompiler.Atagivenpointduringprogramexecution,theadaptivesys-temcontrollerwillconsideramethodforrecompilationguidedbythefollowinganalyticmodel.
Supposemethodiscurrentlyoptimizedatoptimiza-tionlevel,,definingbaseline-compiledcodetobeoptimizationlevel-1.Definethefollowing:
,theexpectedtimetheprogramwillspendexecutingmethodinthefuture,ifisnotrecompiled.
,thecostofrecompilingmethodatoptimization
level,for.
,theexpectedtimetheprogramwillspendexecutingmethodinthefuture,ifisrecompiledatlevel.
Usingtheseestimatedvalues,thecontrollerchoosestherecompilationlevelthatminimizestheexpectedfuturerunningtimeofarecompiledversionof;i.e.,itchoosesthethatminimizesthequantity.If,
thecontrollerdecidestorecompile
atlevel;otherwiseitdecidestonotrecompile.
Thecontrollerestimatesthequantitiesandbasedonofflineprofiledata,andsimplemodelsofcompilerandprogrambehavior[2].
3OSRMechanism
TheOSRtransformationperformsthefollowingsteps:1:extractcompiler-independentstatefromasuspendedthread,
2:generatenewcodeforthesuspendedactivation,and3:transferexecutioninthesuspendedthreadtonewcom-piledcode.
Thissectionpresentsmechanismstoaccomplishthesesteps.
3.1JVMscopedescriptor
ForJava,wedefinethe’compiler-independent’stateofarunningactivationtobeascopedescriptor[9]basedonthestack-basedJavaVirtualMachinearchitecture[11]:Definition1TheJVMscopedescriptorofamethodactiva-tionconsistsofthethreadrunningtheactivation,thepro-gramcounterasabytecodeindex,valuesofeachlocalvari-ableandstacklocation,andareferencetotheactivation’sstackframe.
Figure1(a)and(b)showasimpleexampleinJavasourceandbytecode,respectively.Figure1(c)showsaJVMscopedescriptorforarunningactivationmethodofmethodsum.ThisscopedescriptorcapturesaruntimesnapshotoftheJVMstatein“sum”beforebytecodeindex16afterloop-ing50times.
3.2Specializedcodegeneration
TogeneratetargetcodeforanOSRtransition,previoussystemssuchasHotspot[12]wouldcompileoneversionofthetargetmethodwithtwoormoreentrypoints,andusethisversionbothforOSRtransitionsandfuturemethodin-vocations.Wepresentasubstantiallydifferentmechanismtogeneratetargetcode.
RecallthatJikesRVMhasmultiplecompilerimplemen-tations;so,wesetouttoimplementOSRwithminimalchangestotheunderlyingcompilers.Tothisend,ourdesigncompilesaspecializedversionofthemethodforeachacti-vationthatisreplaced,aswellasanewversionforfutureinvocations.Ourdesignisrelativelycompiler-independent,asthetransitiontooptimizedcodeisexpressedinbytecodeandhandledwithnormalcompilermechanisms.
ThekeyinsightunderlyingtheJikesRVMmechanismisthatgiventheJVMscopedescriptorofamethod,wecanconstructamethod,inbytecode,thatsetsupthenewstackframeandcontinuesexecution,preservingsemantics.Todothis,weprependtotheoriginalbytecodesaspecializedprologuethata)savesvaluesintolocals,b)loadsvaluesonthestack,andthenc)jumpstothecurrentprogramcounter.Notethatweexpressthestackframesetupinthesourcelanguage(bytecode),andrelyonthetargetcompilertoim-plementthesetupprocedureasitseesfit.
Figure1(d)showsthespecializedbytecodegeneratedfortheJVMscopedescriptorfromFigure1(c).Byconstruc-tion,executingthespecializedmethodofFigure1(d)hasthesamesemanticsascontinuingexecutionoftheinter-ruptedactivationbasedonthecodein1(b).
WhenanOSRhappensinaninlinedcontext,were-coveraJVMscopedescriptorforeachvirtualinlinedstackframe.Wegenerateaspecializedversionofeachinlinedmethod,constructedsuchthatcallingthespecializedrootmethodoftheinlinedcontextrestoresallthereplacementstackframes.Theprologueofthespecializedrootrestorestherootmethod’sstate,thenimmediatelycallsaspecializedversionofthecallee.Whenthespecializedcalleereturns,therootmethodcontinuesexecutionimmediatelyafterthecall.Figure2showsonesimpleexamplewhereOSRhap-pensinthecalleebar.Thecallerfoorestoresitsexecutionstatebeforethecallsite,callsthespecialized“bar
classC{
staticintsum(intc){inty=0;
for(inti=0;i (a)asimpleexample 0123478910111415161920iconst_0istore_1iconst_0istore_2goto14iload_1iload_2iadd istore_1iinc21iload_2iload_0 if_icmplt7iload_1ireturn (b)bytecodeforsum ldc100istore_0ldc1225istore_1ldc50istore_2ldc50ldc100goto16 runningthread:MainThreadframepointer:0xSomeAddressprogramcounter:16 localvariables:L0(c)=100;L1(y)=1225;L2(i)=50;stackexpressions:S0=50;S1=100; (c)JVMscopedescriptorforanactivationofsum 0iconst_0... 16if_icmplt7 ... 20ireturn (d)specializedversionofsumforOSRtransition Figure1.ExampleofOSRtransitionmechanism voidfoo(){ bar();A:...} voidbar(){ ...B:...} (a)Sourcecode.Assumetheoptimizingcompilerin-linesbarintofooandthenforcesanOSRatprogrampointB. foo_prime(){ bar_prime(){ (b)targetcodeforOSRtransition Figure2.ExampleofOSRtransitionwithinlining.Callingfoo IRmodelsthesemanticsofOSRPointsimilartoaCALLinstructionthatusesallthelivevariablesneededtore-coverJVMstate.LikeaCALLinstruction,anOSRPointwillthusconstrainsomeoptimizations,includingdeadcodeelimination,loadelimination,storeelimination,andcodemotion.UnlikeaCALLinstruction,anOSRpointtrans-ferscontroltotheexitblock,sothereisnomergebacktoreachablecodeafteranOSRPoint. IfanOSRPointisrequiredinaninlinedmethodcontext,theinstructionkeepsaliveallstaterequiredtorecovertheentireinlinedcontext. Afterregisterallocation,thecompilermapsthestateusedbyOSRPointinstructionstothecorrespondingphys-icalpositions(registernumberorspillingoffset),maintain-ingtypeinformation.Thesystemencodesthemapinata-ble. ThecompilerexpandsOSRPointinstructionsintocallsintotheJikesRVMthreadscheduler.WhencalledfromanOSRPoint,thethreadsystemsuspendsthecallingthread.Aftercompilingcodeforthespecializedactivation,therun-timesystememitsmemorybarrierstoforceSMPmemoryconsistency,andthenwakesupthethreadandtransfersex-ecutiontothenewcompiledcode. 4OSRPolicy 4.1DeferredCompilationPolicy TheSelf-91implementation[5]introduceddeferredcompilation(sometimesreferredtoasuncommonbranchextensionorpartialmethodcompilation)asatechniquetoreducecompiletimeandpotentiallyimprovecodequality.Withthistechnique,thecompileravoidsgeneratingcodethatitpredictswillrarelyexecute.Ifthepredictionfails,thecompilertraps,recompilesthemethod,andperformsOSRtotransfertothenewcode. Previousworkhaspresentedseveralpoliciesforpredict-inguncommonbranchesfordeferredcompilation.InSelf,thecompilercouldoftenpredicttypeswithhighconfidence,andpredictbranchesthatsignaltheappearanceofothertypestobeuncommon.Forexample,thecompilercouldpredictthatavariableusedinanarraysubscriptexpressionhadconcretetypeinteger.Thistechniquewashighlyef-fectiveforSelf,sincethis“pureobject-oriented”languageemployspolymorphictypeseverywhere. TheHotSpotservercompiler[12]similarlyusestypepredictionbasedonthecurrentclasshierarchytoinlinecur-rentlymonomorphiccallsites,andtrapsandOSRsshouldclassloadinginvalidatetheprediction.Thiscompileralsopredictsotherlanguagefeatures,suchasclassinitialization,tobeuncommonevents. Whaley[16]studiedprofile-directeddeferredcompila-tion,predictinguncommonbranchesbasedonofflinepro- filedata. WehaveimplementeddeferredcompilationcombiningsomeofthepoliciespresentedbyWhaleyandusedinHotSpot.Namely,wehaveimplementedanonlinever-sionofprofile-directeddeferredcompilation,aswellastheclass-hierarchybasedinliningtrapsusedinHotSpot. 4.2IntegrationintoAdaptiveOptimizationSys-tem Asmentionedearlier,theJikesRVMincludesasophisti-catedadaptiveoptimizationsystemthatdriveson-linecom-pilation.Wemodifiedthecontrollerevaluationalgorithmtoaccountfordeferredcompilation. AsdescribedinSection2,theadaptiveoptimizationsys-temrecompilesamethodatoptimizationlevelchosen tominimize .Deferredcompilationshouldaccel-eratecompilation,soadaptiverecompilationcanaffordtobemoreaggressive.Tothisend,wemodifiedtheanalyticmodeltoaccountforthereducedcompiletime.Whencon-sideringamethod forrecompilation,thesystemcom-putesthepercentageofthesourcecode,,thatprofiledataindicateswasdynamicallyreached.Sincenearlyalltheop-timizingcompilerphasesaredesignedtoruninlineartime, weestimatethecostofoptimizingtobe .Thus,theanalyticmodelaccountsforthebenefitofdeferredcom-pilationonaper-methodbasis. Toreduceoverhead,thesystemcomputesthefractiononlyonceforeachmethod,andcachesit.Thismightresultinthemodelusinganoutdatedformakingdecisions;wesuspectthiseffectissmallinpracticesincethemodelwillnotcomputeuntilthemethodhasrunforsometime.Inadditiontodeferredcompilation,wehaveimple-mentedpromotion,wherebythesystemmayoptimizealong-runningactivationrunningbaseline-compiledcode.Normally,theadaptiveoptimizationsystemconsidersop-timizingamethodbasedonthetotaltimespentinthatmethodinthepast.Toreconcilepromotionwiththispolicy,whenthesystemoptimizesamethod,itstartstomonitortimespentinoutdatedversionsofthecode(i.e.,baseline-compiledversions).Toconsiderpromotinganoutdatedac-tivation,thesystemusesthedefaultanalyticmodel,butes-timatesthefuturetime basedonthetimespentintheoutdatedcode,ratherthanthetotaltimespentinallcom-piledversionsofthemethod. 5ExperimentalResults WehaveimplementedthesetechniquesinJikesRVMv2.1.1.WereportexperimentalresultsontheSPECjvm98benchmarks[15].TheresultsreporteddonotconformtotheofficialSPECrunrules,soourresultsdonotdirectlyorindirectlyrepresentaSPECmetric.Allrunsusethesize 100inputs,andruneachdistinctbenchmarkinanewVMinstance. AllexperimentsranonadedicatedIBMRS/6000ModelF80withsix500MHzprocessorsand4GBofmainmem-ory,runningAIX4.3.3.JikesRVMwasrunwithaFas-tAdaptiveSemispaceconfiguration,usingonevirtualpro-cessor,400MBofsmallobjectheap,and100MBoflargeobjectheap. 5.1Offlinedata First,weexamineexperimentalresultsbasedonoffline“perfectprofiles”.Thatis,wegathertheprofiledataof-fline,andthenruntheprogramusingthesameinput.Theseexperimentsreveal,tothefirstorder,anupperboundongainspossiblewithon-linedeferredcompilation.Fortheseruns,weuseaconfigurationofJikesRVMthatloadsandprecompilesallclassesbeforethefirstrunoftheprogram;thus,compilationactivitydoesnotfactorintothereportedperformanceresults. Weconsiderdeferredcompilationusingoneorbothofthefollowingtwopolicies: EdgecountersWedefercompilationofanybasicblocks thattheprofiledataindicateswereneverexecuteddur-ingtheprofilingrun.StaticheuristicsWedefercompilationofthefailedbranch ofanyinlineguardatacallsitethatismonomorphicintheloadedclasshierarchy.Notethatusingtheperfectprofiles,deferringcompila-tionissafeinthattheuncommoncodeneverexecutes.Weconsiderthefollowingfourvariantsofthesystem:IdealThisconfigurationdeferscompilationbasedonboth edgecountersandstaticheuristics.Inplaceofuncom-moncode,thecompilerinsertsanunconditionaltrapinstructionthatdoesnotkeepanyprogramstatelive.Ideal-OSRThisconfigurationdeferscompilationbasedon bothedgecountersandstaticheuristics.Inplaceofuncommoncode,thecompilerinsertsanOSRPointthatkeepsstatelivetoallowrecoveryofaJVMscopedescriptor.Static-OSRThisconfigurationdeferscompilationbased solelyonstaticheuristics.Inplaceofuncommoncode,thecompilerinsertsanOSRPoint.EagerThisconfiguration(thedefaultJikesRVM)doesnot performdeferredcompilation.Intheseofflineexperiments,thedifferencebetweenIdealandIdeal-OSRisthatOSRpointsconstrainoptimiza-tion.Thus,thedifferencebetweentheseconfigurations quantifiesthecostoftheseconstraints,comparedtoaper-fectoraclethatcouldpredictthefutureandthrowawayun-reachedcodeunconditionally. ThedifferencebetweenStatic-OSRandIdeal-OSRquantifiesthebenefitsofperformingdeferredcompilationbasedonprofiledata,asopposedtosimplyusingOSRforclass-hierarchy-basedinlining. Clearly,theimpactofdeferredcompilationoncompilerspeedorcodequalitydependshighlyontheunderlyingcompilerinfrastructure.Toexplorethisspace,wereportstatisticsforeachoftheoptimizingcompiler’sthreeopti-mizationlevels. Figure3showsthecompilerspeed,inbytecodebytespermillisecond,ofeachsystemconfiguration.Theresultsshowanaverageupperbound(Ideal)improvementincompile-time,comparedtoeagercodegeneration,of3%,13%,and11%atoptimizationlevelsO0,O1,andO2,respectively.Whenconstrainingoptimizationtoaccountforpotentialin-validation(Ideal-OSR),theimprovementsdropto3%,7%,and10%,respectively.So,constrainingoptimizationhasasignificanteffectoncompile-timeimprovements,espe-ciallyathigheroptimizationlevels.Acrosstheboard,main-tainingOSRpointsonlyatstatically-predictedinlineguardfailuresactuallyslightlydecreasescompilerspeed,ascom-paredtosimplygeneratingtheeliminatedcallinstruction.WeconcludethatmaintaininganOSRpointintheIRen-tailsslightlymoreworkthanmaintainingacallinstruction.Figure4showsthesizeoftheresultantmachinecode,inmachinecodebytesperbytecodebyte.Theresultsshowanaverageupperbound(Ideal)improvementingeneratedcodesize,comparedtoeagercodegeneration,of7%,16%,and13%atoptimizationlevelsO0,O1,andO2,respectively.WhenconstrainingoptimizationwithOSRPoints,(Ideal-OSR),theimprovementsdropto7%,13%,and9%,respec-tively.MaintainingOSRpointsonlyforinlineguardshasanegligibleeffectoncodesize;asexpected,sincethede-ferredcompilationelidesonlyasinglecallinstructionanditsassociatedsetup. Figure5showsthespeed,relativetotheeagerconfig-uration,oftheresultantcodeforeachconfiguration.ThedifferencebetweentheIdealandEager(baseline)config-urationshowsthemaximumspeedupthecompilercouldhopetoobtainbythistechnique.Theresultsshowthatonaverage,idealdeferredcompilationcouldimprovegen-eratedcodequalityby0.5%,1%,and2.2%atoptimiza-tionlevelsO0,O1,andO2,respectively.Naturally,im-proveddataflowhelpsmoreathigheroptimizationlevels.Thejessandmtrtbenchmarksseethelargestpotentialgain,eachroughly5%.Nevertheless,overallthesenum-bersshowonlyamodestpotentialforgainduetoimproveddataflow. TheIdeal-OSRconfiguration,whichkeepsOSRdatalive,onaveragedoesnotchangeperformance.Thejess 10IdealIdeal-OSRStatic-OSR5Eager0compressjessdbjavacmpegaudiomtrtjackG. Meana)64IdealIdeal-OSRStatic-OSREager20compressjessdbjavacmpegaudiomtrtjackG. Mean4b)3Ideal2Ideal-OSRStatic-OSREager10compressjessdbjavacmpegaudiomtrtjackG. Meanc) Figure3.Compilerspeedusingoffline“per-fect”profiledataatoptimizationlevela)O0,b)O1,andc)O2. 1086IdealIdeal-OSRStatic-OSR4Eager20compressjessdbjavacmpegaudiomtrtjackG. Meana)10IdealIdeal-OSRStatic-OSR5Eager0compressjessdbjavacmpegaudiomtrtjackG. Meanb)10IdealIdeal-OSRStatic-OSR5Eager0compressjessdbjavacmpegaudiomtrtjackG. Meanc) Figure4.Finaloptimizedmachinecodesizeusingoffline“perfect”profiledataatopti-mizationlevela)O0,b)O1,andc)O2. Machine Code Size (mcb/bcb)Compilation Rate (bcb/ms)Machine Code Size (mcb/bcb)Compilation Rate (bcb/ms)Machine Code Size (mcb/bcb)Compilation Rate (bcb/ms)1.10noitarugifn1.05oc regIdealaEIdeal-OSR otStatic-OSR evita1.00lercompressjessdbjavacmpegaudiomtrtjackG. Mean deepS0.951.10a)noitarugifn1.05oc regIdealaEIdeal-OSR otStatic-OSR evita1.00lercompressjessdbjavacmpegaudiomtrtjackG. Mean deepS0.951.10b)noitarugifn1.05oc regIdealaEIdeal-OSR otStatic-OSR evita1.00lercompressjessdbjavacmpegaudiomtrtjackG. Mean deepS0.95c) Figure5.Speedupobtained(biggerisbetter)comparedtothedefault(eager)codegener-ationstrategy,whenusingdeferredcompila-tionstrategiesatoptimizationlevela)O0,b)O1,andc)O2.Theseresultsuseoffline“per-fect”profiledata. benchmarkseesthelargestimprovement,of4%.The5%idealmtrtgainvanishes,insteadproducinga1.5%degra-dation.jackseesasubstantialdegradation,of4%.Over-all,thenumbersindicatethatthecostofconstrainingopti-mizationwithOSRPointsroughlynegatesthesmallbene-fitduetoimproveddataflow,withvariancesonindividualbenchmarks. ThecompilershouldbeabletoapproachtheIdealper-formancebysinkingcodeintouncommonsectionstoma-terializestatethatisotherwisedead.Whaley[16]im-plementedaspecializedsinkingtransformationtoaccom-plishthis.WearecurrentlyworkingonimprovementstotheRVMoptimizer’scodemotion,tomoreeffectivelysinkcomputationintouncommonbranchextensions. PlacingOSRpointsusingthestaticstrategyhasanegli-gibleimpactonperformance.ThisindicatesthattheJikesRVMpatch-pointbasedguardedvirtualinlining[13]andstaticsplittingheuristics[4]handleclass-hierarchybasedinliningeffectively. 5.2Onlinedata Now,weexaminetheperformanceofdeferredcompila-tioninafullyautomaticon-lineadaptiveoptimizationsys-tem.Fortheseexperiments,thebaselinecompilerinsertsin-strumentationintoeachmethodtogatheredgecounterdata.Thatis,thesysteminsertscountersthatcountthenumberoftimeseachbranchistakenandnottaken.Whenoptimizingamethod,theoptimizingcompilerpropagatestheinforma-tionandinsertsanOSRPointatthebeginningofeachbasicblockthatthecountersindicatewasneverreachedwhiletheunoptimizedcoderan. Naturally,aftercodeisoptimized,itmaytraverseanewpathandexecuteablockthatwasnotpreviouslyexecuted.Inthiscase,thesystemperformsOSRtogenerateuncom-moncodeasneeded.WhenperformingOSR,thesystemimmediatelyinvalidatestheoptimizedcodewhichtriggeredthedeferredcompilation.Whentheprogramnextenterstheinvalidatedmethod,thesystemre-optimizesthemethodwithoutusingdeferredcompilation. Wereportperformancebasedonwall-clocktime,whichincludesallonlineprofiling,decisionmaking,recompila-tion,invalidation,andOSRtransitionactivity. Figure6ashowstheperformanceduringthefirstrunofeachoftheSPECjvm98benchmarks.Theresultsshowthatonaverage,theOSRtechniquesimproveperformanceofthefirstrunofaSPECjvm98benchmarkby2.6%.UsingonlystaticplacementofOSRpointsforinliningimprovesperformanceby1.8%.Thecompressbenchmarkshowsasignificantimprovement(around8%);onlythisbenchmarkrunslong-runningloopswhichtriggerOSRpromotionfrombaselinetooptimizedcode.Thedbandjavacbench-marksshowimprovementsof8%and4%respectively,pre- 1.15noitarug1.10ifnoc regaE o1.05OSR-edge countstOSR-static evitaler pu1.00decompressjessdbjavacmpegaudiomtrtjackG. MeanepS0.951.15a)noitarug1.10ifnoc regaE o1.05OSR-edge countstOSR-static evitaler pu1.00decompressjessdbjavacmpegaudiomtrtjackG. MeanepS0.95b) Figure6.Speedupobtained(biggerisbetter)comparedtothedefault(eager)codegenera-tionstrategy,whenusingmulti-leveladaptiverecompilation:a)firstrun.b)bestrunof10. sumablyduetomoreaggressivecompilation.Themtrtperformancedegradesby2-4%.Thisbenchmarkcontainsafewhotmethodsthatdominateperformanceafterashortstart-upphase.ItappearsthatwithOSR,theadaptivesys-temismoreaggressivecompilingmethodsduringthestart-upphase.Thisdelaysthetimetoreachthecomputationallyintensivemethods,anddelaysoptimizationofthesemeth-ods,hurtingperformance.Itremainsanopenissueonhowtodesignanadaptiverecompilationstrategythatavoidsthispathologicalbehavior. Figure6bshowstheperformanceonthebestof10runsofeachbenchmark,usingtheOSRtechniques.Theoverallperformance,comparedtoeagercompilation,isvirtuallyunchanged,althoughindividualbenchmarksvaryslightly.Thisisconsistentwiththedatafromtheofflinestudy. Table1showsthecompilationactivityduringthefirstrunofeachbenchmark,withandwithoutOSRtechniques.Thefirsttwocolumnsreportsthenumberofon-stackre-placementeventsinthefirstrunofeachbenchmark.Onlycompresstriggerspromotionevents.SixofthesevenbenchmarkstriggerOSRinvalidationsduetodeferredcom-pilation.Alltheseinvalidationsresultfromincompletecodecoverageprofiledata;nonerepresentfailedinliningsinval-idatedbydynamicclassloading. Thus,inallcases,bythetimetheadaptivesystemopti-mizedamethod,ithadloadedallclassesthataffectinlineguards.However,inafewcases,theon-lineprofiledata didnotcoveralldynamicallyreachedinstructions,causinglaterinvalidations.Weverifiedthateachinvalidationtakesatmostontheorderof10microseconds;thuscausingasmalldirectperformancehit. Theremainingcolumnsofthetableshowthenumberofmethodstheadaptivecontrollerchosetocompileateachoptimizationlevel.RecallthatusingOSR,thecontrollermodelshouldcompilemoreaggressivelysinceitestimatesareductionincompile-time.Ingeneral,thedatabearsthisout,asoverallthesystemcompilesmoremethodsusingtheOSRheuristics.Thechangeincompileractivityisnotdra-matic(4%moremethodscompiled),corroboratingearlierexperiencesthattheanalyticrecompilationmodelisrela-tivelyinsensitivetovariancesincompilerperformance[1].Thetableshowsaninterestingcounter-intuitivetrendinthatwithOSR,fewermethodsrisetothehighestoptimiza-tionlevel.Wesuspectthatthecontrollercost-benefitmodelassignshigherprioritytothefirst(low-level)optimizationofamethod,sincethisoptimizationgivesthegreatestab-solutegain.ItseemsthatthisphenomenonisexaggeratedbythemodifiedOSRcontrollermodel,causingmore“low-level”optimizationandindirectlydelaying“high-level”op-timization. Wenotethatonshorttime-scales,thebehavioroftheadaptivesystemdependsonacomplexandnon-linearfeed-backloop,varyingsubstantiallybasedoncompetingesti-matesofcostsandbenefitsandsampledata.Althoughthebehaviorappearstoberelativelystableonaggregate,itre-mainsanopenproblemtobetterunderstandhowchangestocompilerperformanceandcodequalitytranslateintoover-allsystemperformance. 6RelatedWork On-stackreplacementtechniquesoriginatedinSelf-91’sdeferredcompilation[5].TheSelf-91compilerdefersgen-eratingcodeforthetargetsofuncommonbranches,predict-ingthesebranchesbasedontypesappearinginnormalcode.Inthecasethatanuncommonbranchtargetexecutes,thesystemgeneratescodeondemandusinguncommonbranchextension.TheoriginalmotivationfordeferredcompilationinSelf-91wastoreducecompiletimeandavoidconser-vativedataflowmerges.Later,Self-93[10]usedOSRtoimproveperformancewithadaptiverecompilation,transfer-ringexecutionfromslowcodetofastercodeon-the-fly.Selfalsopioneeredtheconceptofusingde-optimizationtodebugoptimizedcode[9],whichremainsoneofthemostcompellingapplicationsofOSR.Wehavenotyetimple-mentedthisapplication,astheJikesRVMdebuggerframe-workisstillimmatureandunstable. OfcurrentproductionJavaVirtualMachines,onlytheHotSpotservercompiler[12]hasreportedOSRtechnol- Benchmarkcompressjessdbjavac mpegaudiomtrtjackTotal WithOSR InvalidationsO0 617049181017116855715924429 O2 212273825 WithoutOSRO1917416291126112 total2860171871017585553 Table1.CompilationactivitybytheJikesRVMAdaptiveOptimizationSystemduringthefirstrunofeachSPECjvm98benchmark,measuredinmethodscompiled.Promotionsrepresenttransitionsfrombaselinetooptimizedcodeforalong-runningactivation.InvalidationsrepresentOSRtransitionsduetoincorrectlypredicteddeferredcompilation.Theremainingcolumnsreportthenumberofmethodsoptimizedateachoptimizationlevel. ogy.2HotSpottransitionsfrominterpretedcodetoopti-mizedcodeforlong-runningloops,basedoninvocationandloopcounters.Additionally,theHotSpotcompilerusesde-ferredcompilationtoavoidgeneratingcodeforuncommonbranches,suchasthefailedbranchofclass-hierarchy-basedguardedinlining,andprogrampointsthatinvokeclassini-tializers.Toourknowledge,HotSpotdoesnotperformprofile-drivendeferredcompilationexcepttoinlinecallsitesthatprofiledataindicatesaredynamicallymonomor-phic. Whaley’s[16]partialmethodcompilationessentiallyap-pliesSelf-91’suncommonbranchextensionto“rare”blocksasdeterminedbyheuristicsonprofiledata.Thepresentedresultsfocusmainlyoncompile-time;thereweresmallim-provementsinrunningtime,attributedtobetterdataflow.Themethodologyofourdeferredcompilationstudydif-fersfromWhaley’sinmanydimensions.Inparticular,fortheexperimentalresults,Whaleyimplemented“methodoutlining”atthebytecodelevel,ratherthanengineeringOSRintotheoptimizingJIT.Thishasimplicationsonbothcorrectness(aswediscussedforpseudo-bytecodes),andonthequalityofthegeneratedcode.Furthermore,sinceWha-leyusedtheIBMDKasa“blackbox”torunthetrans-formedcode,itisnotclearhowtheunderlyingJITin-linesandoptimizesthegeneratedsourcecode.Incontrast,ourexperimentalresultsbreakdowntheimpactofOSRpointsoncodequalityduetodataflowimprovementsandoptimizationconstraints.Finally,asdescribedinthelit-erature[2,14],thetwosystemsusedrasticallydifferentadaptiverecompilationstrategies,whichmakeexperimen- existencebasedoptimizationsasOSRrestrictedtoOSRatmethodentry,whenreplacingastackframeistrivial. 7Conclusion Wehavepresentedarelativelysimpleon-stack-replacementmechanism,andpresentedacomprehensiveevaluationofon-lineprofile-directeddeferredcompilationandpromotionofunoptimizedcode.Theresultsshowthatinourimplementation,deferredcompilationprovidesmod-estgainsincodequality,compilerspeed,andcodespace.WeshowedhowtointegratetheOSR-basedtechniquesintoamodel-drivenadaptiveoptimizationsystem,andshowedthatthesystemrespondsbymoreaggressivecompilation,providingasmallperformanceimprovement. Shouldaproductionvirtualmachineimplementon-stackreplacement?Inourexperience,thispaper’sOSRmecha-nismenablesanimplementationwithminimaldisruptiontotherestofthecodebase.Theabilitytodebugopti-mizedcodeviade-optimizationmayjustifytheinvestment.Regardingperformanceimprovements,ourresultssofarshowamodestbutnotcompellingimprovementfromon-lineprofile-directeddeferredcompilation.Ourresultsindi-catethatclass-hierarchy-basedinliningcanbehandledef-fectivelywithacombinationofsimplertechniques;namely,pre-existence,codepatching,andstaticsplitting.Onthepositiveside,OSRprotectsagainstpathologicallybadper-formanceduetoasinglelong-runningloop;thispathologyoftenappearsinmicrobenchmarks. Webelievethatmoredramaticperformanceimprove-mentswillcomewithmoreadvancedprogramtransforma-tions.OSRprovidesaremarkablypowerfulinvalidationmechanism,whichtheoptimizercanrelyontoimplementmoreinvasivetransformationssuchasobjectinlining,ortodrawoptimisticconclusionswhereconservativeanalysisfails.WeplantomaketheOSRimplementationpubliclyavailableinaJikesRVMrelease,andhopethatitsavail-abilitywillspurfutureresearchinthisdirection. 8Acknowledgements TheauthorswouldliketothankDaveGrovefornumer-ousproductivediscussions,andtheanonymousreviewersforvaluablefeedback. References [1]M.Arnold,S.Fink,D.Grove,M.Hind,andP.Sweeney. AdaptiveoptimizationintheJalape˜noJVM:Thecontroller’sanalyticalmodel.In3rdACMWorkshoponFeedback-DirectedandDynamicOptimization(FDDO-3),Dec.2000. [2]M.Arnold,D.Grove,S.Fink,M.Hind,andP.Sweeney. AdaptiveoptimizationintheJalape˜noJVM.InProceed-ingsoftheACMSIGPLANConferenceonObject-OrientedProgrammingSystems,Languages,andApplications(OOP-SLA2000),Minneapolis,MN,Oct.2000.AlsopublishedasACMSIGPLANNotices,volume35,number10. [3]D.BrueningandE.Duesterwald.ExploringOptimalCom-pilationUnitShapesforanEmbeddedJust-In-TimeCom-piler.InProceedingsofthe2000ACMWorkshoponFeedback-DirectedandDynamicOptimizationFDDO-3,Dec2000. [4]C.Chambers.TheDesignandImplementationoftheSelf Compiler,anOptimizingCompilerforObject-OrientedPro-grammingLanguages.PhDthesis,StanfordUniversity,Mar.1992.PublishedastechnicalreportSTAN-CS-92-1420.[5]C.ChambersandD.Ungar.Makingpureobject-oriented languagespractical.InACMConferenceonObject-OrientedProgrammingSystems,Languages,andApplications,pages1–15,Nov.1991. [6]R.Cytron,J.Ferrante,B.K.Rosen,M.N.Wegman,and F.K.Zadeck.Anefficientmethodforcomputingstaticsingleassignmentformandthecontroldependencegraph.ACMTransactionsonProgrammingLanguagesandSys-tems,13(4):451–490,1991. [7]D.DetlefsandO.Agesen.Inliningofvirtualmethods. In13thEuropeanConferenceonObject-OrientedProgram-ming,June1999. [8]S.Fink,K.Knobe,andV.Sarkar.Unifiedanalysisofarray andobjectreferencesinstronglytypedlanguages.InSev-enthInternationalStaticAnalysisSymposium(2000),June2000.[9]U.H¨olzle,C.Chambers,andD.Ungar.Debuggingopti-mizedcodewithdynamicdeoptimization.ACMSIGPLAN Notices,27(7):32–43,July1992. [10]U.HolzleandD.Ungar.Athirdgenerationselfimplementa-tion:Reconcilingresponsivenesswithperformance.InACMConferenceonObject-OrientedProgrammingSystems,Lan-guages,andApplications,pages229–243,1994. [11]T.LindholmandF.Yellin.TheJavaVirtualMachineSpec-ificationSecondEdition.TheJavaSeries.Addison-Wesley,1999. [12]M.Paleczny,C.Vick,andC.Click.TheJavaHotspot(tm) ServerCompiler.InUSENIXJavaVirtualMachineRe-searchandTechnologySymposium,pages1–12,Apr2001.[13]T.Suganama,T.Ogasawara,M.Takeuchi,T.Yasue, M.Kawahito,K.Ishizaki,H.Komatsu,andT.Nakatani.OverviewoftheIBMJavaJust-in-Timecompiler.IBMSys-temsJournal,39(1),2000. [14]T.Suganuma,T.Yasue,M.Kawahito,H.Komatsu,and T.Nakatani.AdynamicoptimizationframeworkforaJavajust-in-timecompiler.InACMConferenceonObject-OrientedProgrammingSystems,Languages,andApplica-tions,pages180–195,Oct.2001. [15]TheStandardPerformanceEvaluationCorporation.SPEC JVM98Benchmarks.http://www.spec.org/osg/jvm98/,1998. [16]J.Whaley.Partialmethodcompilationusingdynamicprofile information.InACMConferenceonObject-OrientedPro-grammingSystems,Languages,andApplications,Oct2001. 因篇幅问题不能全部显示,请点此查看更多更全内容