您的当前位置：首页 Design, implementation and evaluation of adaptive recompilation with on-stack replacement

Design, implementation and evaluation of adaptive recompilation with on-stack replacement

来源：爱问旅游网

Design,ImplementationandEvaluationofAdaptiveRecompilationwith

On-StackReplacement

StephenJ.Fink

IBMT.J.WatsonResearchCenterFengQian

SchoolofComputerScience,McGillUniversity

YorktownHeights,NY10598sjfink@us.ibm.com

Abstract

Modernvirtualmachinesoftenmaintainmultiplecom-piledversionsofamethod.Anon-stackreplacement(OSR)mechanismenablesavirtualmachinetotransferexecutionbetweencompiledversions,evenwhileamethodruns.Re-lyingonthismechanism,thesystemcanexploitpowerfultechniquestoreducecompiletimeandcodespace,dynami-callyde-optimizecode,andinvalidatespeculativeoptimiza-tions.

Thispaperpresentsanew,simple,mostlycompiler-independentmechanismtotransferexecutionintocompiledcode.Additionally,wepresentenhancementstoananalyticmodelforrecompilationtoexploitOSRformoreaggres-siveoptimization.WehaveimplementedthesetechniquesinJikesRVMandpresentacomprehensiveevaluation,in-cludingastudyoffullyautomatic,online,proﬁle-drivende-ferredcompilation.

1Introduction

Modernvirtualmachines(VMs)oftenmaintainmultiplecompiledversionsofamethod.Forexample,state-of-the-artJavaVMsemployadaptivestrategiestoselectivelycom-pileandrecompilehotmethods[2,12,14,10].AVMcaneasilytransitiontoanewcompiledversionofamethodbymodifyingdispatchstructures,sothatfuturemethodinvo-cationsbranchtothenewcompiledversion.However,thetransitionforamethodthatiscurrentlyexecutingonsomethread’sstackpresentsaharderengineeringchallenge.Toperformthistransition,theSelfprogramminglan-guageimplementations[5,10]pioneeredon-stackreplace-ment(OSR)technology.OSRtechnologyenablesfunda-mentalVMenhancements,includingdebuggingoptimizedcodeviade-optimization[9],deferredcompilationtoim-

Montreal,Canada

fqian@sable.mcgill.ca

Instead,theprogramtrapsuponreachinganuncommonblock,andinvokesOSRtorecompileandtransitiontonewcode.Deferredcompilationbringstwopotentialbeneﬁts.First,theoptimizercanbeneﬁtfromimproveddataﬂowbyeliminatingcontrolﬂowmergesdownstreamfromuncom-monblocks.Second,thetechniquereducescompile-time.Recentresearchsuggestspotentialgainsusingproﬁle-directeddeferredcompilationforJava.Thepreviousstud-iesreliedonofﬂine[16,3]proﬁledataandofﬂinewhole-programbytecode-leveltransformations[16].Buildingonthiswork,thispaperpresentstheﬁrstpublishedevalu-ationofcompletelyautomatic,online,proﬁle-drivende-ferredcompilation.Weevaluatecompile-time,codesize,andcodequalityimprovementsatmultipleoptimizationlevels,consideringbothanideallimitcaseandacompleteimplementation.Wecomparedeferredcompilationbasedon“perfect”ofﬂineandonlineproﬁledata,andadditionallyevaluatetheimpactofdeferredcompilationonadaptivere-compilationactivitydrivenbyananalyticmodel.Themaincontributionsofthispaperare

anew,simple,relativelycompiler-independentmecha-nismtotransferanactivation’sexecutiontospecializedoptimizedcode,

enhancementstoananalyticmodeltodriverecompila-tiontoaccountfordeferredcompilationandon-the-ﬂyreoptimization,and

theﬁrst(toourknowledge)comprehensiveexperimen-talevaluationofautomatic,online,proﬁle-drivende-ferredcompilation.

Theexperimentalresultsshowthatintheidealcase,proﬁle-drivendeferredcompilationimprovescompile-timeby3-13%,dependingontheoptimizationlevel.Codesizereductionsrangefrom7-16%.ConstrainingoptimizationtoallowOSRinvalidationreducesthesebeneﬁtsslightly.Overall,theperformanceimpactofeliminatingdataﬂowmergesandconstrainingoptimizationappearssmall,rang-ingintheidealcasefrom0.5%to2.2%,dependingonopti-mizationlevel.UsingOSRtoenableclass-hierarchybasedunguardedinlininghasanegligiblebeneﬁtcomparedtothesystem’sdefaultstrategy.

Afullyonlineimplementationofdeferredcompila-tion,asexploitedbyamodel-drivenadaptiverecompila-tionsystem,improvesperformanceontheﬁrstrunoftheSPECjvm98benchmarksbyanaverageof2.6%,withim-provementonindividualbenchmarksrangingfrom+8%to-4%.Duringtheﬁrstrun,theindividualbenchmarkstriggerbetween0and10OSRevents.

Theremainderofthispaperproceedsasfollows.Sec-tion2reviewssomeJikesRVMbackgroundrelevanttothisstudy.Section3presentsanewOSRtransitionmecha-nismanddiscussesimplementationdetailsinoursystem.

Section4presentsapolicytoallowtheadaptiveoptimiza-tionanalyticmodeltoexploittheOSRmechanisminitsdecision-making.Section5presentstheexperimentaleval-uation.Finally,Section6reviewsrelatedworkandSec-tion7concludes.

2Background

WenowreviewsometechnicaldetailsoftheJikesRe-searchVirtualMachine(RVM)whichpertaintothisstudy.TheRVMprovidesanexecutionengineforJavabyte-codes,althoughnotacompleteJavaplatform.TheRVMhasacompile-onlyapproach;thesystemcompilesallbyte-codetonativecodebeforeexecution.Thesystemhastwocompilers,thebaselinecompilerandtheoptimizingcom-piler.Thebaselinecompilergeneratescodequickly,imple-mentingastraightforwardinterpretationoftheJVMstackarchitecture[11].Theresultantcodequality,naturally,isrelativelypoor.Theimplementationofthebaselinecom-pilerismostlyarchitecture-dependent,andthesystemcur-rentlysupportstwoarchitectures(IA32andPowerPC).Theoptimizingcompilerprovidesafullsuiteofop-timizations.Theimplementationismostlyarchitecture-independent.Forthisstudy,theoptimizingcompilerpro-videsthreelevelsofoptimization:

Level0consistsmainlyofasetofoptimizationsper-formedon-the-ﬂyduringthetranslationfrombyte-codestotheintermediaterepresentation.Currently,thecompilerperformsthefollowingoptimizationsdur-ingIRgeneration:constant,type,non-null,andcopypropagation;constantfoldingandarithmeticsimpliﬁ-cation;unreachablecodeelimination;eliminationofredundantnullchecks,checkcasts,arraystorechecks,andinliningoftinymethodsatmonomorphiccallsites.Level1augmentslevel0withadditionallocalopti-mizationssuchascommonsubexpressionelimination,arrayboundscheckelimination,andredundantloadelimination.Italsoaddsinliningbasedonstatic-sizeheuristics,1globalﬂow-insensitivecopyandconstantpropagation,globalﬂow-insensitivedeadassignmentelimination,synchronizationoptimizations,andscalarreplacementofaggregatesandshortarrays.

Level2augmentslevel1withSSA-basedﬂow-sensitiveoptimizations.InadditiontotraditionalSSAoptimizationsonscalarvariables[6],thesystemalsousesHeapArraySSAform[8]toperformredundant

loadelimination,globalcodemotion,andcommonsubexpressionelimination.

Theadaptiveoptimizationsystemdrivesrecompilationdecisionsbasedontheestimatedcostandbeneﬁtofcom-pilingeachmethod.Thesysteminitiallycompileseachmethodwiththebaselinecompiler,andastheprogramruns,mayrecompileeachmethodwiththeoptimizingcompiler.Atagivenpointduringprogramexecution,theadaptivesys-temcontrollerwillconsideramethodforrecompilationguidedbythefollowinganalyticmodel.

Supposemethodiscurrentlyoptimizedatoptimiza-tionlevel,,deﬁningbaseline-compiledcodetobeoptimizationlevel-1.Deﬁnethefollowing:

,theexpectedtimetheprogramwillspendexecutingmethodinthefuture,ifisnotrecompiled.

,thecostofrecompilingmethodatoptimization

level,for.

,theexpectedtimetheprogramwillspendexecutingmethodinthefuture,ifisrecompiledatlevel.

Usingtheseestimatedvalues,thecontrollerchoosestherecompilationlevelthatminimizestheexpectedfuturerunningtimeofarecompiledversionof;i.e.,itchoosesthethatminimizesthequantity.If,

thecontrollerdecidestorecompile

atlevel;otherwiseitdecidestonotrecompile.

Thecontrollerestimatesthequantitiesandbasedonofﬂineproﬁledata,andsimplemodelsofcompilerandprogrambehavior[2].

3OSRMechanism

TheOSRtransformationperformsthefollowingsteps:1:extractcompiler-independentstatefromasuspendedthread,

2:generatenewcodeforthesuspendedactivation,and3:transferexecutioninthesuspendedthreadtonewcom-piledcode.

Thissectionpresentsmechanismstoaccomplishthesesteps.

3.1JVMscopedescriptor

ForJava,wedeﬁnethe’compiler-independent’stateofarunningactivationtobeascopedescriptor[9]basedonthestack-basedJavaVirtualMachinearchitecture[11]:Deﬁnition1TheJVMscopedescriptorofamethodactiva-tionconsistsofthethreadrunningtheactivation,thepro-gramcounterasabytecodeindex,valuesofeachlocalvari-ableandstacklocation,andareferencetotheactivation’sstackframe.

Figure1(a)and(b)showasimpleexampleinJavasourceandbytecode,respectively.Figure1(c)showsaJVMscopedescriptorforarunningactivationmethodofmethodsum.ThisscopedescriptorcapturesaruntimesnapshotoftheJVMstatein“sum”beforebytecodeindex16afterloop-ing50times.

3.2Specializedcodegeneration

TogeneratetargetcodeforanOSRtransition,previoussystemssuchasHotspot[12]wouldcompileoneversionofthetargetmethodwithtwoormoreentrypoints,andusethisversionbothforOSRtransitionsandfuturemethodin-vocations.Wepresentasubstantiallydifferentmechanismtogeneratetargetcode.

RecallthatJikesRVMhasmultiplecompilerimplemen-tations;so,wesetouttoimplementOSRwithminimalchangestotheunderlyingcompilers.Tothisend,ourdesigncompilesaspecializedversionofthemethodforeachacti-vationthatisreplaced,aswellasanewversionforfutureinvocations.Ourdesignisrelativelycompiler-independent,asthetransitiontooptimizedcodeisexpressedinbytecodeandhandledwithnormalcompilermechanisms.

ThekeyinsightunderlyingtheJikesRVMmechanismisthatgiventheJVMscopedescriptorofamethod,wecanconstructamethod,inbytecode,thatsetsupthenewstackframeandcontinuesexecution,preservingsemantics.Todothis,weprependtotheoriginalbytecodesaspecializedprologuethata)savesvaluesintolocals,b)loadsvaluesonthestack,andthenc)jumpstothecurrentprogramcounter.Notethatweexpressthestackframesetupinthesourcelanguage(bytecode),andrelyonthetargetcompilertoim-plementthesetupprocedureasitseesﬁt.

Figure1(d)showsthespecializedbytecodegeneratedfortheJVMscopedescriptorfromFigure1(c).Byconstruc-tion,executingthespecializedmethodofFigure1(d)hasthesamesemanticsascontinuingexecutionoftheinter-ruptedactivationbasedonthecodein1(b).

WhenanOSRhappensinaninlinedcontext,were-coveraJVMscopedescriptorforeachvirtualinlinedstackframe.Wegenerateaspecializedversionofeachinlinedmethod,constructedsuchthatcallingthespecializedrootmethodoftheinlinedcontextrestoresallthereplacementstackframes.Theprologueofthespecializedrootrestorestherootmethod’sstate,thenimmediatelycallsaspecializedversionofthecallee.Whenthespecializedcalleereturns,therootmethodcontinuesexecutionimmediatelyafterthecall.Figure2showsonesimpleexamplewhereOSRhap-pensinthecalleebar.Thecallerfoorestoresitsexecutionstatebeforethecallsite,callsthespecialized“bar

classC{

staticintsum(intc){inty=0;

for(inti=0;ireturny;}}

(a)asimpleexample

0123478910111415161920iconst_0istore_1iconst_0istore_2goto14iload_1iload_2iadd

istore_1iinc21iload_2iload_0

if_icmplt7iload_1ireturn

(b)bytecodeforsum

ldc100istore_0ldc1225istore_1ldc50istore_2ldc50ldc100goto16

runningthread:MainThreadframepointer:0xSomeAddressprogramcounter:16

localvariables:L0(c)=100;L1(y)=1225;L2(i)=50;stackexpressions:S0=50;S1=100;

(c)JVMscopedescriptorforanactivationofsum

0iconst_0...

16if_icmplt7

...

20ireturn

(d)specializedversionofsumforOSRtransition

Figure1.ExampleofOSRtransitionmechanism

voidfoo(){

bar();A:...}

voidbar(){

...B:...}

(a)Sourcecode.Assumetheoptimizingcompilerin-linesbarintofooandthenforcesanOSRatprogrampointB.

foo_prime(){

callbar_prime()gotoA;...bar();A:...}

bar_prime(){

gotoB:...B:...}

(b)targetcodeforOSRtransition

Figure2.ExampleofOSRtransitionwithinlining.Callingfoo

IRmodelsthesemanticsofOSRPointsimilartoaCALLinstructionthatusesallthelivevariablesneededtore-coverJVMstate.LikeaCALLinstruction,anOSRPointwillthusconstrainsomeoptimizations,includingdeadcodeelimination,loadelimination,storeelimination,andcodemotion.UnlikeaCALLinstruction,anOSRpointtrans-ferscontroltotheexitblock,sothereisnomergebacktoreachablecodeafteranOSRPoint.

IfanOSRPointisrequiredinaninlinedmethodcontext,theinstructionkeepsaliveallstaterequiredtorecovertheentireinlinedcontext.

Afterregisterallocation,thecompilermapsthestateusedbyOSRPointinstructionstothecorrespondingphys-icalpositions(registernumberorspillingoffset),maintain-ingtypeinformation.Thesystemencodesthemapinata-ble.

ThecompilerexpandsOSRPointinstructionsintocallsintotheJikesRVMthreadscheduler.WhencalledfromanOSRPoint,thethreadsystemsuspendsthecallingthread.Aftercompilingcodeforthespecializedactivation,therun-timesystememitsmemorybarrierstoforceSMPmemoryconsistency,andthenwakesupthethreadandtransfersex-ecutiontothenewcompiledcode.

4OSRPolicy

4.1DeferredCompilationPolicy

TheSelf-91implementation[5]introduceddeferredcompilation(sometimesreferredtoasuncommonbranchextensionorpartialmethodcompilation)asatechniquetoreducecompiletimeandpotentiallyimprovecodequality.Withthistechnique,thecompileravoidsgeneratingcodethatitpredictswillrarelyexecute.Ifthepredictionfails,thecompilertraps,recompilesthemethod,andperformsOSRtotransfertothenewcode.

Previousworkhaspresentedseveralpoliciesforpredict-inguncommonbranchesfordeferredcompilation.InSelf,thecompilercouldoftenpredicttypeswithhighconﬁdence,andpredictbranchesthatsignaltheappearanceofothertypestobeuncommon.Forexample,thecompilercouldpredictthatavariableusedinanarraysubscriptexpressionhadconcretetypeinteger.Thistechniquewashighlyef-fectiveforSelf,sincethis“pureobject-oriented”languageemployspolymorphictypeseverywhere.

TheHotSpotservercompiler[12]similarlyusestypepredictionbasedonthecurrentclasshierarchytoinlinecur-rentlymonomorphiccallsites,andtrapsandOSRsshouldclassloadinginvalidatetheprediction.Thiscompileralsopredictsotherlanguagefeatures,suchasclassinitialization,tobeuncommonevents.

Whaley[16]studiedproﬁle-directeddeferredcompila-tion,predictinguncommonbranchesbasedonofﬂinepro-

ﬁledata.

WehaveimplementeddeferredcompilationcombiningsomeofthepoliciespresentedbyWhaleyandusedinHotSpot.Namely,wehaveimplementedanonlinever-sionofproﬁle-directeddeferredcompilation,aswellastheclass-hierarchybasedinliningtrapsusedinHotSpot.

4.2IntegrationintoAdaptiveOptimizationSys-tem

Asmentionedearlier,theJikesRVMincludesasophisti-catedadaptiveoptimizationsystemthatdriveson-linecom-pilation.Wemodiﬁedthecontrollerevaluationalgorithmtoaccountfordeferredcompilation.

AsdescribedinSection2,theadaptiveoptimizationsys-temrecompilesamethodatoptimizationlevelchosen

tominimize

.Deferredcompilationshouldaccel-eratecompilation,soadaptiverecompilationcanaffordtobemoreaggressive.Tothisend,wemodiﬁedtheanalyticmodeltoaccountforthereducedcompiletime.Whencon-sideringamethod

forrecompilation,thesystemcom-putesthepercentageofthesourcecode,,thatproﬁledataindicateswasdynamicallyreached.Sincenearlyalltheop-timizingcompilerphasesaredesignedtoruninlineartime,

weestimatethecostofoptimizingtobe

.Thus,theanalyticmodelaccountsforthebeneﬁtofdeferredcom-pilationonaper-methodbasis.

Toreduceoverhead,thesystemcomputesthefractiononlyonceforeachmethod,andcachesit.Thismightresultinthemodelusinganoutdatedformakingdecisions;wesuspectthiseffectissmallinpracticesincethemodelwillnotcomputeuntilthemethodhasrunforsometime.Inadditiontodeferredcompilation,wehaveimple-mentedpromotion,wherebythesystemmayoptimizealong-runningactivationrunningbaseline-compiledcode.Normally,theadaptiveoptimizationsystemconsidersop-timizingamethodbasedonthetotaltimespentinthatmethodinthepast.Toreconcilepromotionwiththispolicy,whenthesystemoptimizesamethod,itstartstomonitortimespentinoutdatedversionsofthecode(i.e.,baseline-compiledversions).Toconsiderpromotinganoutdatedac-tivation,thesystemusesthedefaultanalyticmodel,butes-timatesthefuturetime

basedonthetimespentintheoutdatedcode,ratherthanthetotaltimespentinallcom-piledversionsofthemethod.

5ExperimentalResults

WehaveimplementedthesetechniquesinJikesRVMv2.1.1.WereportexperimentalresultsontheSPECjvm98benchmarks[15].TheresultsreporteddonotconformtotheofﬁcialSPECrunrules,soourresultsdonotdirectlyorindirectlyrepresentaSPECmetric.Allrunsusethesize

100inputs,andruneachdistinctbenchmarkinanewVMinstance.

AllexperimentsranonadedicatedIBMRS/6000ModelF80withsix500MHzprocessorsand4GBofmainmem-ory,runningAIX4.3.3.JikesRVMwasrunwithaFas-tAdaptiveSemispaceconﬁguration,usingonevirtualpro-cessor,400MBofsmallobjectheap,and100MBoflargeobjectheap.

5.1Ofﬂinedata

First,weexamineexperimentalresultsbasedonofﬂine“perfectproﬁles”.Thatis,wegathertheproﬁledataof-ﬂine,andthenruntheprogramusingthesameinput.Theseexperimentsreveal,totheﬁrstorder,anupperboundongainspossiblewithon-linedeferredcompilation.Fortheseruns,weuseaconﬁgurationofJikesRVMthatloadsandprecompilesallclassesbeforetheﬁrstrunoftheprogram;thus,compilationactivitydoesnotfactorintothereportedperformanceresults.

Weconsiderdeferredcompilationusingoneorbothofthefollowingtwopolicies:

EdgecountersWedefercompilationofanybasicblocks

thattheproﬁledataindicateswereneverexecuteddur-ingtheproﬁlingrun.StaticheuristicsWedefercompilationofthefailedbranch

ofanyinlineguardatacallsitethatismonomorphicintheloadedclasshierarchy.Notethatusingtheperfectproﬁles,deferringcompila-tionissafeinthattheuncommoncodeneverexecutes.Weconsiderthefollowingfourvariantsofthesystem:IdealThisconﬁgurationdeferscompilationbasedonboth

edgecountersandstaticheuristics.Inplaceofuncom-moncode,thecompilerinsertsanunconditionaltrapinstructionthatdoesnotkeepanyprogramstatelive.Ideal-OSRThisconﬁgurationdeferscompilationbasedon

bothedgecountersandstaticheuristics.Inplaceofuncommoncode,thecompilerinsertsanOSRPointthatkeepsstatelivetoallowrecoveryofaJVMscopedescriptor.Static-OSRThisconﬁgurationdeferscompilationbased

solelyonstaticheuristics.Inplaceofuncommoncode,thecompilerinsertsanOSRPoint.EagerThisconﬁguration(thedefaultJikesRVM)doesnot

performdeferredcompilation.Intheseofﬂineexperiments,thedifferencebetweenIdealandIdeal-OSRisthatOSRpointsconstrainoptimiza-tion.Thus,thedifferencebetweentheseconﬁgurations

quantiﬁesthecostoftheseconstraints,comparedtoaper-fectoraclethatcouldpredictthefutureandthrowawayun-reachedcodeunconditionally.

ThedifferencebetweenStatic-OSRandIdeal-OSRquantiﬁesthebeneﬁtsofperformingdeferredcompilationbasedonproﬁledata,asopposedtosimplyusingOSRforclass-hierarchy-basedinlining.

Clearly,theimpactofdeferredcompilationoncompilerspeedorcodequalitydependshighlyontheunderlyingcompilerinfrastructure.Toexplorethisspace,wereportstatisticsforeachoftheoptimizingcompiler’sthreeopti-mizationlevels.

Figure3showsthecompilerspeed,inbytecodebytespermillisecond,ofeachsystemconﬁguration.Theresultsshowanaverageupperbound(Ideal)improvementincompile-time,comparedtoeagercodegeneration,of3%,13%,and11%atoptimizationlevelsO0,O1,andO2,respectively.Whenconstrainingoptimizationtoaccountforpotentialin-validation(Ideal-OSR),theimprovementsdropto3%,7%,and10%,respectively.So,constrainingoptimizationhasasigniﬁcanteffectoncompile-timeimprovements,espe-ciallyathigheroptimizationlevels.Acrosstheboard,main-tainingOSRpointsonlyatstatically-predictedinlineguardfailuresactuallyslightlydecreasescompilerspeed,ascom-paredtosimplygeneratingtheeliminatedcallinstruction.WeconcludethatmaintaininganOSRpointintheIRen-tailsslightlymoreworkthanmaintainingacallinstruction.Figure4showsthesizeoftheresultantmachinecode,inmachinecodebytesperbytecodebyte.Theresultsshowanaverageupperbound(Ideal)improvementingeneratedcodesize,comparedtoeagercodegeneration,of7%,16%,and13%atoptimizationlevelsO0,O1,andO2,respectively.WhenconstrainingoptimizationwithOSRPoints,(Ideal-OSR),theimprovementsdropto7%,13%,and9%,respec-tively.MaintainingOSRpointsonlyforinlineguardshasanegligibleeffectoncodesize;asexpected,sincethede-ferredcompilationelidesonlyasinglecallinstructionanditsassociatedsetup.

Figure5showsthespeed,relativetotheeagerconﬁg-uration,oftheresultantcodeforeachconﬁguration.ThedifferencebetweentheIdealandEager(baseline)conﬁg-urationshowsthemaximumspeedupthecompilercouldhopetoobtainbythistechnique.Theresultsshowthatonaverage,idealdeferredcompilationcouldimprovegen-eratedcodequalityby0.5%,1%,and2.2%atoptimiza-tionlevelsO0,O1,andO2,respectively.Naturally,im-proveddataﬂowhelpsmoreathigheroptimizationlevels.Thejessandmtrtbenchmarksseethelargestpotentialgain,eachroughly5%.Nevertheless,overallthesenum-bersshowonlyamodestpotentialforgainduetoimproveddataﬂow.

TheIdeal-OSRconﬁguration,whichkeepsOSRdatalive,onaveragedoesnotchangeperformance.Thejess

10IdealIdeal-OSRStatic-OSR5Eager0compressjessdbjavacmpegaudiomtrtjackG. Meana)64IdealIdeal-OSRStatic-OSREager20compressjessdbjavacmpegaudiomtrtjackG. Mean4b)3Ideal2Ideal-OSRStatic-OSREager10compressjessdbjavacmpegaudiomtrtjackG. Meanc)

Figure3.Compilerspeedusingofﬂine“per-fect”proﬁledataatoptimizationlevela)O0,b)O1,andc)O2.

1086IdealIdeal-OSRStatic-OSR4Eager20compressjessdbjavacmpegaudiomtrtjackG. Meana)10IdealIdeal-OSRStatic-OSR5Eager0compressjessdbjavacmpegaudiomtrtjackG. Meanb)10IdealIdeal-OSRStatic-OSR5Eager0compressjessdbjavacmpegaudiomtrtjackG. Meanc)

Figure4.Finaloptimizedmachinecodesizeusingofﬂine“perfect”proﬁledataatopti-mizationlevela)O0,b)O1,andc)O2.

Machine Code Size (mcb/bcb)Compilation Rate (bcb/ms)Machine Code Size (mcb/bcb)Compilation Rate (bcb/ms)Machine Code Size (mcb/bcb)Compilation Rate (bcb/ms)1.10noitarugifn1.05oc regIdealaEIdeal-OSR otStatic-OSR evita1.00lercompressjessdbjavacmpegaudiomtrtjackG. Mean deepS0.951.10a)noitarugifn1.05oc regIdealaEIdeal-OSR otStatic-OSR evita1.00lercompressjessdbjavacmpegaudiomtrtjackG. Mean deepS0.951.10b)noitarugifn1.05oc regIdealaEIdeal-OSR otStatic-OSR evita1.00lercompressjessdbjavacmpegaudiomtrtjackG. Mean deepS0.95c)

Figure5.Speedupobtained(biggerisbetter)comparedtothedefault(eager)codegener-ationstrategy,whenusingdeferredcompila-tionstrategiesatoptimizationlevela)O0,b)O1,andc)O2.Theseresultsuseofﬂine“per-fect”proﬁledata.

benchmarkseesthelargestimprovement,of4%.The5%idealmtrtgainvanishes,insteadproducinga1.5%degra-dation.jackseesasubstantialdegradation,of4%.Over-all,thenumbersindicatethatthecostofconstrainingopti-mizationwithOSRPointsroughlynegatesthesmallbene-ﬁtduetoimproveddataﬂow,withvariancesonindividualbenchmarks.

ThecompilershouldbeabletoapproachtheIdealper-formancebysinkingcodeintouncommonsectionstoma-terializestatethatisotherwisedead.Whaley[16]im-plementedaspecializedsinkingtransformationtoaccom-plishthis.WearecurrentlyworkingonimprovementstotheRVMoptimizer’scodemotion,tomoreeffectivelysinkcomputationintouncommonbranchextensions.

PlacingOSRpointsusingthestaticstrategyhasanegli-gibleimpactonperformance.ThisindicatesthattheJikesRVMpatch-pointbasedguardedvirtualinlining[13]andstaticsplittingheuristics[4]handleclass-hierarchybasedinliningeffectively.

5.2Onlinedata

Now,weexaminetheperformanceofdeferredcompila-tioninafullyautomaticon-lineadaptiveoptimizationsys-tem.Fortheseexperiments,thebaselinecompilerinsertsin-strumentationintoeachmethodtogatheredgecounterdata.Thatis,thesysteminsertscountersthatcountthenumberoftimeseachbranchistakenandnottaken.Whenoptimizingamethod,theoptimizingcompilerpropagatestheinforma-tionandinsertsanOSRPointatthebeginningofeachbasicblockthatthecountersindicatewasneverreachedwhiletheunoptimizedcoderan.

Naturally,aftercodeisoptimized,itmaytraverseanewpathandexecuteablockthatwasnotpreviouslyexecuted.Inthiscase,thesystemperformsOSRtogenerateuncom-moncodeasneeded.WhenperformingOSR,thesystemimmediatelyinvalidatestheoptimizedcodewhichtriggeredthedeferredcompilation.Whentheprogramnextenterstheinvalidatedmethod,thesystemre-optimizesthemethodwithoutusingdeferredcompilation.

Wereportperformancebasedonwall-clocktime,whichincludesallonlineproﬁling,decisionmaking,recompila-tion,invalidation,andOSRtransitionactivity.

Figure6ashowstheperformanceduringtheﬁrstrunofeachoftheSPECjvm98benchmarks.Theresultsshowthatonaverage,theOSRtechniquesimproveperformanceoftheﬁrstrunofaSPECjvm98benchmarkby2.6%.UsingonlystaticplacementofOSRpointsforinliningimprovesperformanceby1.8%.Thecompressbenchmarkshowsasigniﬁcantimprovement(around8%);onlythisbenchmarkrunslong-runningloopswhichtriggerOSRpromotionfrombaselinetooptimizedcode.Thedbandjavacbench-marksshowimprovementsof8%and4%respectively,pre-

1.15noitarug1.10ifnoc regaE o1.05OSR-edge countstOSR-static evitaler pu1.00decompressjessdbjavacmpegaudiomtrtjackG. MeanepS0.951.15a)noitarug1.10ifnoc regaE o1.05OSR-edge countstOSR-static evitaler pu1.00decompressjessdbjavacmpegaudiomtrtjackG. MeanepS0.95b)

Figure6.Speedupobtained(biggerisbetter)comparedtothedefault(eager)codegenera-tionstrategy,whenusingmulti-leveladaptiverecompilation:a)ﬁrstrun.b)bestrunof10.

sumablyduetomoreaggressivecompilation.Themtrtperformancedegradesby2-4%.Thisbenchmarkcontainsafewhotmethodsthatdominateperformanceafterashortstart-upphase.ItappearsthatwithOSR,theadaptivesys-temismoreaggressivecompilingmethodsduringthestart-upphase.Thisdelaysthetimetoreachthecomputationallyintensivemethods,anddelaysoptimizationofthesemeth-ods,hurtingperformance.Itremainsanopenissueonhowtodesignanadaptiverecompilationstrategythatavoidsthispathologicalbehavior.

Figure6bshowstheperformanceonthebestof10runsofeachbenchmark,usingtheOSRtechniques.Theoverallperformance,comparedtoeagercompilation,isvirtuallyunchanged,althoughindividualbenchmarksvaryslightly.Thisisconsistentwiththedatafromtheofﬂinestudy.

Table1showsthecompilationactivityduringtheﬁrstrunofeachbenchmark,withandwithoutOSRtechniques.Theﬁrsttwocolumnsreportsthenumberofon-stackre-placementeventsintheﬁrstrunofeachbenchmark.Onlycompresstriggerspromotionevents.SixofthesevenbenchmarkstriggerOSRinvalidationsduetodeferredcom-pilation.Alltheseinvalidationsresultfromincompletecodecoverageproﬁledata;nonerepresentfailedinliningsinval-idatedbydynamicclassloading.

Thus,inallcases,bythetimetheadaptivesystemopti-mizedamethod,ithadloadedallclassesthataffectinlineguards.However,inafewcases,theon-lineproﬁledata

didnotcoveralldynamicallyreachedinstructions,causinglaterinvalidations.Weveriﬁedthateachinvalidationtakesatmostontheorderof10microseconds;thuscausingasmalldirectperformancehit.

Theremainingcolumnsofthetableshowthenumberofmethodstheadaptivecontrollerchosetocompileateachoptimizationlevel.RecallthatusingOSR,thecontrollermodelshouldcompilemoreaggressivelysinceitestimatesareductionincompile-time.Ingeneral,thedatabearsthisout,asoverallthesystemcompilesmoremethodsusingtheOSRheuristics.Thechangeincompileractivityisnotdra-matic(4%moremethodscompiled),corroboratingearlierexperiencesthattheanalyticrecompilationmodelisrela-tivelyinsensitivetovariancesincompilerperformance[1].Thetableshowsaninterestingcounter-intuitivetrendinthatwithOSR,fewermethodsrisetothehighestoptimiza-tionlevel.Wesuspectthatthecontrollercost-beneﬁtmodelassignshigherprioritytotheﬁrst(low-level)optimizationofamethod,sincethisoptimizationgivesthegreatestab-solutegain.ItseemsthatthisphenomenonisexaggeratedbythemodiﬁedOSRcontrollermodel,causingmore“low-level”optimizationandindirectlydelaying“high-level”op-timization.

Wenotethatonshorttime-scales,thebehavioroftheadaptivesystemdependsonacomplexandnon-linearfeed-backloop,varyingsubstantiallybasedoncompetingesti-matesofcostsandbeneﬁtsandsampledata.Althoughthebehaviorappearstoberelativelystableonaggregate,itre-mainsanopenproblemtobetterunderstandhowchangestocompilerperformanceandcodequalitytranslateintoover-allsystemperformance.

6RelatedWork

On-stackreplacementtechniquesoriginatedinSelf-91’sdeferredcompilation[5].TheSelf-91compilerdefersgen-eratingcodeforthetargetsofuncommonbranches,predict-ingthesebranchesbasedontypesappearinginnormalcode.Inthecasethatanuncommonbranchtargetexecutes,thesystemgeneratescodeondemandusinguncommonbranchextension.TheoriginalmotivationfordeferredcompilationinSelf-91wastoreducecompiletimeandavoidconser-vativedataﬂowmerges.Later,Self-93[10]usedOSRtoimproveperformancewithadaptiverecompilation,transfer-ringexecutionfromslowcodetofastercodeon-the-ﬂy.Selfalsopioneeredtheconceptofusingde-optimizationtodebugoptimizedcode[9],whichremainsoneofthemostcompellingapplicationsofOSR.Wehavenotyetimple-mentedthisapplication,astheJikesRVMdebuggerframe-workisstillimmatureandunstable.

OfcurrentproductionJavaVirtualMachines,onlytheHotSpotservercompiler[12]hasreportedOSRtechnol-

Benchmarkcompressjessdbjavac

mpegaudiomtrtjackTotal

WithOSR

InvalidationsO0

617049181017116855715924429

212273825

WithoutOSRO1917416291126112

total2860171871017585553

Table1.CompilationactivitybytheJikesRVMAdaptiveOptimizationSystemduringtheﬁrstrunofeachSPECjvm98benchmark,measuredinmethodscompiled.Promotionsrepresenttransitionsfrombaselinetooptimizedcodeforalong-runningactivation.InvalidationsrepresentOSRtransitionsduetoincorrectlypredicteddeferredcompilation.Theremainingcolumnsreportthenumberofmethodsoptimizedateachoptimizationlevel.

ogy.2HotSpottransitionsfrominterpretedcodetoopti-mizedcodeforlong-runningloops,basedoninvocationandloopcounters.Additionally,theHotSpotcompilerusesde-ferredcompilationtoavoidgeneratingcodeforuncommonbranches,suchasthefailedbranchofclass-hierarchy-basedguardedinlining,andprogrampointsthatinvokeclassini-tializers.Toourknowledge,HotSpotdoesnotperformproﬁle-drivendeferredcompilationexcepttoinlinecallsitesthatproﬁledataindicatesaredynamicallymonomor-phic.

Whaley’s[16]partialmethodcompilationessentiallyap-pliesSelf-91’suncommonbranchextensionto“rare”blocksasdeterminedbyheuristicsonproﬁledata.Thepresentedresultsfocusmainlyoncompile-time;thereweresmallim-provementsinrunningtime,attributedtobetterdataﬂow.Themethodologyofourdeferredcompilationstudydif-fersfromWhaley’sinmanydimensions.Inparticular,fortheexperimentalresults,Whaleyimplemented“methodoutlining”atthebytecodelevel,ratherthanengineeringOSRintotheoptimizingJIT.Thishasimplicationsonbothcorrectness(aswediscussedforpseudo-bytecodes),andonthequalityofthegeneratedcode.Furthermore,sinceWha-leyusedtheIBMDKasa“blackbox”torunthetrans-formedcode,itisnotclearhowtheunderlyingJITin-linesandoptimizesthegeneratedsourcecode.Incontrast,ourexperimentalresultsbreakdowntheimpactofOSRpointsoncodequalityduetodataﬂowimprovementsandoptimizationconstraints.Finally,asdescribedinthelit-erature[2,14],thetwosystemsusedrasticallydifferentadaptiverecompilationstrategies,whichmakeexperimen-

existencebasedoptimizationsasOSRrestrictedtoOSRatmethodentry,whenreplacingastackframeistrivial.

7Conclusion

Wehavepresentedarelativelysimpleon-stack-replacementmechanism,andpresentedacomprehensiveevaluationofon-lineproﬁle-directeddeferredcompilationandpromotionofunoptimizedcode.Theresultsshowthatinourimplementation,deferredcompilationprovidesmod-estgainsincodequality,compilerspeed,andcodespace.WeshowedhowtointegratetheOSR-basedtechniquesintoamodel-drivenadaptiveoptimizationsystem,andshowedthatthesystemrespondsbymoreaggressivecompilation,providingasmallperformanceimprovement.

Shouldaproductionvirtualmachineimplementon-stackreplacement?Inourexperience,thispaper’sOSRmecha-nismenablesanimplementationwithminimaldisruptiontotherestofthecodebase.Theabilitytodebugopti-mizedcodeviade-optimizationmayjustifytheinvestment.Regardingperformanceimprovements,ourresultssofarshowamodestbutnotcompellingimprovementfromon-lineproﬁle-directeddeferredcompilation.Ourresultsindi-catethatclass-hierarchy-basedinliningcanbehandledef-fectivelywithacombinationofsimplertechniques;namely,pre-existence,codepatching,andstaticsplitting.Onthepositiveside,OSRprotectsagainstpathologicallybadper-formanceduetoasinglelong-runningloop;thispathologyoftenappearsinmicrobenchmarks.

Webelievethatmoredramaticperformanceimprove-mentswillcomewithmoreadvancedprogramtransforma-tions.OSRprovidesaremarkablypowerfulinvalidationmechanism,whichtheoptimizercanrelyontoimplementmoreinvasivetransformationssuchasobjectinlining,ortodrawoptimisticconclusionswhereconservativeanalysisfails.WeplantomaketheOSRimplementationpubliclyavailableinaJikesRVMrelease,andhopethatitsavail-abilitywillspurfutureresearchinthisdirection.

8Acknowledgements

TheauthorswouldliketothankDaveGrovefornumer-ousproductivediscussions,andtheanonymousreviewersforvaluablefeedback.

References

[1]M.Arnold,S.Fink,D.Grove,M.Hind,andP.Sweeney.

AdaptiveoptimizationintheJalape˜noJVM:Thecontroller’sanalyticalmodel.In3rdACMWorkshoponFeedback-DirectedandDynamicOptimization(FDDO-3),Dec.2000.

[2]M.Arnold,D.Grove,S.Fink,M.Hind,andP.Sweeney.

AdaptiveoptimizationintheJalape˜noJVM.InProceed-ingsoftheACMSIGPLANConferenceonObject-OrientedProgrammingSystems,Languages,andApplications(OOP-SLA2000),Minneapolis,MN,Oct.2000.AlsopublishedasACMSIGPLANNotices,volume35,number10.

[3]D.BrueningandE.Duesterwald.ExploringOptimalCom-pilationUnitShapesforanEmbeddedJust-In-TimeCom-piler.InProceedingsofthe2000ACMWorkshoponFeedback-DirectedandDynamicOptimizationFDDO-3,Dec2000.

[4]C.Chambers.TheDesignandImplementationoftheSelf

Compiler,anOptimizingCompilerforObject-OrientedPro-grammingLanguages.PhDthesis,StanfordUniversity,Mar.1992.PublishedastechnicalreportSTAN-CS-92-1420.[5]C.ChambersandD.Ungar.Makingpureobject-oriented

languagespractical.InACMConferenceonObject-OrientedProgrammingSystems,Languages,andApplications,pages1–15,Nov.1991.

[6]R.Cytron,J.Ferrante,B.K.Rosen,M.N.Wegman,and

F.K.Zadeck.Anefﬁcientmethodforcomputingstaticsingleassignmentformandthecontroldependencegraph.ACMTransactionsonProgrammingLanguagesandSys-tems,13(4):451–490,1991.

[7]D.DetlefsandO.Agesen.Inliningofvirtualmethods.

In13thEuropeanConferenceonObject-OrientedProgram-ming,June1999.

[8]S.Fink,K.Knobe,andV.Sarkar.Uniﬁedanalysisofarray

andobjectreferencesinstronglytypedlanguages.InSev-enthInternationalStaticAnalysisSymposium(2000),June2000.[9]U.H¨olzle,C.Chambers,andD.Ungar.Debuggingopti-mizedcodewithdynamicdeoptimization.ACMSIGPLAN

Notices,27(7):32–43,July1992.

[10]U.HolzleandD.Ungar.Athirdgenerationselfimplementa-tion:Reconcilingresponsivenesswithperformance.InACMConferenceonObject-OrientedProgrammingSystems,Lan-guages,andApplications,pages229–243,1994.

[11]T.LindholmandF.Yellin.TheJavaVirtualMachineSpec-iﬁcationSecondEdition.TheJavaSeries.Addison-Wesley,1999.

[12]M.Paleczny,C.Vick,andC.Click.TheJavaHotspot(tm)

ServerCompiler.InUSENIXJavaVirtualMachineRe-searchandTechnologySymposium,pages1–12,Apr2001.[13]T.Suganama,T.Ogasawara,M.Takeuchi,T.Yasue,

M.Kawahito,K.Ishizaki,H.Komatsu,andT.Nakatani.OverviewoftheIBMJavaJust-in-Timecompiler.IBMSys-temsJournal,39(1),2000.

[14]T.Suganuma,T.Yasue,M.Kawahito,H.Komatsu,and

T.Nakatani.AdynamicoptimizationframeworkforaJavajust-in-timecompiler.InACMConferenceonObject-OrientedProgrammingSystems,Languages,andApplica-tions,pages180–195,Oct.2001.

[15]TheStandardPerformanceEvaluationCorporation.SPEC

JVM98Benchmarks.http://www.spec.org/osg/jvm98/,1998.

[16]J.Whaley.Partialmethodcompilationusingdynamicproﬁle

information.InACMConferenceonObject-OrientedPro-grammingSystems,Languages,andApplications,Oct2001.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文