Design, Implementation and Evaluation of Adaptive Recompilation with On-Stack Replacement

Stephen J. Fink
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
sjfink@us.ibm.com

Feng Qian
School of Computer Science, McGill University
Montreal, Canada
fqian@sable.mcgill.ca

Abstract

Modern virtual machines often maintain multiple compiled versions of a method. An on-stack replacement (OSR) mechanism enables a virtual machine to transfer execution between compiled versions, even while a method runs. Relying on this mechanism, the system can exploit powerful techniques to reduce compile time and code space, dynamically de-optimize code, and invalidate speculative optimizations.

This paper presents a new, simple, mostly compiler-independent mechanism to transfer execution into compiled code. Additionally, we present enhancements to an analytic model for recompilation to exploit OSR for more aggressive optimization. We have implemented these techniques in Jikes RVM and present a comprehensive evaluation, including a study of fully automatic, online, profile-driven deferred compilation.

1 Introduction

Modern virtual machines (VMs) often maintain multiple compiled versions of a method. For example, state-of-the-art Java VMs employ adaptive strategies to selectively compile and recompile hot methods [2, 12, 14, 10]. A VM can easily transition to a new compiled version of a method by modifying dispatch structures, so that future method invocations branch to the new compiled version. However, the transition for a method that is currently executing on some thread's stack presents a harder engineering challenge. To perform this transition, the Self programming language implementations [5, 10] pioneered on-stack replacement (OSR) technology. OSR technology enables fundamental VM enhancements, including debugging optimized code via de-optimization [9], deferred compilation to improve compile time and code quality, promotion of long-running activations to optimized code, and invalidation of speculative optimizations. With deferred compilation, the compiler avoids generating code for blocks it predicts will rarely execute.


Instead, the program traps upon reaching an uncommon block, and invokes OSR to recompile and transition to new code. Deferred compilation brings two potential benefits. First, the optimizer can benefit from improved dataflow by eliminating control flow merges downstream from uncommon blocks. Second, the technique reduces compile time. Recent research suggests potential gains using profile-directed deferred compilation for Java. The previous studies relied on offline [16, 3] profile data and offline whole-program bytecode-level transformations [16]. Building on this work, this paper presents the first published evaluation of completely automatic, online, profile-driven deferred compilation. We evaluate compile-time, code size, and code quality improvements at multiple optimization levels, considering both an ideal limit case and a complete implementation. We compare deferred compilation based on "perfect" offline and online profile data, and additionally evaluate the impact of deferred compilation on adaptive recompilation activity driven by an analytic model. The main contributions of this paper are:

- a new, simple, relatively compiler-independent mechanism to transfer an activation's execution to specialized optimized code,

- enhancements to an analytic model to drive recompilation to account for deferred compilation and on-the-fly reoptimization, and

- the first (to our knowledge) comprehensive experimental evaluation of automatic, online, profile-driven deferred compilation.

The experimental results show that in the ideal case, profile-driven deferred compilation improves compile time by 3-13%, depending on the optimization level. Code size reductions range from 7-16%. Constraining optimization to allow OSR invalidation reduces these benefits slightly. Overall, the performance impact of eliminating dataflow merges and constraining optimization appears small, ranging in the ideal case from 0.5% to 2.2%, depending on optimization level. Using OSR to enable class-hierarchy-based unguarded inlining has a negligible benefit compared to the system's default strategy.

A fully online implementation of deferred compilation, as exploited by a model-driven adaptive recompilation system, improves performance on the first run of the SPECjvm98 benchmarks by an average of 2.6%, with improvement on individual benchmarks ranging from +8% to -4%. During the first run, the individual benchmarks trigger between 0 and 10 OSR events.

The remainder of this paper proceeds as follows. Section 2 reviews some Jikes RVM background relevant to this study. Section 3 presents a new OSR transition mechanism and discusses implementation details in our system.

Section 4 presents a policy to allow the adaptive optimization analytic model to exploit the OSR mechanism in its decision-making. Section 5 presents the experimental evaluation. Finally, Section 6 reviews related work and Section 7 concludes.

2 Background

We now review some technical details of the Jikes Research Virtual Machine (RVM) which pertain to this study. The RVM provides an execution engine for Java bytecodes, although not a complete Java platform. The RVM has a compile-only approach; the system compiles all bytecode to native code before execution. The system has two compilers, the baseline compiler and the optimizing compiler. The baseline compiler generates code quickly, implementing a straightforward interpretation of the JVM stack architecture [11]. The resultant code quality, naturally, is relatively poor. The implementation of the baseline compiler is mostly architecture-dependent, and the system currently supports two architectures (IA32 and PowerPC). The optimizing compiler provides a full suite of optimizations. The implementation is mostly architecture-independent. For this study, the optimizing compiler provides three levels of optimization:

Level 0 consists mainly of a set of optimizations performed on-the-fly during the translation from bytecodes to the intermediate representation. Currently, the compiler performs the following optimizations during IR generation: constant, type, non-null, and copy propagation; constant folding and arithmetic simplification; unreachable code elimination; elimination of redundant null checks, checkcasts, and array store checks; and inlining of tiny methods at monomorphic call sites.

Level 1 augments level 0 with additional local optimizations such as common subexpression elimination, array bounds check elimination, and redundant load elimination. It also adds inlining based on static-size heuristics, global flow-insensitive copy and constant propagation, global flow-insensitive dead assignment elimination, synchronization optimizations, and scalar replacement of aggregates and short arrays.

Level 2 augments level 1 with SSA-based flow-sensitive optimizations. In addition to traditional SSA optimizations on scalar variables [6], the system also uses Heap Array SSA form [8] to perform redundant load elimination, global code motion, and common subexpression elimination.

The adaptive optimization system drives recompilation decisions based on the estimated cost and benefit of compiling each method. The system initially compiles each method with the baseline compiler, and as the program runs, may recompile each method with the optimizing compiler. At a given point during program execution, the adaptive system controller will consider a method for recompilation guided by the following analytic model.

Suppose method $m$ is currently optimized at optimization level $i$, defining baseline-compiled code to be optimization level $-1$. Define the following:

- $T_i$, the expected time the program will spend executing method $m$ in the future, if $m$ is not recompiled.

- $C_j$, the cost of recompiling method $m$ at optimization level $j$, for each candidate level $j$.

- $T_j$, the expected time the program will spend executing method $m$ in the future, if $m$ is recompiled at level $j$.

Using these estimated values, the controller chooses the recompilation level $j$ that minimizes the expected future running time of a recompiled version of $m$; i.e., it chooses the $j$ that minimizes the quantity $C_j + T_j$. If $C_j + T_j < T_i$, the controller decides to recompile $m$ at level $j$; otherwise it decides to not recompile.

The controller estimates the quantities $T_i$, $C_j$, and $T_j$ based on offline profile data, and simple models of compiler and program behavior [2].
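To make the cost-benefit test concrete, the sketch below (our illustration, not the Jikes RVM source; all names are hypothetical, and it assumes candidate levels strictly above the current one) applies the model above: it returns the level j minimizing C_j + T_j, or -1 when no candidate level beats the current expected future time T_i.

// Hypothetical illustration of the controller's cost-benefit test described above.
final class RecompilationModel {
    /**
     * @param compileCost       estimated C_j for each optimization level j
     * @param futureTimeIfOpt   estimated T_j for each optimization level j
     * @param futureTimeCurrent estimated T_i for the current level i
     * @param currentLevel      the level i at which the method is currently compiled
     * @return the chosen recompilation level, or -1 to keep the current code
     */
    static int chooseRecompilationLevel(double[] compileCost, double[] futureTimeIfOpt,
                                        double futureTimeCurrent, int currentLevel) {
        int chosen = -1;                  // -1 means "do not recompile"
        double best = futureTimeCurrent;  // T_i: expected cost of doing nothing
        for (int j = currentLevel + 1; j < compileCost.length; j++) {
            double total = compileCost[j] + futureTimeIfOpt[j];  // C_j + T_j
            if (total < best) {
                best = total;
                chosen = j;
            }
        }
        return chosen;
    }
}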

3 OSR Mechanism

The OSR transformation performs the following steps:

1. extract compiler-independent state from a suspended thread,

2. generate new code for the suspended activation, and

3. transfer execution in the suspended thread to new compiled code.

This section presents mechanisms to accomplish these steps.

3.1 JVM scope descriptor

For Java, we define the 'compiler-independent' state of a running activation to be a scope descriptor [9] based on the stack-based Java Virtual Machine architecture [11]:

Definition 1: The JVM scope descriptor of a method activation consists of the thread running the activation, the program counter as a bytecode index, values of each local variable and stack location, and a reference to the activation's stack frame.

Figure 1(a) and (b) show a simple example in Java source and bytecode, respectively. Figure 1(c) shows a JVM scope descriptor for a running activation of method sum. This scope descriptor captures a runtime snapshot of the JVM state in sum before bytecode index 16, after looping 50 times.
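For illustration only (the field names are ours, not Jikes RVM class names), the state named in Definition 1 can be pictured as a small record:

// A minimal sketch of the compiler-independent state in Definition 1.
final class JVMScopeDescriptor {
    final Thread thread;        // thread running the activation
    final long framePointer;    // reference to the activation's stack frame
    final int bytecodeIndex;    // program counter as a bytecode index
    final Object[] locals;      // current values of the local variable slots
    final Object[] stack;       // current values of the operand stack slots

    JVMScopeDescriptor(Thread thread, long framePointer, int bytecodeIndex,
                       Object[] locals, Object[] stack) {
        this.thread = thread;
        this.framePointer = framePointer;
        this.bytecodeIndex = bytecodeIndex;
        this.locals = locals;
        this.stack = stack;
    }
}

Under this sketch, the descriptor of Figure 1(c) would carry bytecodeIndex 16, locals {100, 1225, 50}, and stack {50, 100}.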

3.2 Specialized code generation

To generate target code for an OSR transition, previous systems such as HotSpot [12] would compile one version of the target method with two or more entry points, and use this version both for OSR transitions and future method invocations. We present a substantially different mechanism to generate target code.

Recall that Jikes RVM has multiple compiler implementations; so, we set out to implement OSR with minimal changes to the underlying compilers. To this end, our design compiles a specialized version of the method for each activation that is replaced, as well as a new version for future invocations. Our design is relatively compiler-independent, as the transition to optimized code is expressed in bytecode and handled with normal compiler mechanisms.

The key insight underlying the Jikes RVM mechanism is that given the JVM scope descriptor of a method, we can construct a method, in bytecode, that sets up the new stack frame and continues execution, preserving semantics. To do this, we prepend to the original bytecodes a specialized prologue that a) saves values into locals, b) loads values on the stack, and then c) jumps to the current program counter. Note that we express the stack frame setup in the source language (bytecode), and rely on the target compiler to implement the setup procedure as it sees fit.

Figure 1(d) shows the specialized bytecode generated for the JVM scope descriptor from Figure 1(c). By construction, executing the specialized method of Figure 1(d) has the same semantics as continuing execution of the interrupted activation based on the code in 1(b).
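The following sketch (reusing the JVMScopeDescriptor sketch above; the emitter interface is hypothetical, since the real system writes Java bytecode directly) shows the shape of the prologue construction just described:

// Hypothetical bytecode-emission interface, used only for this sketch.
interface BytecodeEmitter {
    void emitLdc(Object constant);     // push a constant
    void emitStore(int localSlot);     // store top of stack into a local variable slot
    void emitGoto(int bytecodeIndex);  // jump to a bytecode index
}

final class SpecializedPrologue {
    // (a) restore the locals, (b) restore the operand stack bottom-to-top,
    // (c) jump to the interrupted program counter; the original bytecodes follow.
    static void emit(JVMScopeDescriptor d, BytecodeEmitter out) {
        for (int slot = 0; slot < d.locals.length; slot++) {
            out.emitLdc(d.locals[slot]);
            out.emitStore(slot);
        }
        for (Object value : d.stack) {
            out.emitLdc(value);
        }
        out.emitGoto(d.bytecodeIndex);
    }
}

Applied to the descriptor of Figure 1(c), this yields exactly the prologue of Figure 1(d): ldc 100 / istore_0, ldc 1225 / istore_1, ldc 50 / istore_2, ldc 50, ldc 100, goto 16.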

When an OSR happens in an inlined context, we recover a JVM scope descriptor for each virtual inlined stack frame. We generate a specialized version of each inlined method, constructed such that calling the specialized root method of the inlined context restores all the replacement stack frames. The prologue of the specialized root restores the root method's state, then immediately calls a specialized version of the callee. When the specialized callee returns, the root method continues execution immediately after the call. Figure 2 shows one simple example where OSR happens in the callee bar. The caller foo restores its execution state before the call site, calls the specialized version of bar (bar_prime in Figure 2), and then continues at the point following the call.

class C {
  static int sum(int c) {
    int y = 0;
    for (int i = 0; i < c; i++) {
      y += i;
    }
    return y;
  }
}

(a) a simple example

 0  iconst_0
 1  istore_1
 2  iconst_0
 3  istore_2
 4  goto 14
 7  iload_1
 8  iload_2
 9  iadd
10  istore_1
11  iinc 2 1
14  iload_2
15  iload_0
16  if_icmplt 7
19  iload_1
20  ireturn

(b) bytecode for sum

running thread: MainThread
frame pointer: 0xSomeAddress
program counter: 16
local variables: L0 (c) = 100; L1 (y) = 1225; L2 (i) = 50
stack expressions: S0 = 50; S1 = 100

(c) JVM scope descriptor for an activation of sum

ldc 100
istore_0
ldc 1225
istore_1
ldc 50
istore_2
ldc 50
ldc 100
goto 16
 0  iconst_0
...
16  if_icmplt 7
...
20  ireturn

(d) specialized version of sum for OSR transition

Figure 1. Example of OSR transition mechanism

void foo() {
  bar();
  A: ...
}

void bar() {
  ...
  B: ...
}

(a) Source code. Assume the optimizing compiler inlines bar into foo and then forces an OSR at program point B.

foo_prime() {
  call bar_prime();
  goto A;
  ...
  bar();
  A: ...
}

bar_prime() {
  goto B;
  ...
  B: ...
}

(b) target code for OSR transition

Figure 2. Example of OSR transition with inlining. Calling foo_prime restores the execution states of foo and bar and continues execution from program point B.

The IR models the semantics of an OSRPoint similarly to a CALL instruction that uses all the live variables needed to recover JVM state. Like a CALL instruction, an OSRPoint will thus constrain some optimizations, including dead code elimination, load elimination, store elimination, and code motion. Unlike a CALL instruction, an OSR point transfers control to the exit block, so there is no merge back to reachable code after an OSRPoint.

If an OSRPoint is required in an inlined method context, the instruction keeps alive all state required to recover the entire inlined context.

After register allocation, the compiler maps the state used by OSRPoint instructions to the corresponding physical positions (register number or spilling offset), maintaining type information. The system encodes the map in a table.

The compiler expands OSRPoint instructions into calls into the Jikes RVM thread scheduler. When called from an OSRPoint, the thread system suspends the calling thread. After compiling code for the specialized activation, the runtime system emits memory barriers to force SMP memory consistency, and then wakes up the thread and transfers execution to the new compiled code.
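A simplified sketch of this runtime sequence follows; every type and method here is an illustrative placeholder rather than a Jikes RVM API.

interface OsrRuntime {
    void suspend(Thread t);                           // park the calling thread
    Object extractScopeDescriptors(Thread t);         // step 1: compiler-independent state
    Runnable compileSpecialized(Object descriptors);  // step 2: code for this activation
    void fence();                                     // memory barrier for SMP consistency
    void transferTo(Thread t, Runnable newCode);      // step 3: resume in the new code
}

final class OsrTrigger {
    // Called when an expanded OSRPoint is reached.
    static void onOsrPoint(OsrRuntime rt, Thread caller) {
        rt.suspend(caller);
        Object state = rt.extractScopeDescriptors(caller);
        Runnable specialized = rt.compileSpecialized(state);
        rt.fence();                  // force memory consistency before resuming
        rt.transferTo(caller, specialized);
    }
}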

4 OSR Policy

4.1 Deferred Compilation Policy

The Self-91 implementation [5] introduced deferred compilation (sometimes referred to as uncommon branch extension or partial method compilation) as a technique to reduce compile time and potentially improve code quality. With this technique, the compiler avoids generating code that it predicts will rarely execute. If the prediction fails, the compiler traps, recompiles the method, and performs OSR to transfer to the new code.
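As a purely illustrative example (not taken from the paper) of the kind of block such a policy might defer: if profile data shows that the error branch below never executes, the compiler can skip generating its code and plant a trap or OSRPoint in its place.

static int checksum(int[] data) {
    if (data == null) {
        // Predicted uncommon: under deferred compilation no code is generated for
        // this block; reaching it at run time traps, triggering recompilation and OSR.
        throw new IllegalArgumentException("data must not be null");
    }
    int sum = 0;
    for (int v : data) {
        sum += v;
    }
    return sum;
}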

Previous work has presented several policies for predicting uncommon branches for deferred compilation. In Self, the compiler could often predict types with high confidence, and predict branches that signal the appearance of other types to be uncommon. For example, the compiler could predict that a variable used in an array subscript expression had concrete type integer. This technique was highly effective for Self, since this "pure object-oriented" language employs polymorphic types everywhere.

The HotSpot server compiler [12] similarly uses type prediction based on the current class hierarchy to inline currently monomorphic call sites, and traps and OSRs should class loading invalidate the prediction. This compiler also predicts other language features, such as class initialization, to be uncommon events.

Whaley [16] studied profile-directed deferred compilation, predicting uncommon branches based on offline profile data.

We have implemented deferred compilation combining some of the policies presented by Whaley and used in HotSpot. Namely, we have implemented an online version of profile-directed deferred compilation, as well as the class-hierarchy-based inlining traps used in HotSpot.

4.2 Integration into Adaptive Optimization System

As mentioned earlier, the Jikes RVM includes a sophisticated adaptive optimization system that drives online compilation. We modified the controller evaluation algorithm to account for deferred compilation.

As described in Section 2, the adaptive optimization system recompiles a method at the optimization level $j$ chosen to minimize $C_j + T_j$. Deferred compilation should accelerate compilation, so adaptive recompilation can afford to be more aggressive. To this end, we modified the analytic model to account for the reduced compile time. When considering a method $m$ for recompilation, the system computes the percentage of the source code, $f$, that profile data indicates was dynamically reached. Since nearly all the optimizing compiler phases are designed to run in linear time, we estimate the cost of optimizing to be $f \cdot C_j$. Thus, the analytic model accounts for the benefit of deferred compilation on a per-method basis.
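In symbols, a hedged restatement of this adjustment (keeping the notation of Section 2, with $f$ the dynamically reached fraction of the method's code):

C'_j = f \cdot C_j,
\qquad \text{recompile at the level } j \text{ minimizing } f\,C_j + T_j,
\qquad \text{provided } f\,C_j + T_j < T_i .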

To reduce overhead, the system computes the fraction $f$ only once for each method, and caches it. This might result in the model using an outdated $f$ for making decisions; we suspect this effect is small in practice since the model will not compute $f$ until the method has run for some time.

In addition to deferred compilation, we have implemented promotion, whereby the system may optimize a long-running activation running baseline-compiled code. Normally, the adaptive optimization system considers optimizing a method based on the total time spent in that method in the past. To reconcile promotion with this policy, when the system optimizes a method, it starts to monitor time spent in outdated versions of the code (i.e., baseline-compiled versions). To consider promoting an outdated activation, the system uses the default analytic model, but estimates the future time based on the time spent in the outdated code, rather than the total time spent in all compiled versions of the method.

5 Experimental Results

We have implemented these techniques in Jikes RVM v2.1.1. We report experimental results on the SPECjvm98 benchmarks [15]. The results reported do not conform to the official SPEC run rules, so our results do not directly or indirectly represent a SPEC metric. All runs use the size 100 inputs, and run each distinct benchmark in a new VM instance.

All experiments ran on a dedicated IBM RS/6000 Model F80 with six 500 MHz processors and 4 GB of main memory, running AIX 4.3.3. Jikes RVM was run with a FastAdaptiveSemispace configuration, using one virtual processor, 400 MB of small object heap, and 100 MB of large object heap.

5.1 Offline data

First, we examine experimental results based on offline "perfect profiles". That is, we gather the profile data offline, and then run the program using the same input. These experiments reveal, to the first order, an upper bound on gains possible with online deferred compilation. For these runs, we use a configuration of Jikes RVM that loads and precompiles all classes before the first run of the program; thus, compilation activity does not factor into the reported performance results.

We consider deferred compilation using one or both of the following two policies:

Edge counters: We defer compilation of any basic blocks that the profile data indicates were never executed during the profiling run.

Static heuristics: We defer compilation of the failed branch of any inline guard at a call site that is monomorphic in the loaded class hierarchy.

Note that using the perfect profiles, deferring compilation is safe in that the uncommon code never executes. We consider the following four variants of the system:

Ideal: This configuration defers compilation based on both edge counters and static heuristics. In place of uncommon code, the compiler inserts an unconditional trap instruction that does not keep any program state live.

Ideal-OSR: This configuration defers compilation based on both edge counters and static heuristics. In place of uncommon code, the compiler inserts an OSRPoint that keeps state live to allow recovery of a JVM scope descriptor.

Static-OSR: This configuration defers compilation based solely on static heuristics. In place of uncommon code, the compiler inserts an OSRPoint.

Eager: This configuration (the default Jikes RVM) does not perform deferred compilation.

In these offline experiments, the difference between Ideal and Ideal-OSR is that OSR points constrain optimization. Thus, the difference between these configurations quantifies the cost of these constraints, compared to a perfect oracle that could predict the future and throw away unreached code unconditionally.

The difference between Static-OSR and Ideal-OSR quantifies the benefits of performing deferred compilation based on profile data, as opposed to simply using OSR for class-hierarchy-based inlining.

Clearly, the impact of deferred compilation on compiler speed or code quality depends highly on the underlying compiler infrastructure. To explore this space, we report statistics for each of the optimizing compiler's three optimization levels.

Figure 3 shows the compiler speed, in bytecode bytes per millisecond, of each system configuration. The results show an average upper bound (Ideal) improvement in compile time, compared to eager code generation, of 3%, 13%, and 11% at optimization levels O0, O1, and O2, respectively. When constraining optimization to account for potential invalidation (Ideal-OSR), the improvements drop to 3%, 7%, and 10%, respectively. So, constraining optimization has a significant effect on compile-time improvements, especially at higher optimization levels. Across the board, maintaining OSR points only at statically-predicted inline guard failures actually slightly decreases compiler speed, as compared to simply generating the eliminated call instruction. We conclude that maintaining an OSR point in the IR entails slightly more work than maintaining a call instruction.

Figure 4 shows the size of the resultant machine code, in machine code bytes per bytecode byte. The results show an average upper bound (Ideal) improvement in generated code size, compared to eager code generation, of 7%, 16%, and 13% at optimization levels O0, O1, and O2, respectively. When constraining optimization with OSRPoints (Ideal-OSR), the improvements drop to 7%, 13%, and 9%, respectively. Maintaining OSR points only for inline guards has a negligible effect on code size; as expected, since the deferred compilation elides only a single call instruction and its associated setup.

Figure 5 shows the speed, relative to the eager configuration, of the resultant code for each configuration. The difference between the Ideal and Eager (baseline) configurations shows the maximum speedup the compiler could hope to obtain by this technique. The results show that on average, ideal deferred compilation could improve generated code quality by 0.5%, 1%, and 2.2% at optimization levels O0, O1, and O2, respectively. Naturally, improved dataflow helps more at higher optimization levels. The jess and mtrt benchmarks see the largest potential gain, each roughly 5%. Nevertheless, overall these numbers show only a modest potential for gain due to improved dataflow.

The Ideal-OSR configuration, which keeps OSR data live, on average does not change performance. The jess benchmark sees the largest improvement, of 4%. The 5% ideal mtrt gain vanishes, instead producing a 1.5% degradation. jack sees a substantial degradation, of 4%. Overall, the numbers indicate that the cost of constraining optimization with OSRPoints roughly negates the small benefit due to improved dataflow, with variances on individual benchmarks.

[Figure 3: bar charts of compilation rate (bytecode bytes per millisecond) for the Ideal, Ideal-OSR, Static-OSR, and Eager configurations on each benchmark and the geometric mean.]

Figure 3. Compiler speed using offline "perfect" profile data at optimization level a) O0, b) O1, and c) O2.

[Figure 4: bar charts of machine code size (machine code bytes per bytecode byte) for the same four configurations.]

Figure 4. Final optimized machine code size using offline "perfect" profile data at optimization level a) O0, b) O1, and c) O2.

[Figure 5: bar charts of speed relative to the Eager configuration for the Ideal, Ideal-OSR, and Static-OSR configurations.]

Figure 5. Speedup obtained (bigger is better) compared to the default (eager) code generation strategy, when using deferred compilation strategies at optimization level a) O0, b) O1, and c) O2. These results use offline "perfect" profile data.

The compiler should be able to approach the Ideal performance by sinking code into uncommon sections to materialize state that is otherwise dead. Whaley [16] implemented a specialized sinking transformation to accomplish this. We are currently working on improvements to the RVM optimizer's code motion, to more effectively sink computation into uncommon branch extensions.

Placing OSR points using the static strategy has a negligible impact on performance. This indicates that the Jikes RVM patch-point-based guarded virtual inlining [13] and static splitting heuristics [4] handle class-hierarchy-based inlining effectively.

5.2 Online data

Now, we examine the performance of deferred compilation in a fully automatic online adaptive optimization system. For these experiments, the baseline compiler inserts instrumentation into each method to gather edge counter data. That is, the system inserts counters that count the number of times each branch is taken and not taken. When optimizing a method, the optimizing compiler propagates the information and inserts an OSRPoint at the beginning of each basic block that the counters indicate was never reached while the unoptimized code ran.
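A sketch of this placement rule (our illustration; the counter representation is hypothetical): a block is deferred, and guarded by an OSRPoint, only if every branch edge leading to it has a zero count.

// Defer a basic block only if the baseline edge counters show that no incoming
// branch edge was ever taken while the unoptimized code ran.
static boolean shouldDeferBlock(int[] incomingEdgeCounts) {
    for (int count : incomingEdgeCounts) {
        if (count > 0) {
            return false;   // the block was reached at least once
        }
    }
    return true;            // never reached: emit an OSRPoint in place of its code
}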

Naturally, after code is optimized, it may traverse a new path and execute a block that was not previously executed. In this case, the system performs OSR to generate uncommon code as needed. When performing OSR, the system immediately invalidates the optimized code which triggered the deferred compilation. When the program next enters the invalidated method, the system re-optimizes the method without using deferred compilation.

We report performance based on wall-clock time, which includes all online profiling, decision making, recompilation, invalidation, and OSR transition activity.

Figure 6a shows the performance during the first run of each of the SPECjvm98 benchmarks. The results show that on average, the OSR techniques improve performance of the first run of a SPECjvm98 benchmark by 2.6%. Using only static placement of OSR points for inlining improves performance by 1.8%. The compress benchmark shows a significant improvement (around 8%); only this benchmark runs long-running loops which trigger OSR promotion from baseline to optimized code. The db and javac benchmarks show improvements of 8% and 4% respectively, presumably due to more aggressive compilation. The mtrt performance degrades by 2-4%. This benchmark contains a few hot methods that dominate performance after a short start-up phase. It appears that with OSR, the adaptive system is more aggressive compiling methods during the start-up phase. This delays the time to reach the computationally intensive methods, and delays optimization of these methods, hurting performance. It remains an open issue how to design an adaptive recompilation strategy that avoids this pathological behavior.

[Figure 6: bar charts of speedup relative to the Eager configuration for the OSR-edge counts and OSR-static configurations on each benchmark and the geometric mean.]

Figure 6. Speedup obtained (bigger is better) compared to the default (eager) code generation strategy, when using multi-level adaptive recompilation: a) first run, b) best run of 10.

Figure 6b shows the performance on the best of 10 runs of each benchmark, using the OSR techniques. The overall performance, compared to eager compilation, is virtually unchanged, although individual benchmarks vary slightly. This is consistent with the data from the offline study.

Table 1 shows the compilation activity during the first run of each benchmark, with and without OSR techniques. The first two columns report the number of on-stack replacement events in the first run of each benchmark. Only compress triggers promotion events. Six of the seven benchmarks trigger OSR invalidations due to deferred compilation. All these invalidations result from incomplete code coverage profile data; none represent failed inlinings invalidated by dynamic class loading.

[Table 1: per-benchmark counts of promotions, invalidations, and methods compiled at levels O0, O1, and O2, with and without OSR, for compress, jess, db, javac, mpegaudio, mtrt, and jack.]

Table 1. Compilation activity by the Jikes RVM Adaptive Optimization System during the first run of each SPECjvm98 benchmark, measured in methods compiled. Promotions represent transitions from baseline to optimized code for a long-running activation. Invalidations represent OSR transitions due to incorrectly predicted deferred compilation. The remaining columns report the number of methods optimized at each optimization level.

Thus, in all cases, by the time the adaptive system optimized a method, it had loaded all classes that affect inline guards. However, in a few cases, the online profile data did not cover all dynamically reached instructions, causing later invalidations. We verified that each invalidation takes at most on the order of 10 microseconds, thus causing a small direct performance hit.

The remaining columns of the table show the number of methods the adaptive controller chose to compile at each optimization level. Recall that using OSR, the controller model should compile more aggressively since it estimates a reduction in compile time. In general, the data bears this out, as overall the system compiles more methods using the OSR heuristics. The change in compiler activity is not dramatic (4% more methods compiled), corroborating earlier experiences that the analytic recompilation model is relatively insensitive to variances in compiler performance [1]. The table shows an interesting counter-intuitive trend in that with OSR, fewer methods rise to the highest optimization level. We suspect that the controller cost-benefit model assigns higher priority to the first (low-level) optimization of a method, since this optimization gives the greatest absolute gain. It seems that this phenomenon is exaggerated by the modified OSR controller model, causing more "low-level" optimization and indirectly delaying "high-level" optimization.

We note that on short time-scales, the behavior of the adaptive system depends on a complex and non-linear feedback loop, varying substantially based on competing estimates of costs and benefits and sample data. Although the behavior appears to be relatively stable on aggregate, it remains an open problem to better understand how changes to compiler performance and code quality translate into overall system performance.

6 Related Work

On-stack replacement techniques originated in Self-91's deferred compilation [5]. The Self-91 compiler defers generating code for the targets of uncommon branches, predicting these branches based on types appearing in normal code. In the case that an uncommon branch target executes, the system generates code on demand using uncommon branch extension. The original motivation for deferred compilation in Self-91 was to reduce compile time and avoid conservative dataflow merges. Later, Self-93 [10] used OSR to improve performance with adaptive recompilation, transferring execution from slow code to faster code on-the-fly. Self also pioneered the concept of using de-optimization to debug optimized code [9], which remains one of the most compelling applications of OSR. We have not yet implemented this application, as the Jikes RVM debugger framework is still immature and unstable.

Of current production Java Virtual Machines, only the HotSpot server compiler [12] has reported OSR technology. HotSpot transitions from interpreted code to optimized code for long-running loops, based on invocation and loop counters. Additionally, the HotSpot compiler uses deferred compilation to avoid generating code for uncommon branches, such as the failed branch of class-hierarchy-based guarded inlining, and program points that invoke class initializers. To our knowledge, HotSpot does not perform profile-driven deferred compilation except to inline call sites that profile data indicates are dynamically monomorphic.

Whaley’s[16]partialmethodcompilationessentiallyap-pliesSelf-91’suncommonbranchextensionto“rare”blocksasdeterminedbyheuristicsonprofiledata.Thepresentedresultsfocusmainlyoncompile-time;thereweresmallim-provementsinrunningtime,attributedtobetterdataflow.Themethodologyofourdeferredcompilationstudydif-fersfromWhaley’sinmanydimensions.Inparticular,fortheexperimentalresults,Whaleyimplemented“methodoutlining”atthebytecodelevel,ratherthanengineeringOSRintotheoptimizingJIT.Thishasimplicationsonbothcorrectness(aswediscussedforpseudo-bytecodes),andonthequalityofthegeneratedcode.Furthermore,sinceWha-leyusedtheIBMDKasa“blackbox”torunthetrans-formedcode,itisnotclearhowtheunderlyingJITin-linesandoptimizesthegeneratedsourcecode.Incontrast,ourexperimentalresultsbreakdowntheimpactofOSRpointsoncodequalityduetodataflowimprovementsandoptimizationconstraints.Finally,asdescribedinthelit-erature[2,14],thetwosystemsusedrasticallydifferentadaptiverecompilationstrategies,whichmakeexperimen-

existencebasedoptimizationsasOSRrestrictedtoOSRatmethodentry,whenreplacingastackframeistrivial.

7 Conclusion

We have presented a relatively simple on-stack replacement mechanism, and presented a comprehensive evaluation of online profile-directed deferred compilation and promotion of unoptimized code. The results show that in our implementation, deferred compilation provides modest gains in code quality, compiler speed, and code space. We showed how to integrate the OSR-based techniques into a model-driven adaptive optimization system, and showed that the system responds by more aggressive compilation, providing a small performance improvement.

Should a production virtual machine implement on-stack replacement? In our experience, this paper's OSR mechanism enables an implementation with minimal disruption to the rest of the code base. The ability to debug optimized code via de-optimization may justify the investment. Regarding performance improvements, our results so far show a modest but not compelling improvement from online profile-directed deferred compilation. Our results indicate that class-hierarchy-based inlining can be handled effectively with a combination of simpler techniques; namely, pre-existence, code patching, and static splitting. On the positive side, OSR protects against pathologically bad performance due to a single long-running loop; this pathology often appears in microbenchmarks.

We believe that more dramatic performance improvements will come with more advanced program transformations. OSR provides a remarkably powerful invalidation mechanism, which the optimizer can rely on to implement more invasive transformations such as object inlining, or to draw optimistic conclusions where conservative analysis fails. We plan to make the OSR implementation publicly available in a Jikes RVM release, and hope that its availability will spur future research in this direction.

8 Acknowledgements

The authors would like to thank Dave Grove for numerous productive discussions, and the anonymous reviewers for valuable feedback.

References

[1] M. Arnold, S. Fink, D. Grove, M. Hind, and P. Sweeney. Adaptive optimization in the Jalapeño JVM: The controller's analytical model. In 3rd ACM Workshop on Feedback-Directed and Dynamic Optimization (FDDO-3), Dec. 2000.

[2] M. Arnold, D. Grove, S. Fink, M. Hind, and P. Sweeney. Adaptive optimization in the Jalapeño JVM. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA 2000), Minneapolis, MN, Oct. 2000. Also published as ACM SIGPLAN Notices, volume 35, number 10.

[3] D. Bruening and E. Duesterwald. Exploring optimal compilation unit shapes for an embedded just-in-time compiler. In Proceedings of the 2000 ACM Workshop on Feedback-Directed and Dynamic Optimization (FDDO-3), Dec. 2000.

[4] C. Chambers. The Design and Implementation of the Self Compiler, an Optimizing Compiler for Object-Oriented Programming Languages. PhD thesis, Stanford University, Mar. 1992. Published as technical report STAN-CS-92-1420.

[5] C. Chambers and D. Ungar. Making pure object-oriented languages practical. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 1-15, Nov. 1991.

[6] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. An efficient method for computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451-490, 1991.

[7] D. Detlefs and O. Agesen. Inlining of virtual methods. In 13th European Conference on Object-Oriented Programming, June 1999.

[8] S. Fink, K. Knobe, and V. Sarkar. Unified analysis of array and object references in strongly typed languages. In Seventh International Static Analysis Symposium (2000), June 2000.

[9] U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. ACM SIGPLAN Notices, 27(7):32-43, July 1992.

[10] U. Hölzle and D. Ungar. A third generation Self implementation: Reconciling responsiveness with performance. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 229-243, 1994.

[11] T. Lindholm and F. Yellin. The Java Virtual Machine Specification, Second Edition. The Java Series. Addison-Wesley, 1999.

[12] M. Paleczny, C. Vick, and C. Click. The Java HotSpot(tm) Server Compiler. In USENIX Java Virtual Machine Research and Technology Symposium, pages 1-12, Apr. 2001.

[13] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. Overview of the IBM Java Just-in-Time compiler. IBM Systems Journal, 39(1), 2000.

[14] T. Suganuma, T. Yasue, M. Kawahito, H. Komatsu, and T. Nakatani. A dynamic optimization framework for a Java just-in-time compiler. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 180-195, Oct. 2001.

[15] The Standard Performance Evaluation Corporation. SPEC JVM98 Benchmarks. http://www.spec.org/osg/jvm98/, 1998.

[16] J. Whaley. Partial method compilation using dynamic profile information. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, Oct. 2001.
