UsingCompactFiniteAutomata
ChristopherL.Hayes
∗
DepartmentofElectricalandComputerEngineering
UniversityofMassachusettsLowell
Lowell,MA,01854USA
andYanLuo
hayesc3@rpi.edu,yanluo@uml.eduABSTRACT
DeepPacketInspection(DPI)hasbeenwidelyadoptedindetectingnetworkthreatssuchasintrusion,virusesandspam.Itischallenging,however,toachievehighspeedDPIduetotheexpandingrulesetsandeverincreasinglinerates.Akeyissueisthatthesizeofthefiniteautomatafallsbe-yondthecapacityofon-chipmemorythusincurringexpen-siveoff-chipaccesses.InthispaperwepresentDPICO,ahardwarebasedDPIenginethatutilizesnoveltechniquestominimizethestoragerequirementsforfiniteautomata.Thetechniquesproposedaremodifiedcontentaddressablemem-ory(mCAM),interleavedmemorybanks,anddatapacking.TheexperimentresultsshowthescalableperformanceofDPICOcanachieveupto17.7GbpsthroughputusingacontemporaryFPGAchip.ExperimentdataalsoshowthataDPICObasedacceleratorcanimprovethepatternmatch-ingperformanceofaDPIserverbyupto10times.
1.INTRODUCTION
CategoriesandSubjectDescriptors
C.2.0[ComputerCommunicationNetworks]:General—SecurityandProtection(e.g.,Firewalls);C.2.3[ComputerCommunicationNetworks]:NetworkOperations—Net-workmonitoring
GeneralTerms
Algorithms,Design,Performance,Security
Keywords
FiniteAutomata,ContentAddressableMemory,FPGA,In-trusionDetection
∗ChristopherHayeshassincemovedtoRensselaerPolytech-nicInstitute.
Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,torepublish,topostonserversortoredistributetolists,requirespriorspecificpermissionand/orafee.
ANCS’07,December3–4,2007,Orlando,Florida,USA.Copyright2007ACM978-1-59593-945-6/07/0012...$5.00.
Computernetworksarebecomingincreasinglyvulnera-bletomanythreats,suchasintrusions,worms,virusesandspam.Whenthesethreats,whichcontinuouslyincreaseinnumberandcomplexity,arecoupledwithconstantlyincreas-ingnetworklinerates,thesituationbecomesmoredifficulttocontain.DeepPacketInspection(DPI)playsanimpor-tantroleindetectingthreatsbysearchingthepayloadofnetworkpacketsforknownpatternsorsignatures[5,14].Itisalsoessentialintrafficcharacterizationandfine-grainednetworkmonitoring.
ADPIsystemperformsasetoftime-criticaloperationstosensecertainnetworkpatternsorbehaviorwhiletryingtominimizepacketprocessinglatency.Thefirststepistocap-turepacketsfromnetworkinterfacecards,reassembleandbufferthemforprocessing.Thesecondstepistosearchknownsignaturepatternsinthepayloadofpackets.Fol-lowingthat,aDPIsystemwillanalyzethematchedpacketsagainstsemantic-richpoliciestodetermineifanattackispresentoranalertshouldbetriggered.WearguethatthefirststepcanbespedupwithApplicationSpecificInte-gratedCircuits(ASICs)duetotheinvarianceofthetask,andthepolicyprocessingisbettercarriedoutwithgeneral-purposeprocessorsbecauseofthecomplexsemanticprocess-ing.Thematchingofsignaturepatternsfallsinthemiddleofthespectrum,andcanbenefitfromprogrammablehard-wareaccelerationbasedonFPGAsornetworkprocessors.Theexpandingsignaturesetsandincreasinglinespeedhavemadethesignaturepatternmatchingchallenging.Forexample,therulesetofthewell-knownintrusiondetectionsystem,Snort[19],contains4219rulesasofDec2005,andnewrulesareaddedconstantly.Variablepatternlengthandlocation,andincreasinglylargerulesetsmakepatternmatchingadifficulttask.Manystringmatchingalgorithmsexist,suchasthoseduetoBoyer-Moore[9],Wu-Manber[22],andAho-Corasick[4].Asmoregeneralcasesoffixedstrings,regularexpressionsusedtodepictpatternsignaturesfurthercomplicatethetaskduetopossibleexponentialnumberofstates.Therehavebeennumerousstudiesonregularexpres-sionmatching[10,23,12,2].DifferentplatformshavebeenusedtoperformDPI,includingASICs[20],FPGAs[11]andnetworkprocessors[17].Mostofthesemethodsrelyonstatemachinesorfiniteautomatatomatchpatterns.However,oneofthekeyissuesisthatthesizeofthefiniteautomataissolarge(oftenatleasttensofmegabytes)thattheyhavetobestoredinoff-chipmemorymodules.Asaresult,thesearchingontheautomataincursalargenumberofoff-chip
195
memoryaccessesthatleadtounsatisfactoryperformance.Thusitisimportanttominimizethestoragerequirementsofstatemachinessuchthattheycanbeputintoon-chiporotherfastmemorymodulestoachievehigh-speedpatternmatching.
Wearemotivatedtostudyefficientmemoryutilizationofprogrammablehardwarearchitecturetoimprovetheper-formanceofsignaturepatternmatching.Inthispaper,wepresentDPICO,ahardwarebasedDPIenginethatutilizesnoveltechniquestoreducethestorageforfiniteautomata.WefirstanalyzethestorageoftraditionalDeterministicFi-niteAutomata(DFA)andpresentthebaselinedesign.Tak-ingadvantageofthemulti-portedon-chipmemorybanksonmodernASICsandFPGAs,wethenusethreetechniquestoreducetheredundantinformation:(1)modifiedcontentad-dressablememory(mCAM),(2)interleavedmemorybanks,and(3)datapacking.Thecombinationoftheseconceptsallowsdatatobeorganizedinamoreefficientway.Ouranalysisshowthatthestoragecanbereducedbyover90%.WehaveimplementedanFPGAbasedprototypeincorpo-ratingtheproposedtechniques.TheDPIperformanceofthesystemisshowntobescalableandreachupto17.7GbpsusingacontemporaryFPGAchip.WeevaluatetheperformanceimpactofincorporatingaDPICObasedaccel-eratorinanx86/PCIeserverarchitecture.ThedatashowthatthepatternmatchingperformanceoftheDPIservercanbeimprovedbyupto10times.
Thecontributionsofourworkareasfollows.WepresenteffectivetechniquestostorecompactfiniteautomatainFP-GAs,takingadvantageoftheirarchitecturalfeatures.OurscalabledesignprovidesanempiricalbasisforarchitectinghighperformanceDPIsystems,wherethepatternmatchingisanimportantprocessingphase.Inaddition,finiteau-tomataisnecessaryinupperlevelsemanticpolicyprocess-ing,thusourapproachcanbeextendedtospeedupotherstatefulpacketinspectionoperations.
Thepaperisorganizedasfollows:Section2reviewsre-latedwork,Section3describesamotivatingexampleandthebaselinedesign,Section4elaboratestheproposedtech-niquestoreducethestorageofDFA,andtheperformanceresultsarepresentedinSection5.Finally,thepaperiscon-cludedinSection6.
2.RELATEDWORK
StringmatchingtechniquessuchasBloomFilters[8]andBoyer-Moore[9],Wu-Manber[22],andAho-Corasick[4]al-gorithmshavebeenthefoundationformanysignature-baseddetectionenginesovermanyyears.Someoftheseconceptshaveevenbeenexpandedtosearchagainstregularexpres-sionsinsteadoffixedstrings[10,23].Regularexpressionsin-creasethecapabilityandmaintainabilityofthreatdetectionsystemsbyincreasingtheflexibilityofthreatdefinitions.Schaelickeetal.characterizedtheperformanceoftheSnortsoftwareongeneral-purposeprocessors.Theyshowedthattheirhighestperformancetestsystemcouldonlysi-multaneouslyhandle217payloadrulesonanetworkrun-ningata100Mbpsrate[18].Morerecently,DregeretalprovidedinsightstotheperformancelimitationsofaBroNIDS[15]onacommodityPCandexploredwaystomitigateitsresourcedemands.Throughextensiveexperiments,theyshowedthatGbpsnetworkintrusiondetectionratecannotbeachievedwithoutcarefullytuningthesystem.Clearly,theperformanceofnetworkintrusiondetectionimplemented
ongeneral-purposeprocessorsisdeficientwhenconsideringincreasingspeeddemands.
Paxsonetalproposedtorethinkthehardwaresupportforefficientnetworkanalysisandintrusionprevention[16].Theydescribedahigh-levelnetworksecurityanalysispipeline,wheretheprotocolanalysisisoneofthestages.Patternmatchingagainstpacketpayloadisindispensabletoclassifyflowsasincreasinglymoreapplicationscannotbereliablyidentifiedbymerelyexaminingtransportportnumbers.Manysignature-basedsystemshavebeenarchitectedfortheFPGA[21]andASIC[10],takingadvantageofthepar-allelstructuresavailableinthesedevices.ThesedesignsarepredominantlybasedontheAho-Corasickalgorithmorotherfiniteautomaton-likestructures.Thesearchitectureshaveincrementallyimprovedthespeedorstorageutiliza-tionofsignaturematchingthroughmodificationoftheim-plementation.
Theseimplementation-basedimprovementsarecomple-mentedbyalgorithmicimprovementsdirectedatmodifyingthefiniteautomatathemselves.Kumaretal.[12]modifiedaDFAtocombinecommonoutputtransitionsofindivid-ualstatesbycreatingadefaulttransitionbetweenthosestates.ThismodifiedDFA,knownasthedelayed-inputde-terministicfiniteautomataorD2FAisfoundtoreducetheDFAstorageofapartitionedCiscorulesetfrom92MBto2MB[12].Thissignificantsavingsbeginstobringfi-niteautomatonprocessingoflargerulesetsintoarangethatisconceivableonanFPGAornetworkprocessor.Thisstorageimprovement,however,isprovidedatthecostofthedelayedinputbehavior,reducingtheaverageprocessingthroughput.Alongwithtransitionreduction,statemergingtechniqueshavealsobeenproposed[6].
LinetalproposedtoreduceFPGAlogicsusedtomatchregularexpressionsthroughsharingcommonsub-regularex-pressions[13].TheirmethodissimilartoanumberofNon-deterministicFiniteAutomata(NFA)basedapproaches[7],whosedrawbackisthatthosedesignscannoteasilyaccom-modatenewsignaturepatterns.Anewcompilation-synthesis-placement-downloadprocedureisneededbecauseregularex-pressionsare“hardcoded”intoFPGAlogicelements.Inthispaper,wefocuson“memorybased”finiteautomatabecauseofitssignificantadvantageover“logicbased”finiteautomata:thememorybasedFAcanbeeasilyupdatedtoincorporatenewsignaturepatternswithoutreprogrammingFPGAs.
Therearetwopopularcategoriesoffiniteautomata:Non-deterministicFiniteAutomata(NFA)andDeterministicFi-niteAutomata(DFA)[23].Theydiffersignificantlyinthecomplexityofstorageandsearching.ThespacecomplexityofNFAisO(n)anditssearchingcomplexityisO(n2),whileDFA’ssearchingcomplexityisO(1)atthecostofspacecom-plexityO(2n).ForDPIsystems,DFAisthepreferredstatemachineespeciallyfordelaysensitivenetworkapplications,thusitisthefocusofthispaper.
3.3.1
MOTIVATIONANDBASELINEDESIGNMotivatingExample
Weassumethataspecificnetworkattackispresentwhenthestring“root”isfoundinthepayloadofapacket.Sec-ondly,weassumeadifferenttypeofattackisbeingprose-cutedwhenthestring“rmdir”isfoundinapacket.Eachofthesestringsformthefoundationofaruleandtogether
196
d|i|m|o|t0d|i|trm2dd|i|m|o|t3i4d|i|m|o|t5rr8rrrrr1ord|i|m|td|i|m|oo7td|i|m|o|timumamountofmemory(S)neededforafiniteautomatonisderivedfromthenumberofstatesintheautomatonandthenumberofstringsorregularexpressionsthatcanbematched.
S=2bNSlg(NS)+NSlg(NM+1)
wherebisthenumberofbitsintheinputcharacter,NSisthenumberofstatesintheautomaton,andNMisthenumberofMatchIDsintheautomaton.
Iftheinputtothestatemachinewerea3-bitcharacter(toencodethealphabetinthemotivatingexample),thebaselinedesignwouldrequire72locationsinmemory;theproductofthenumberofpossibleinputcharactersbythenumberofstates.Someoftheinformationstored,however,isredundantasthereareonly24uniquetransitionsstoredinthose72locations.Itisthroughtheeliminationofthisredundancythatweachieveincreasedstorageefficiency.
i|m|o|td|m|o|t64.
the
Regular
Expression
PROPOSEDTECHNIQUES
Figure1:DFA{root|rmdir}.
for
theymakearulesetthatcanbeusedoneverypackettofilterbothtypesofattacks.
Here,wecompiletherulesetintoaDFAthatallowstherulestringstobefoundonatanypointinthedatastream,evenwhenmultiplestringsoverlap.Figure1showsaDFAfortheregularexpression{root|rmdir}withtheinputal-phabetΣ={d,i,m,o,r,t}.TheDFAcontainsninestates,twoofwhichareacceptingstates,and24distincttransitions.Whenacceptingstatefiveisreached,thestring“rmdir”hasbeenmatched,andlikewise,whenacceptingstateeightisreached,thestring“root”hasbeenmatched.
3.2Baseline
WeuseatraditionalDFAimplementationasabaselineforperformancecomparison.Inthismethod,thestatema-chineisimplementedintwomemorymodules.Onememorymodulecontainsthestatetransitiontableandthesecondmodulecontainsthematchidentifiers.Thecurrentstatepointerandtheinputcharacterarecombinedandusedastheaddresstolookupthenextstatepointer.Thecurrentstatepointerisalsousedasanaddresstolookupthematchidentifierforthecurrentstate.
Input CharTransition Table (RAM)Current State PtrOutput Table (RAM)Match IDIngeneral,multipleregularexpressionsareencodedintoasinglefiniteautomaton.Whenanacceptingstateoftheautomatonisreached,aregularexpressionismatched.EachacceptingstateisgivenaMatchIDtosignifywhichregularexpressionhasbeenmatched.IftheMatchIDisallzeros,thecurrentstateisnotanacceptingstateandhencethereisnoregularexpressionmatch.
Inourdesign,weassumethatthestatetransitiontableforthefiniteautomatawillbestoredinauniformcostRAMasshowninFig.2.Eachstateofafiniteautomatonconsistsofoneormorenext-statetransitions.Atransitionwillrequireexactlyonelocationinmemory.StatesintheDFAcancon-tainavariablenumberofnext-statetransitions.Next-statetransitionsforanygivenstatewillbestoredcontiguouslyinthememory,andlikewise,statesthemselveswillbestoredcontiguously.
Insuchadesign,wetakeadvantageofthreetechniquestoallowforthereductionofredundantinformation:(1)modi-fiedContentAddressableMemory(mCAM),(2)interleavedmemorybanks,and(3)datapacking.Thecombinationoftheseconceptsallowsdatatobeorganizedinamoreefficientway.
4.1ModifiedContentAddressableMemory
Figure2:BaselineStateMachineBlockDiagramWhenthismethodisused,eachstateisgivenafixedandconstantamountofmemory.Thissectionofmem-orycontainsalocationforeachpossibleinputcharacter.Forinstance,iftheinputcharacterwereeightbitswide,eachstatewouldrequireenoughmemorytocontain256(28)next-statetransitions.Whenusingthistechnique,themin-
ForagivenDFA,thememoryinthebaselinedesignwillcontainrepeateddata.Forinstance,agivenstatewillhavemanytransitionsthathavethesamenext-statepointer.Sincethebaselinedesignexplicitlystoreseverytransition,therepetitionofdataisinherenttothearchitecture.Toremovethepointerrepetition,itisnecessarytomodifythenext-statelookuptechnique.Thisisaccomplishedbycre-atingtwotypesoftransitions:labeledtransitionsandde-faulttransitions.TotranslateanexplicitstatetransitiontableforaDFA,whereeverypossibletransitionisstored,toanimplicitone,wetakethemostfrequentnext-statepointerforagivenstateandmakeitthedefaulttransition.Allothertransitionsareconvertedintolabeledtransitions.Eachstatecancontainuptoonelabeledtransitionforeachlabelandeachstatewillcontainexactlyonedefaulttransition.
Thisdefault/labeledtransitiontechniquecanbeappliedtoallthedeterministicfiniteautomata(DFA).SomeDFAvariantssuchasthedelayed-inputfiniteautomaton(D2FA[12]),haveinherentdefaulttransitions.Thesetransitionsaredifferent,however,becausetheycausetheinputtobe
197
delayedwhentheyaretaken.Statesthatdonothavede-layeddefaulttransitionscanbegivenanon-delayeddefaulttransitionthatcanbeselectedinthesamemannerastheDFA.Inthiscase,thedelayedandnon-delayeddefaulttran-sitionsneedtobedistinguished.
Usingthesetransitions,weconstructamodifiedCAMstructure(mCAM)whereeachstatehasitsownassociativememory.Fig.3depictsasetofstatesstoredinmCAM.La-belsfromthelabeledtransitionsarethekeystotheCAMandthenextstatepointerinformationisthedataintheCAM.Thedefaulttransitionandstateinformationarealsostoredinthememoryandarenotaccessedassociatively.Thesizeofstatesisnon-uniformthusithastobestoredwithineachstateforidentifyingtheboundaries.Thevari-ablesizeofstatesmakesthestateaccessandtransitionnon-trivial.
Key nmData nmtomatontothetotalnumberofpossibletransitions,2b.Asthatratiodecreases,thepotentialstoragesavingsincreases.Theminimummemoryrequiredcanbecalculatedas
Seff=NS(lg(NM+1))+NT(lg(NT)+b)
r=
NT
,NT≥NS
2bNS
Key 2Data 2Key 1Data 1CAM[m] SizeDefault Data [m]Key n2Key 2Key 1CAM[2] SizeKey n1Key 2Key 1CAM[1] SizeData n2whereNTisthenumberoftransitionsintheautomaton.Fromthis,weevaluatethedifferencesinminimumstor-agerequirementstoseewherethismethodismosteffective.TheresultsareshowninFigure4wherexaxisisthenum-berofstatesintheDFAandyaxisisthepercentageofmemoryreduction((S−Seff)/S).Theproposedmethodismosteffectivewhenthefiniteautomatawithlessaveragetransitionsperstate.Also,anyfiniteautomatonwithara-tioabove0.5doesnotbenefitfromthismethodsincetheoverheadofstoringlabelsinthememoryoutweighsthesav-ingsfromstoringlesstransitions.RulesetsusedinrecentresearchbyKumaretal.[12]resultintransitionratiosbe-tween0.01and0.09whentheserulesetsimplementedasaDFA.Mostoftheserulesetsalsohavetransitionratioslessthan0.5whenimplementedasD2FA.ThustheproposedmodifiedCAMstructurehaspotentialtoreduceboththeDFAandD2FAstoragesignificantly.
100
……Data 2Data 1Default Data [2]Data n1Data 2Data 1Default Data [1]…… 50
Percent Improvement 0
Figure3:mCAMDiagram
Alltransitionsforastate(defaultandlabeled)arereadfromthememory.Thelabelsarecomparedagainstthein-putcharacter.Iftheinputcharactermatchesoneofthela-belsonalabeledtransition,itsnext-statepointeristaken,otherwise,thenext-statepointerforthedefaulttransitionistaken.Itiscleartoseethatthistechniquerequiresalotofmemoryaccessesinordertoretrievethepointerforthenextstateifastatecontainsahighnumberofuniquenext-statetransitions.Later,wewillseethatthisissuecanbeaddressedintheFPGAimplementationsuchthatthenext-statepointercanberesolvedinasingleclockcycle.ThismCAMtechniquehasbothadvantagesanddisad-vantages.Thenumberoflocationsrequiredforthestateisreduced,however,thelabeledtransitionsrequirestorageofthenextstatepointeralongwiththelabel.Secondly,thestatenolongerrequiresconstantspaceinthememoryandcanbepackedmoretightlyinmemory.Thisimpliesthatthenextstatepointerswouldneedtoincreaseinsizetoaccountforthefactthatstatesdonotoccuronregularboundaries.Finally,thematchidentifierforthestatemustbestored.Sinceeverystatecontainsexactlyonedefaulttransition,thematchidentifierisstoredwiththedefaulttransition.
Theminimumamountofmemoryrequiredforthismethodisbasedonthenumberofstates,matchidentifiersandla-beledtransitions.Inthiscase,theeffectivestoragesavingsfromstraightforwardtechniquewillbedictatedbytheratio,r,oftheaveragenumberoftransitionsperstatefortheau-
-50
-100
Ratio = 0.062500Ratio = 0.125000Ratio = 0.250000Ratio = 0.375000Ratio = 0.500000
0
1000
2000
3000
4000 5000 6000
Number of States
7000
8000
9000
10000
-150
Figure4:StorageImprovementwithVariedTransi-tionRatios(b=8)
4.2InterleavedMemoryBanks
TheCAMimplementationwillneedtoanalyzeallofthelabeledtransitionsandthedefaulttransitionssimultane-ouslyinordertoevaluateastateinasinglecycle.CurrentFPGAproductsgenerallycontainhundredsofindividualmemoryelementson-chip.Usingtheseelements,wepro-posethatthestatetransitioninformationbeorganizedintosequentiallyinterleavedmemorybanks(showninFig.5)totakeadvantageofthepotentialmemorybandwidthavail-ableinFPGAdevices.Ifthenumberofbanks,n,isgreaterthanorequaltothenumberofnext-statetransitionsforthelargeststate(Nmax),allthetransitionscanbereadsimul-taneouslyandprocessedinparallel.Thisarchitecturewillalloweachstatetobeprocessedwithinasingleevaluationcycle.Ifthenumberofnext-statetransitionsexceedsn,a
cycles,whichtransitionwilltakeaconstanttimeofNmaxncanbeaddressedthroughpipelineddesign.Infact,on-chip
198
256andmorememorybankshavebecomecommoninmod-ernFPGAs[3],whichguaranteeone-cyclestateevaluationandtransition.
Phys. Addr3210Bank 0Location 12Location 8Location 4Location 0Bank 1Location 13Location 9Location 5Location 1Bank 2Location 14Location 10Location 6Location 2Bank 3D...Calc AddrROM 0Calc AddrROM 1...Calc AddrROM n-1Location 15Location 11Location 7Location 3QRAM 0RAM 1...RAM n-1State Start OffsetSelect Default Transition(Mux)Match IDInput Char.Select Labeled TransitionState End OffsetFigure5:SequentiallyInterleavedMemoryBanks(n=4).
Oneachreadcycle,nwordsarereadinparallelfromtheinterleavedmemory.Anylocationinanybankcanbethestartaddressforthen-wordread.Forexample,areadac-cesscanretrievewords5,6,7and8,whichspanintwocontinuousrows.Sinceastatecanhavelessthanntransi-tions,thesizeofthestatemustbeencodedsothatthereadlogiccanignoreunnecessaryinformation.Here,weproposetoinserttheendoffsetofthestatetothefirstlocationofthestatesothattheframingcanbedeterminedappropriately.Eachlocationinthememoryeithercontainsadefaulttransitionoralabeledtransition.Thelocationwithala-beledtransitionwillcontainthelabelandthenextstatepointerassociatedwiththetransition.Thelocationwithadefaulttransitionwillcontainthenextstatepointerforthedefaulttransitionandnecessarystateinformationsincethereisonedefaulttransitionperstate.Thestateinforma-tionincludestheoffsetofthelasttransitionforthestateandthematchidentifierforthestate.Sinceeverystatehasadefaulttransition,itisplacedasthefirstlocationforthestate.Figure6showsthestorageoflabeledtransitionsanddefaulttransitions.Thebitrangeofeachfieldisindicatedinthefigure.
Labeled Transitionlg(NT)+b-1
Next State Pointerb-1
Label0
Select Address (Mux)Figure7:BlockDiagramofDPICO.
4.3OperationofDPICO
Sincenisequaltoorlargerthanthenumberoftransitionsforthestate,allthetransitionsarereadsimultaneouslyfrommemorybanks0throughn-1.Thetransitionsarethenpro-cessedtoidentifythepropernextstateusingselectionlogic.ThelogicofselectinglabeledtransitionisshowninFig.8.AllthetransitionsreadfromtheRAMarequalifiedbasedonthebeginningoffsetofthestateandtheendoffsetreadfromthedefaulttransition.Thememoryoutputsoutsideoftherangearesimplyignored,andonlythequalifiedla-beledtransitionsarecomparedagainsttheinputcharacter.Ifamatchexists,amultiplexerselectstheoutputoftheRAMthatcontainsthatlabel.Ifthereisnomatchthenextstatepointerfromthedefaulttransitionisused.Oneachreadcycle,thematchidentifier,whichisalsocontainedinthedefaulttransitionlocation,isoutputfromthestatema-chine.Thedelayofaddresscalculation,memoryaccessandnextstateselectionlogicdeterminestheclockcycletime.Sincetheengineconsumesoneinputcharacterperclockcy-cle,themaximumprocessingthroughputofthisapproachforaDFAissimplytheproductoftheinputcharactersizeandtheprocessingclockfrequency:T=b·fmax.
Transition LabelsDefault Transitionlg(NM)+lg(NT)+b-1
MatchIDlg(NT)+b-1
Next State Pointerb-1
EndLocnState Start OffsetState End Offset.........Next State Pointers0
Input Char.Compare Labels...Qualify LabelsFigure6:Storageofdefaulttransitionandlabeledtransition.
Usingtheseconcepts,weproposeageneralarchitectureforperformingregularexpressionmatching.Fig.7depictstheblockdiagramoftheDPIenginecalledDPICO.Thearchitectureconsistsofcomponentsforaddresscalculation,transitionstorage,labelcomparisonanddeterminingnextstateaddress.Thetransitionsarestoredinasequentiallyinterleavedmemoryasdiscussedpreviously.Eachmemorybankhasanaddresscalculationunittogenerateproperbankaddressbasedonthecurrentstateaddress.Theinputchar-acteriscomparedagainstthelabelsofthecurrentstatestatethroughasetofselectionlogic(bottomportionofFig.7.)
EncodeMultiplexerNext State Ptr. ValidSelected Next State PointerFigure8:DiagramofSelectingLabeledTransition
4.4PackingAcrossMemoryBoundaries
Inpracticalapplication,theprevioussolutiondoesnotequaltheminimummemoryutilizationdescribedasSeff.Instead,theactualmemorysize(Sact)isdrivenbythesize
199
PhysicalLogicalBankAddrAddr55554444333322221111000032103210321032103210321023222120191817161514131211109876543210UnusedEndB: 3UnusedUnusedEndB: 1UnusedUnusedEndB: 2UnusedEndB: 3UnusedEndB: 1UnusedUnusedEndB: 3UnusedUnusedEndB: 0UnusedUnusedUnusedEndB: 1UnusedEndB: 1Ptr: 2Ptr: 0Ptr: 2Ptr: 22Ptr: 0Ptr: 2Ptr: 19Ptr: 0Ptr: 2Ptr: 0Ptr: 14Ptr: 0Ptr: 2Ptr: 12Ptr: 0Ptr: 2Ptr: 9Ptr: 0Ptr: 2Ptr: 16Ptr: 6Ptr: 0Ptr: 2Ptr: 0Label: rState 8MatchID: 2Label: rLabel: tMatchID: 0Label: rLabel: oMatchID: 0Label: rMatchID: 1Label: rState 4MatchID: 0Label: rLabel: iMatchID: 0Label: rLabel: dMatchID: 0Label: rLabel: oState 1Label: mMatchID: 0Label: rState 0MatchID: 0State 2State 3State 5State 6State 7Addr13
1211109876543210
UnusedLabeled-Trans. 5Labeled Trans. 4Labeled Trans. 3Labeled-Trans. 2Labeled Trans. 1Default TransitionLabeled Trans. 3Labeled-Trans. 2Labeled Trans. 1Default TransitionUnusedLabeled Trans. 7Labeled Trans. 6Labeled-Trans. 5Labeled Trans. 4Labeled Trans. 3Labeled-Labeled Trans. 1Trans. 2Default TransitionState 2
State 1
State 0
Figure10:ExamplePackedMemoryMap
Table1:CompressionResults
RuleSetimapftpnetbiosnntpexploitNo.ofRules467663313122BaselineSize(kbits)16,526.811,448.42,146.77,820.855,270.1DPICOUnpackedSize(kbits)698.4522.265.0322.67,182.9DPICOMin.Size(kbits)557.8408.753.1262.54884.0Trans.Ratio(r)0.0180.0170.0110.0170.046Pct.Savings96.596.497.596.691.2Figure9:Memoryorganizationfor{root|rmdir}.ofamemorylocation,L,whichisdeterminedbythesizeofthedefaulttransition,SD,andthelabeledtransition,SL.
Sact=L·NT,L=max(SD,SL)
Fig.9showsthememoryorganizationofthemotivatingexample.Itcanbeseenthatsomeunusedbitsarepresentineverylabeltransition.Thewastedmemoryspace(denotedas“unused”inFig.9)comesmainlyfromthedisparityinthesizeofthedefaultandlabeledtransitionsandtheas-sumptionthateachtransitionwouldoccupyonelocationinmemory.Thelabeledtransitionstendtohavelessinforma-tionandthereforegenerallysmallerthandefaulttransitions.Onemethodtobetterapproachtheminimummemorycalculationistopackthelabeledtransitionsacrossmemorylocationboundaries.Figure10showsanexampleofthepackingstrategyforastatemachinewithvariednumbersoftransitionsperstate.Thelargerdefaulttransitionoccupiesafullmemorylocationandisalwayspositionedonregularmemorylocationboundaries.Labeledtransitionsarepackedbetweendefaulttransitions.Somespacemaybeunusedforastate.Thisunusedspace,however,tendstobesmallerthantheunusedspaceforanunpackeddesign.Thispackingdoesnotincreasethecomplexityandlatencyofselectingthenextstatebecauseitonlychangeswhichbitsarefedtothecomparisonlogicandallbitsarereadsimultaneouslyanyway.Thememorybandwidthissavedeffectively.
impactofincorporatingsuchapatternmatchingaccelera-tortoacommodityserverarchitecturesinceourimmediategoalistospeedupthepatternmatchingphaseoftheDPIworkloadwithanaccelerationcard.WethendiscusstheapplicabilityofDPICOason-chipaccelerators.
5.1PerformanceCharacteristicsofDPICO
5.PERFORMANCEEVALUATION
WeevaluateourDPICOengineintwoaspects.Firstwestudytheeffectivenessofthedesignandexplorethedesignspace.Particularlyweapplypipeliningtechniquetoopti-mizethedesign.Second,weinvestigatetheperformance
WecharacterizedtheperformanceoftheDPICOdesignintwoways-(1)bymeasuringthecompressionfactoragainstaparticularDPIrulesetand(2)byfindingthespeedandsizingcharacteristicsofthedesignwhenimplementedinanFPGA.Thissubsectionsummarizesthoseresults.
Toshowcompressionresults,webeginwithasubsetoftheSnortruleset.Wechoosefiverepresentativerulefiles,namelyimap,ftp,netbios,nntpandexploitbecausetheirsizeisrelativelylargeinthewholeset.TherulefilesareconvertedintoDFAandD2FA,inthememorytableformatofthebaselinedesign.Next,theserulesetswereconvertedintothetableformatoftheDPICOdesign.Thesizingre-sultsarefoundinTable1whichshowsoverallDPICOcanreducememoryusagebyover90%forallrulesetsunderstudy.Datapackingcanreduceabout20-30%ofthestor-age.
TheDPICOhasbeenwritteninVHDLandtargetedforaXilinxVirtex4SX35FPGA.ThedesignwassimulatedusingaVHDLsimulator,synthesizedwithSynplicitySynplifyandimplementedusingtheXilinxISEtoolchain.
ThebaselineDFAdesignwasfoundtooperateatamaxi-mumof267.7MHzusingalmostnologicresourcesincludingLookUpTable(LUT)andRegister(REG).Table2showsthespeedandsizingresultsfromthestoragesavingdesign
200
Table2:SpeedandSizeoftheDPICO
No.ofMemoryBanks248163264128256LUT11418332069816723541765916052REG1292043526421252234648109563fmax(MHz)144.9122.3106.998.784.878.974.068.1Tmax(Mbps)1159.2978.4855.2789.6678.4631.2592.0544.8showsthenumberofDPICOenginesimplementedonthechip,i.e.thenumberofpacketflowsthatareprocessedsi-multaneously.Wealsocomparethenon-pipelineddesignwiththepipelineddesign.ItcanbeobservedthatmultipleDPICOengineshavelittlenegativeimpactontheopera-tionfrequencyofeachengine.Thetotalthroughputofthepatternmatchingprocessingisscalabletothenumberofenginesonachip.WithacontemporaryFPGAchip,theperformancecanreach17.7Gbps.Again,fourbanksgivethebestoverallperformanceduetothecomplexityofcompar-isonlogics.TheresultsalsoimplythepotentialofDPICOonmoreadvancedFPGAchips.
Table3:SpeedandSizeofPipelinedDPICO
No.ofMemoryBanks24816AddedPipelineStages
1223LUTREGfmax(MHz)247.8307.4251.9271.9Tmax(Mbps)1982.42459.22015.22175.2Non-pipelinedTmax(Mbps)1159.2978.4855.2789.65.2
PerformanceImprovementtoaDPIServerArchitecture
102189336596149253418889whenthedataisunpackedandtheautomatonisaDFA.Itisassumedthatthenumberofparallelreadsisgreaterthanorequaltothemaximumnumberoftransitions,bothdefaultandlabeled,inanystate.Thethroughputgivenassumes8-bitinputcharactersareused.Theresultsshowthattheclockfrequencydecreasesasthenumberofbanksincrease.ThisisbecausethepropagationdelayofthelogicencompassedintheSelectLabeledTransitionBlockandtheSelectDefaultTransitionBlockmustincreaseasthenum-berofbanksincreases.Theycontainmultiplexersthatmustbesizedappropriatelytoselectdatafromtheappropriatenumberofbanks.Largermultiplexershavelongerpropaga-tiondelays,subsequentlydecreasingtheclockfrequencyatwhichthedesigncanoperate.
WethenaddpipelinestagestotheDPICOdesigntomin-imizethepropagationdelaybetweenclockedelements(reg-istersorblockmemory),thusincreasingthemaximumclockfrequency.Whenpipelinestagesareadded,thestatema-chinecanprocessmultipletime-multiplexedstreamsofdata.Thenumberoftimeslotsisequaltothenumberofpipelinestages,andthebandwidthofeachindividualdatastreamisthetotalbandwidthofthestatemachinedividedbythenumberofclockedstages.Thepipelineddesignpresentsatradeoffbetweenmaximumprocessingbandwidthfortheengine,whichincreaseswhenthenumberofpipelinestagesincreases,versustheprocessingbandwidthforeachstream,whichdecreasesasthenumberofpipelinestagesincreases.Table3showstheresultsforvariedpipelineddesigns.Eachrowofthetablehasdifferentnumberofmemorybanksandpipelinestages.Itcanbeseenthatpipelineddesignbringssignificantimprovementovernon-pipelineddesign.There-ducedclockfrequencywithmorememorybanksaffecttheoverallperformanceofpipelineddesigns:4banksofmemoryoutperformtheotherconfigurations.
DPICOenginescanprovidescalableperformancebyin-corporatingmultipleenginesononeFPGAchiptoexploitpacketlevelparallelism.Weimplementsuchamulti-enginedesigninwhicheachenginehandlesapacketflowindepen-dently.DetailedperformanceresultsareshowninTable4.Thefirstcolumnisthenumberofmemorybanksundertestincluding2,4,8and16.Thesecondcolumnofthetable
OurimmediategoalistoincorporateDPICOaccelera-tioncardintoacommodityserverarchitecturewithx86coresandPCIExpressbuses.TheDPICOenginesitsintheFPGAonthePCIeacceleratorandDFAsarepreloaded.ADPIsystemsuchasSnortandBrorunsonx86cores,whichsendpacketstreamstotheacceleratorforpatternmatch-ing.ThepackettransferisoverthePCIebususingDMAoperations.TheDPICOenginesearchthepayloadagainstDFAsandsendstheresultsofpatternmatchingbacktothex86cores.We’dliketostudythepotentialofperformanceimprovementofsuchaconfiguration.
Wesetuptheexperimentasfollows.WeuserealisticrulesfromSnortrulesetastheregularexpressionstobematched.WecapturepackettracesatthenetworklinkconnectingourcampustotheISP.Thepackettracefilecontains406KTCPandUDPpacketswithnonzeropayloadsize.Thisre-alistictracefileisusedastheinputpacketstreamtotheDPIsystem.Wecreateacompilertoparseregularexpres-sionsandgeneratecorrespondingDFAs.WeruntheregularexpressionmatchingworkloadonaserverwithDualIntelXeonprocessors(4MBL2cache)and1GBmemory.Weinstrumenttheworkloadwithgprof[1]torecordthetimespentonthematchingprocess(excludingtheDFAcreation,bookkeeping,etc.)Then,weloadtheDFAsintoDPICOengineandprocessthesamepackettraceontheaccelera-torcard.Wecompareprocessingtimeinthetwoscenar-ios,takingintoaccounttheoverheadsuchaspassingpacketpayloadoverthePCIebus.Intheexperiment,thepacketsaretransferredoverPICebusinsequentialorder,notbeingcombinedinbatchestomaximizethebandwidthutilization.PayloadtransferandthepatternmatchingonFPGAarenotpipelined,whichcanbeoptimizedlater.TheseconservativesettingsimplytheworstperformanceofDPICObasedac-celerator.Weleavetheseoptimizationsforfuturework.Fig.11showstheresultsoftheexperiments.Theyaxisoftheplotisthetimespentonpatternmatching.ThelefttwobarsrepresenttheCPUprocessingtimeonthetwoSnortrulefiles,namelyexploit.rulesandweb-cgi.rules.TheprocessingtimeisthetimespentonsearchingtheDFAs.Therigh-mostfivebarsdepictthetimeusedbytheDPICOengines(inconfigurationsof1-4and8engines)togetherwiththeoverhead.InfactthemajorityoftheoverheadisthepayloadtransferoverPCIebus(16xspeedat8GB/s).ThedelayofDPICOenginesisdeterministicasexactlyonebyteisconsumedinonecycle,regardlessofrulesets.ThisfigureclearlyshowsthattheperformanceimprovementofaDPICOacceleratoronanx86/PCIeserverarchitecturecanreachupto10times.
201
Table4:SpeedofDPICOwithMultipleEngines
Non-pipelinedPipelinedNo.ofNo.offmaxTmaxPipe-fmaxTmaxMemoryDPICO(MHz)(Mbps)line(MHz)(Mbps)BanksEnginesStages21150.61204.82269.82158.422137.42198.42269.84316.823150.23604.82269.86475.224133.54272.02248.37945.628137.88819.22242.815539.241129.91039.23287.12296.842129.82076.83287.14593.643124.42985.63254.26100.844120.53856.03277.08864.048121.37763.23277.317747.281106.58524212.91703.28298.61577.64215.63449.68398.62366.44192.14610.48498.63155.24190.36089.688101.16470.44212.313587.216197.57804202.41619.216295.11521.64192.43078.416397.52340.04202.44857.616494.93036.84187.35993.616892.65926.44187.311987.2
5.3Discussions
2,000 1,800 1,600 1,400Time (ms)DPICOOverheadCPU 1,200 1,000 800 600 400 200exploitweb−cgiDPICO(1)DPICO(2)DPICO(3)DPICO(4)DPICO(8)TheDPICOdesigntakesadvantagesofsomefeaturesofmodernFPGAchips,butitsapplicabilityisnotlimitedtoFPGAs.WearguethatDPICOcanbeintegratedasanon-chipacceleratorofageneralpurposeCPU,althoughitisnotthefocusofthispaper.ThecontrollogicofDPICOissimplethusdoesnotdemandsignificantsiliconarea.On-chipDFAstoragecaneliminatetheoverheadoftransferingpayloadsoverperipheralbuses,however,on-chipmemorybanksingeneralpurposeCPUsareexpensiveinareaandpowercon-sumption.Correlationsamongmemorysize,performanceandpowerisworthystudying.Aspatternmatchingrulesetsexpand,thelimitedon-chipmemorymaygetfullyuti-lizedthusrequiringswappingunusedDFAstomainmemory.So,itisalsointerestingtoinvestigatethetrade-offsbetweenadedicatedon-chipDFAmemoryandasharedDFA/cachememory,andrelatedDFAreplacementpolicies.
06.CONCLUSION
Figure11:PerformanceImprovementtox86/PCIeServerArchitecture
InthispaperwepresentahighspeedDPIengine,DPICO,whichisshowntobemosteffectivewhentheaveragetran-sitionratioofaDFAissignificantlylessthan0.5.Thepro-posedtechniquescanreducethememoryusageofDFAsby90%.PerformanceevaluationresultsshowthatapipelinedimplementationswithmultipleDPICOenginescanreachto-talathroughputupto17.7GbpsincontemporaryFPGAde-vices.Experimentdataalsoshowupto10foldimprovementonpatternmatchingwhenincorporatinganDPICObasedacceleratortoanx86/PCIeserverarchitecturerunningDPIworkload.ThehighspeedandscalabilityofDPICOmakeitapromisingcandidateforawiderangeofDPIapplicationssuchasnetworkintrusiondetectionandspamfiltering.
202
Acknowledgment
ThisworkissupportedinpartbytheNationalScienceFoun-dationundergrantNo.CNS0709001andagrantfromIntelResearchCouncil.
7.REFERENCES
[1]Gnugprof.FreeSoftwareFoundation.
[2]TRE:POSIXCompliantRegularExpression
MatchingLibrary.http://laurikari.net/tre/.
[3]Virtex4familyoverview,January2007.Xilinx,Inc.
http://direct.xilinx.com/bvdocs/publications/ds112.pdf.[4]A.V.AhoandM.J.Corasick.Efficientstring
matching:anaidtobibliographicsearch.
CommunicationsoftheACM,18(6):333–340,1975.[5]F.Anjum,D.Subhadrabandhu,andS.Sarkar.
Signature-basedintrusiondetectionforwirelessad-hocnetworks:Acomparativestudyofvariousroutingprotocols.InIEEEVehicularTechnologyConference,October2003.
[6]M.BecchiandS.Cadambi.Memory-efficientregular
expressionsearchusingstatemerging.INFOCOM2007,pagespp.1064–1072,May2007.
[7]JoaoBispo,IoannisSourdis,JoaoM.P.Cardoso,and
StamatisVassiliadis.Synthesisofregularexpressionstargetingfpgas:Currentstatusandopenissues.InInt.WorkshoponAppliedReconfigurableComputing(ARC2007),pages179–190,Mangaratiba,Brazil,March2007.
[8]B.H.Bloom.Space-timetrade-offsinhashcoding
withallowableerrors.CommunicationsoftheACM,13(7):422–426,1970.
[9]R.S.BoyerandJ.S.Moore.Afaststringsearching
algorithm.CommunicationsoftheACM,20(10):762–772,1977.
[10]B.C.Brodie,R.K.Cytron,andD.E.Taylor.A
ScalableArchitectureforHigh-Throughput
Regular-ExpressionPatternMatching.InISCA,Boston,MA,June2006.
[11]S.DharmapurikarandJ.Lockwood.Fastandscalable
patternmatchingfornetworkintrusiondetectionsystems.IEEEJournalonSelectedAreasin
Communications,24(10):1781–1792,October2006.[12]S.Kumar,S.Dharmapurikar,P.Crowley,J.Turner,
andF.Yu.Algorithmstoacceleratemultipleregularexpressionmatchingfordeeppacketinspection.InSIGCOMM,Pisa,Italy,September2006.
[13]Cheng-HungLin,Chih-TsunHuang,Chang-Ping
Jiang,andShih-ChiehChang.Optimizationofregularexpressionpatternmatchingcircuitsonfpga.InDATE2006,Munich,Germany,2006.
[14]R.-T.Liu,N.-F.Huang,C.-H.Chen,andC.-N.Kao.
Afaststring-matchingalgorithmfornetwork
processor-basedintrusiondetectionsystems.ACMTransitionsonEmbeddedComputing,xx(12):614–633,2004.
[15]V.Paxson.Asystemfordetectingnetworkintruders
inreal-time.ComputerNetworks,
31(23-24):2435–2463–1792,December1999.[16]V.Paxson,K.Asanovic,S.Dharmapurikar,
J.Lockwood,R.Pang,R.Sommer,andN.Weaver.Rethinkinghardwaresupportfornetworkanalysisandintrusionprevention.InProc.USENIXHotSecurity,August2006.
[17]P.PiyachonandY.Luo.Efficientmemoryutilization
onnetworkprocessorsfordeeppacketinspection.InACMSymposiumonArchitectureforNetworkandCommunicationSystems,SanJose,CA,December2006.
[18]L.Schaelicke,B.MooreT.Slabach,andC.Freeland.
Characterizingtheperformanceofnetworkintrusiondetectionsensors.InProceedingsoftheSixthInternationalSymposiumonRecentAdvancesinIntrusionDetection(RAID2003),LNCS,Springer-Verlag,September2003.[19]Snort.http://www.snort.org/,2003.
[20]L.TanandT.Sherwood.ArchitecturesforBit-Split
StringScanninginIntrusionDetection.IEEEMicro:Micro’sTopPicksfromComputerArchitectureConferences,January-February2006.
[21]N.Weaver,V.Paxson,andJ.M.Gonzalez.The
shunt:Anfpga-basedacceleratorfornetworkintrusionprevention.InProceedingsofthe2007ACM/SIGDA15thinternationalsymposiumonFieldprogrammablegatearrays,Monterey,CA,2007.
[22]S.WuandU.Manber.Fasttextsearching:Allowing
errors.CommunicationsoftheACM,35(10):83–91,1992.
[23]F.Yu,Z.Chen,Y.Diao,T.V.Lakshman,andR.H.
Katz.Fastandmemory-efficientregularexpressionmatchingfordeeppacketinspection.InACMSymposiumonArchitectureforNetworkand
CommunicationSystems,SanJose,CA,December2006.
203
因篇幅问题不能全部显示,请点此查看更多更全内容