您的当前位置：首页 Communication Networks] Network Operations—Network monitoring General Terms

Communication Networks] Network Operations—Network monitoring General Terms

来源：爱问旅游网

DPICO:AHighSpeedDeepPacketInspectionEngine

UsingCompactFiniteAutomata

ChristopherL.Hayes

∗

DepartmentofElectricalandComputerEngineering

UniversityofMassachusettsLowell

Lowell,MA,01854USA

andYanLuo

hayesc3@rpi.edu,yanluo@uml.eduABSTRACT

DeepPacketInspection(DPI)hasbeenwidelyadoptedindetectingnetworkthreatssuchasintrusion,virusesandspam.Itischallenging,however,toachievehighspeedDPIduetotheexpandingrulesetsandeverincreasinglinerates.Akeyissueisthatthesizeoftheﬁniteautomatafallsbe-yondthecapacityofon-chipmemorythusincurringexpen-siveoﬀ-chipaccesses.InthispaperwepresentDPICO,ahardwarebasedDPIenginethatutilizesnoveltechniquestominimizethestoragerequirementsforﬁniteautomata.Thetechniquesproposedaremodiﬁedcontentaddressablemem-ory(mCAM),interleavedmemorybanks,anddatapacking.TheexperimentresultsshowthescalableperformanceofDPICOcanachieveupto17.7GbpsthroughputusingacontemporaryFPGAchip.ExperimentdataalsoshowthataDPICObasedacceleratorcanimprovethepatternmatch-ingperformanceofaDPIserverbyupto10times.

1.INTRODUCTION

CategoriesandSubjectDescriptors

C.2.0[ComputerCommunicationNetworks]:General—SecurityandProtection(e.g.,Firewalls);C.2.3[ComputerCommunicationNetworks]:NetworkOperations—Net-workmonitoring

GeneralTerms

Algorithms,Design,Performance,Security

Keywords

FiniteAutomata,ContentAddressableMemory,FPGA,In-trusionDetection

∗ChristopherHayeshassincemovedtoRensselaerPolytech-nicInstitute.

Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforproﬁtorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitationontheﬁrstpage.Tocopyotherwise,torepublish,topostonserversortoredistributetolists,requirespriorspeciﬁcpermissionand/orafee.

Computernetworksarebecomingincreasinglyvulnera-bletomanythreats,suchasintrusions,worms,virusesandspam.Whenthesethreats,whichcontinuouslyincreaseinnumberandcomplexity,arecoupledwithconstantlyincreas-ingnetworklinerates,thesituationbecomesmorediﬃculttocontain.DeepPacketInspection(DPI)playsanimpor-tantroleindetectingthreatsbysearchingthepayloadofnetworkpacketsforknownpatternsorsignatures[5,14].Itisalsoessentialintraﬃccharacterizationandﬁne-grainednetworkmonitoring.

ADPIsystemperformsasetoftime-criticaloperationstosensecertainnetworkpatternsorbehaviorwhiletryingtominimizepacketprocessinglatency.Theﬁrststepistocap-turepacketsfromnetworkinterfacecards,reassembleandbuﬀerthemforprocessing.Thesecondstepistosearchknownsignaturepatternsinthepayloadofpackets.Fol-lowingthat,aDPIsystemwillanalyzethematchedpacketsagainstsemantic-richpoliciestodetermineifanattackispresentoranalertshouldbetriggered.WearguethattheﬁrststepcanbespedupwithApplicationSpeciﬁcInte-gratedCircuits(ASICs)duetotheinvarianceofthetask,andthepolicyprocessingisbettercarriedoutwithgeneral-purposeprocessorsbecauseofthecomplexsemanticprocess-ing.Thematchingofsignaturepatternsfallsinthemiddleofthespectrum,andcanbeneﬁtfromprogrammablehard-wareaccelerationbasedonFPGAsornetworkprocessors.Theexpandingsignaturesetsandincreasinglinespeedhavemadethesignaturepatternmatchingchallenging.Forexample,therulesetofthewell-knownintrusiondetectionsystem,Snort[19],contains4219rulesasofDec2005,andnewrulesareaddedconstantly.Variablepatternlengthandlocation,andincreasinglylargerulesetsmakepatternmatchingadiﬃculttask.Manystringmatchingalgorithmsexist,suchasthoseduetoBoyer-Moore[9],Wu-Manber[22],andAho-Corasick[4].Asmoregeneralcasesofﬁxedstrings,regularexpressionsusedtodepictpatternsignaturesfurthercomplicatethetaskduetopossibleexponentialnumberofstates.Therehavebeennumerousstudiesonregularexpres-sionmatching[10,23,12,2].DiﬀerentplatformshavebeenusedtoperformDPI,includingASICs[20],FPGAs[11]andnetworkprocessors[17].Mostofthesemethodsrelyonstatemachinesorﬁniteautomatatomatchpatterns.However,oneofthekeyissuesisthatthesizeoftheﬁniteautomataissolarge(oftenatleasttensofmegabytes)thattheyhavetobestoredinoﬀ-chipmemorymodules.Asaresult,thesearchingontheautomataincursalargenumberofoﬀ-chip

195

memoryaccessesthatleadtounsatisfactoryperformance.Thusitisimportanttominimizethestoragerequirementsofstatemachinessuchthattheycanbeputintoon-chiporotherfastmemorymodulestoachievehigh-speedpatternmatching.

Wearemotivatedtostudyeﬃcientmemoryutilizationofprogrammablehardwarearchitecturetoimprovetheper-formanceofsignaturepatternmatching.Inthispaper,wepresentDPICO,ahardwarebasedDPIenginethatutilizesnoveltechniquestoreducethestorageforﬁniteautomata.WeﬁrstanalyzethestorageoftraditionalDeterministicFi-niteAutomata(DFA)andpresentthebaselinedesign.Tak-ingadvantageofthemulti-portedon-chipmemorybanksonmodernASICsandFPGAs,wethenusethreetechniquestoreducetheredundantinformation:(1)modiﬁedcontentad-dressablememory(mCAM),(2)interleavedmemorybanks,and(3)datapacking.Thecombinationoftheseconceptsallowsdatatobeorganizedinamoreeﬃcientway.Ouranalysisshowthatthestoragecanbereducedbyover90%.WehaveimplementedanFPGAbasedprototypeincorpo-ratingtheproposedtechniques.TheDPIperformanceofthesystemisshowntobescalableandreachupto17.7GbpsusingacontemporaryFPGAchip.WeevaluatetheperformanceimpactofincorporatingaDPICObasedaccel-eratorinanx86/PCIeserverarchitecture.ThedatashowthatthepatternmatchingperformanceoftheDPIservercanbeimprovedbyupto10times.

Thecontributionsofourworkareasfollows.WepresenteﬀectivetechniquestostorecompactﬁniteautomatainFP-GAs,takingadvantageoftheirarchitecturalfeatures.OurscalabledesignprovidesanempiricalbasisforarchitectinghighperformanceDPIsystems,wherethepatternmatchingisanimportantprocessingphase.Inaddition,ﬁniteau-tomataisnecessaryinupperlevelsemanticpolicyprocess-ing,thusourapproachcanbeextendedtospeedupotherstatefulpacketinspectionoperations.

Thepaperisorganizedasfollows:Section2reviewsre-latedwork,Section3describesamotivatingexampleandthebaselinedesign,Section4elaboratestheproposedtech-niquestoreducethestorageofDFA,andtheperformanceresultsarepresentedinSection5.Finally,thepaperiscon-cludedinSection6.

2.RELATEDWORK

StringmatchingtechniquessuchasBloomFilters[8]andBoyer-Moore[9],Wu-Manber[22],andAho-Corasick[4]al-gorithmshavebeenthefoundationformanysignature-baseddetectionenginesovermanyyears.Someoftheseconceptshaveevenbeenexpandedtosearchagainstregularexpres-sionsinsteadofﬁxedstrings[10,23].Regularexpressionsin-creasethecapabilityandmaintainabilityofthreatdetectionsystemsbyincreasingtheﬂexibilityofthreatdeﬁnitions.Schaelickeetal.characterizedtheperformanceoftheSnortsoftwareongeneral-purposeprocessors.Theyshowedthattheirhighestperformancetestsystemcouldonlysi-multaneouslyhandle217payloadrulesonanetworkrun-ningata100Mbpsrate[18].Morerecently,DregeretalprovidedinsightstotheperformancelimitationsofaBroNIDS[15]onacommodityPCandexploredwaystomitigateitsresourcedemands.Throughextensiveexperiments,theyshowedthatGbpsnetworkintrusiondetectionratecannotbeachievedwithoutcarefullytuningthesystem.Clearly,theperformanceofnetworkintrusiondetectionimplemented

ongeneral-purposeprocessorsisdeﬁcientwhenconsideringincreasingspeeddemands.

Paxsonetalproposedtorethinkthehardwaresupportforeﬃcientnetworkanalysisandintrusionprevention[16].Theydescribedahigh-levelnetworksecurityanalysispipeline,wheretheprotocolanalysisisoneofthestages.Patternmatchingagainstpacketpayloadisindispensabletoclassifyﬂowsasincreasinglymoreapplicationscannotbereliablyidentiﬁedbymerelyexaminingtransportportnumbers.Manysignature-basedsystemshavebeenarchitectedfortheFPGA[21]andASIC[10],takingadvantageofthepar-allelstructuresavailableinthesedevices.ThesedesignsarepredominantlybasedontheAho-Corasickalgorithmorotherﬁniteautomaton-likestructures.Thesearchitectureshaveincrementallyimprovedthespeedorstorageutiliza-tionofsignaturematchingthroughmodiﬁcationoftheim-plementation.

Theseimplementation-basedimprovementsarecomple-mentedbyalgorithmicimprovementsdirectedatmodifyingtheﬁniteautomatathemselves.Kumaretal.[12]modiﬁedaDFAtocombinecommonoutputtransitionsofindivid-ualstatesbycreatingadefaulttransitionbetweenthosestates.ThismodiﬁedDFA,knownasthedelayed-inputde-terministicﬁniteautomataorD2FAisfoundtoreducetheDFAstorageofapartitionedCiscorulesetfrom92MBto2MB[12].Thissigniﬁcantsavingsbeginstobringﬁ-niteautomatonprocessingoflargerulesetsintoarangethatisconceivableonanFPGAornetworkprocessor.Thisstorageimprovement,however,isprovidedatthecostofthedelayedinputbehavior,reducingtheaverageprocessingthroughput.Alongwithtransitionreduction,statemergingtechniqueshavealsobeenproposed[6].

LinetalproposedtoreduceFPGAlogicsusedtomatchregularexpressionsthroughsharingcommonsub-regularex-pressions[13].TheirmethodissimilartoanumberofNon-deterministicFiniteAutomata(NFA)basedapproaches[7],whosedrawbackisthatthosedesignscannoteasilyaccom-modatenewsignaturepatterns.Anewcompilation-synthesis-placement-downloadprocedureisneededbecauseregularex-pressionsare“hardcoded”intoFPGAlogicelements.Inthispaper,wefocuson“memorybased”ﬁniteautomatabecauseofitssigniﬁcantadvantageover“logicbased”ﬁniteautomata:thememorybasedFAcanbeeasilyupdatedtoincorporatenewsignaturepatternswithoutreprogrammingFPGAs.

Therearetwopopularcategoriesofﬁniteautomata:Non-deterministicFiniteAutomata(NFA)andDeterministicFi-niteAutomata(DFA)[23].Theydiﬀersigniﬁcantlyinthecomplexityofstorageandsearching.ThespacecomplexityofNFAisO(n)anditssearchingcomplexityisO(n2),whileDFA’ssearchingcomplexityisO(1)atthecostofspacecom-plexityO(2n).ForDPIsystems,DFAisthepreferredstatemachineespeciallyfordelaysensitivenetworkapplications,thusitisthefocusofthispaper.

3.3.1

MOTIVATIONANDBASELINEDESIGNMotivatingExample

Weassumethataspeciﬁcnetworkattackispresentwhenthestring“root”isfoundinthepayloadofapacket.Sec-ondly,weassumeadiﬀerenttypeofattackisbeingprose-cutedwhenthestring“rmdir”isfoundinapacket.Eachofthesestringsformthefoundationofaruleandtogether

196

d|i|m|o|t0d|i|trm2dd|i|m|o|t3i4d|i|m|o|t5rr8rrrrr1ord|i|m|td|i|m|oo7td|i|m|o|timumamountofmemory(S)neededforaﬁniteautomatonisderivedfromthenumberofstatesintheautomatonandthenumberofstringsorregularexpressionsthatcanbematched.

S=2bNS󰀄lg(NS)󰀅+NS󰀄lg(NM+1)󰀅

wherebisthenumberofbitsintheinputcharacter,NSisthenumberofstatesintheautomaton,andNMisthenumberofMatchIDsintheautomaton.

Iftheinputtothestatemachinewerea3-bitcharacter(toencodethealphabetinthemotivatingexample),thebaselinedesignwouldrequire72locationsinmemory;theproductofthenumberofpossibleinputcharactersbythenumberofstates.Someoftheinformationstored,however,isredundantasthereareonly24uniquetransitionsstoredinthose72locations.Itisthroughtheeliminationofthisredundancythatweachieveincreasedstorageeﬃciency.

i|m|o|td|m|o|t64.

the

Regular

Expression

PROPOSEDTECHNIQUES

Figure1:DFA{root|rmdir}.

for

theymakearulesetthatcanbeusedoneverypackettoﬁlterbothtypesofattacks.

Here,wecompiletherulesetintoaDFAthatallowstherulestringstobefoundonatanypointinthedatastream,evenwhenmultiplestringsoverlap.Figure1showsaDFAfortheregularexpression{root|rmdir}withtheinputal-phabetΣ={d,i,m,o,r,t}.TheDFAcontainsninestates,twoofwhichareacceptingstates,and24distincttransitions.Whenacceptingstateﬁveisreached,thestring“rmdir”hasbeenmatched,andlikewise,whenacceptingstateeightisreached,thestring“root”hasbeenmatched.

3.2Baseline

WeuseatraditionalDFAimplementationasabaselineforperformancecomparison.Inthismethod,thestatema-chineisimplementedintwomemorymodules.Onememorymodulecontainsthestatetransitiontableandthesecondmodulecontainsthematchidentiﬁers.Thecurrentstatepointerandtheinputcharacterarecombinedandusedastheaddresstolookupthenextstatepointer.Thecurrentstatepointerisalsousedasanaddresstolookupthematchidentiﬁerforthecurrentstate.

Input CharTransition Table (RAM)Current State PtrOutput Table (RAM)Match IDIngeneral,multipleregularexpressionsareencodedintoasingleﬁniteautomaton.Whenanacceptingstateoftheautomatonisreached,aregularexpressionismatched.EachacceptingstateisgivenaMatchIDtosignifywhichregularexpressionhasbeenmatched.IftheMatchIDisallzeros,thecurrentstateisnotanacceptingstateandhencethereisnoregularexpressionmatch.

Inourdesign,weassumethatthestatetransitiontablefortheﬁniteautomatawillbestoredinauniformcostRAMasshowninFig.2.Eachstateofaﬁniteautomatonconsistsofoneormorenext-statetransitions.Atransitionwillrequireexactlyonelocationinmemory.StatesintheDFAcancon-tainavariablenumberofnext-statetransitions.Next-statetransitionsforanygivenstatewillbestoredcontiguouslyinthememory,andlikewise,statesthemselveswillbestoredcontiguously.

Insuchadesign,wetakeadvantageofthreetechniquestoallowforthereductionofredundantinformation:(1)modi-ﬁedContentAddressableMemory(mCAM),(2)interleavedmemorybanks,and(3)datapacking.Thecombinationoftheseconceptsallowsdatatobeorganizedinamoreeﬃcientway.

4.1ModiﬁedContentAddressableMemory

Figure2:BaselineStateMachineBlockDiagramWhenthismethodisused,eachstateisgivenaﬁxedandconstantamountofmemory.Thissectionofmem-orycontainsalocationforeachpossibleinputcharacter.Forinstance,iftheinputcharacterwereeightbitswide,eachstatewouldrequireenoughmemorytocontain256(28)next-statetransitions.Whenusingthistechnique,themin-

ForagivenDFA,thememoryinthebaselinedesignwillcontainrepeateddata.Forinstance,agivenstatewillhavemanytransitionsthathavethesamenext-statepointer.Sincethebaselinedesignexplicitlystoreseverytransition,therepetitionofdataisinherenttothearchitecture.Toremovethepointerrepetition,itisnecessarytomodifythenext-statelookuptechnique.Thisisaccomplishedbycre-atingtwotypesoftransitions:labeledtransitionsandde-faulttransitions.TotranslateanexplicitstatetransitiontableforaDFA,whereeverypossibletransitionisstored,toanimplicitone,wetakethemostfrequentnext-statepointerforagivenstateandmakeitthedefaulttransition.Allothertransitionsareconvertedintolabeledtransitions.Eachstatecancontainuptoonelabeledtransitionforeachlabelandeachstatewillcontainexactlyonedefaulttransition.

Thisdefault/labeledtransitiontechniquecanbeappliedtoallthedeterministicﬁniteautomata(DFA).SomeDFAvariantssuchasthedelayed-inputﬁniteautomaton(D2FA[12]),haveinherentdefaulttransitions.Thesetransitionsarediﬀerent,however,becausetheycausetheinputtobe

197

delayedwhentheyaretaken.Statesthatdonothavede-layeddefaulttransitionscanbegivenanon-delayeddefaulttransitionthatcanbeselectedinthesamemannerastheDFA.Inthiscase,thedelayedandnon-delayeddefaulttran-sitionsneedtobedistinguished.

Usingthesetransitions,weconstructamodiﬁedCAMstructure(mCAM)whereeachstatehasitsownassociativememory.Fig.3depictsasetofstatesstoredinmCAM.La-belsfromthelabeledtransitionsarethekeystotheCAMandthenextstatepointerinformationisthedataintheCAM.Thedefaulttransitionandstateinformationarealsostoredinthememoryandarenotaccessedassociatively.Thesizeofstatesisnon-uniformthusithastobestoredwithineachstateforidentifyingtheboundaries.Thevari-ablesizeofstatesmakesthestateaccessandtransitionnon-trivial.

Key nmData nmtomatontothetotalnumberofpossibletransitions,2b.Asthatratiodecreases,thepotentialstoragesavingsincreases.Theminimummemoryrequiredcanbecalculatedas

Seff=NS(󰀄lg(NM+1)󰀅)+NT(󰀄lg(NT)󰀅+b)

,NT≥NS

2bNS

Key 2Data 2Key 1Data 1CAM[m] SizeDefault Data [m]Key n2Key 2Key 1CAM[2] SizeKey n1Key 2Key 1CAM[1] SizeData n2whereNTisthenumberoftransitionsintheautomaton.Fromthis,weevaluatethediﬀerencesinminimumstor-agerequirementstoseewherethismethodismosteﬀective.TheresultsareshowninFigure4wherexaxisisthenum-berofstatesintheDFAandyaxisisthepercentageofmemoryreduction((S−Seff)/S).Theproposedmethodismosteﬀectivewhentheﬁniteautomatawithlessaveragetransitionsperstate.Also,anyﬁniteautomatonwithara-tioabove0.5doesnotbeneﬁtfromthismethodsincetheoverheadofstoringlabelsinthememoryoutweighsthesav-ingsfromstoringlesstransitions.RulesetsusedinrecentresearchbyKumaretal.[12]resultintransitionratiosbe-tween0.01and0.09whentheserulesetsimplementedasaDFA.Mostoftheserulesetsalsohavetransitionratioslessthan0.5whenimplementedasD2FA.ThustheproposedmodiﬁedCAMstructurehaspotentialtoreduceboththeDFAandD2FAstoragesigniﬁcantly.

100

……Data 2Data 1Default Data [2]Data n1Data 2Data 1Default Data [1]…… 50

Percent Improvement 0

Figure3:mCAMDiagram

Alltransitionsforastate(defaultandlabeled)arereadfromthememory.Thelabelsarecomparedagainstthein-putcharacter.Iftheinputcharactermatchesoneofthela-belsonalabeledtransition,itsnext-statepointeristaken,otherwise,thenext-statepointerforthedefaulttransitionistaken.Itiscleartoseethatthistechniquerequiresalotofmemoryaccessesinordertoretrievethepointerforthenextstateifastatecontainsahighnumberofuniquenext-statetransitions.Later,wewillseethatthisissuecanbeaddressedintheFPGAimplementationsuchthatthenext-statepointercanberesolvedinasingleclockcycle.ThismCAMtechniquehasbothadvantagesanddisad-vantages.Thenumberoflocationsrequiredforthestateisreduced,however,thelabeledtransitionsrequirestorageofthenextstatepointeralongwiththelabel.Secondly,thestatenolongerrequiresconstantspaceinthememoryandcanbepackedmoretightlyinmemory.Thisimpliesthatthenextstatepointerswouldneedtoincreaseinsizetoaccountforthefactthatstatesdonotoccuronregularboundaries.Finally,thematchidentiﬁerforthestatemustbestored.Sinceeverystatecontainsexactlyonedefaulttransition,thematchidentiﬁerisstoredwiththedefaulttransition.

Theminimumamountofmemoryrequiredforthismethodisbasedonthenumberofstates,matchidentiﬁersandla-beledtransitions.Inthiscase,theeﬀectivestoragesavingsfromstraightforwardtechniquewillbedictatedbytheratio,r,oftheaveragenumberoftransitionsperstatefortheau-

-50

-100

Ratio = 0.062500Ratio = 0.125000Ratio = 0.250000Ratio = 0.375000Ratio = 0.500000

1000

2000

3000

4000 5000 6000

Number of States

7000

8000

9000

10000

-150

Figure4:StorageImprovementwithVariedTransi-tionRatios(b=8)

4.2InterleavedMemoryBanks

TheCAMimplementationwillneedtoanalyzeallofthelabeledtransitionsandthedefaulttransitionssimultane-ouslyinordertoevaluateastateinasinglecycle.CurrentFPGAproductsgenerallycontainhundredsofindividualmemoryelementson-chip.Usingtheseelements,wepro-posethatthestatetransitioninformationbeorganizedintosequentiallyinterleavedmemorybanks(showninFig.5)totakeadvantageofthepotentialmemorybandwidthavail-ableinFPGAdevices.Ifthenumberofbanks,n,isgreaterthanorequaltothenumberofnext-statetransitionsforthelargeststate(Nmax),allthetransitionscanbereadsimul-taneouslyandprocessedinparallel.Thisarchitecturewillalloweachstatetobeprocessedwithinasingleevaluationcycle.Ifthenumberofnext-statetransitionsexceedsn,a

󰀅cycles,whichtransitionwilltakeaconstanttimeof󰀄Nmaxncanbeaddressedthroughpipelineddesign.Infact,on-chip

198

256andmorememorybankshavebecomecommoninmod-ernFPGAs[3],whichguaranteeone-cyclestateevaluationandtransition.

Phys. Addr3210Bank 0Location 12Location 8Location 4Location 0Bank 1Location 13Location 9Location 5Location 1Bank 2Location 14Location 10Location 6Location 2Bank 3D...Calc AddrROM 0Calc AddrROM 1...Calc AddrROM n-1Location 15Location 11Location 7Location 3QRAM 0RAM 1...RAM n-1State Start OffsetSelect Default Transition(Mux)Match IDInput Char.Select Labeled TransitionState End OffsetFigure5:SequentiallyInterleavedMemoryBanks(n=4).

Oneachreadcycle,nwordsarereadinparallelfromtheinterleavedmemory.Anylocationinanybankcanbethestartaddressforthen-wordread.Forexample,areadac-cesscanretrievewords5,6,7and8,whichspanintwocontinuousrows.Sinceastatecanhavelessthanntransi-tions,thesizeofthestatemustbeencodedsothatthereadlogiccanignoreunnecessaryinformation.Here,weproposetoinserttheendoﬀsetofthestatetotheﬁrstlocationofthestatesothattheframingcanbedeterminedappropriately.Eachlocationinthememoryeithercontainsadefaulttransitionoralabeledtransition.Thelocationwithala-beledtransitionwillcontainthelabelandthenextstatepointerassociatedwiththetransition.Thelocationwithadefaulttransitionwillcontainthenextstatepointerforthedefaulttransitionandnecessarystateinformationsincethereisonedefaulttransitionperstate.Thestateinforma-tionincludestheoﬀsetofthelasttransitionforthestateandthematchidentiﬁerforthestate.Sinceeverystatehasadefaulttransition,itisplacedastheﬁrstlocationforthestate.Figure6showsthestorageoflabeledtransitionsanddefaulttransitions.Thebitrangeofeachﬁeldisindicatedintheﬁgure.

Labeled Transitionlg(NT)+b-1

Next State Pointerb-1

Label0

Select Address (Mux)Figure7:BlockDiagramofDPICO.

4.3OperationofDPICO

Sincenisequaltoorlargerthanthenumberoftransitionsforthestate,allthetransitionsarereadsimultaneouslyfrommemorybanks0throughn-1.Thetransitionsarethenpro-cessedtoidentifythepropernextstateusingselectionlogic.ThelogicofselectinglabeledtransitionisshowninFig.8.AllthetransitionsreadfromtheRAMarequaliﬁedbasedonthebeginningoﬀsetofthestateandtheendoﬀsetreadfromthedefaulttransition.Thememoryoutputsoutsideoftherangearesimplyignored,andonlythequaliﬁedla-beledtransitionsarecomparedagainsttheinputcharacter.Ifamatchexists,amultiplexerselectstheoutputoftheRAMthatcontainsthatlabel.Ifthereisnomatchthenextstatepointerfromthedefaulttransitionisused.Oneachreadcycle,thematchidentiﬁer,whichisalsocontainedinthedefaulttransitionlocation,isoutputfromthestatema-chine.Thedelayofaddresscalculation,memoryaccessandnextstateselectionlogicdeterminestheclockcycletime.Sincetheengineconsumesoneinputcharacterperclockcy-cle,themaximumprocessingthroughputofthisapproachforaDFAissimplytheproductoftheinputcharactersizeandtheprocessingclockfrequency:T=b·fmax.

Transition LabelsDefault Transitionlg(NM)+lg(NT)+b-1

MatchIDlg(NT)+b-1

Next State Pointerb-1

EndLocnState Start OffsetState End Offset.........Next State Pointers0

Input Char.Compare Labels...Qualify LabelsFigure6:Storageofdefaulttransitionandlabeledtransition.

Usingtheseconcepts,weproposeageneralarchitectureforperformingregularexpressionmatching.Fig.7depictstheblockdiagramoftheDPIenginecalledDPICO.Thearchitectureconsistsofcomponentsforaddresscalculation,transitionstorage,labelcomparisonanddeterminingnextstateaddress.Thetransitionsarestoredinasequentiallyinterleavedmemoryasdiscussedpreviously.Eachmemorybankhasanaddresscalculationunittogenerateproperbankaddressbasedonthecurrentstateaddress.Theinputchar-acteriscomparedagainstthelabelsofthecurrentstatestatethroughasetofselectionlogic(bottomportionofFig.7.)

EncodeMultiplexerNext State Ptr. ValidSelected Next State PointerFigure8:DiagramofSelectingLabeledTransition

4.4PackingAcrossMemoryBoundaries

Inpracticalapplication,theprevioussolutiondoesnotequaltheminimummemoryutilizationdescribedasSeff.Instead,theactualmemorysize(Sact)isdrivenbythesize

199

PhysicalLogicalBankAddrAddr55554444333322221111000032103210321032103210321023222120191817161514131211109876543210UnusedEndB: 3UnusedUnusedEndB: 1UnusedUnusedEndB: 2UnusedEndB: 3UnusedEndB: 1UnusedUnusedEndB: 3UnusedUnusedEndB: 0UnusedUnusedUnusedEndB: 1UnusedEndB: 1Ptr: 2Ptr: 0Ptr: 2Ptr: 22Ptr: 0Ptr: 2Ptr: 19Ptr: 0Ptr: 2Ptr: 0Ptr: 14Ptr: 0Ptr: 2Ptr: 12Ptr: 0Ptr: 2Ptr: 9Ptr: 0Ptr: 2Ptr: 16Ptr: 6Ptr: 0Ptr: 2Ptr: 0Label: rState 8MatchID: 2Label: rLabel: tMatchID: 0Label: rLabel: oMatchID: 0Label: rMatchID: 1Label: rState 4MatchID: 0Label: rLabel: iMatchID: 0Label: rLabel: dMatchID: 0Label: rLabel: oState 1Label: mMatchID: 0Label: rState 0MatchID: 0State 2State 3State 5State 6State 7Addr13

1211109876543210

UnusedLabeled-Trans. 5Labeled Trans. 4Labeled Trans. 3Labeled-Trans. 2Labeled Trans. 1Default TransitionLabeled Trans. 3Labeled-Trans. 2Labeled Trans. 1Default TransitionUnusedLabeled Trans. 7Labeled Trans. 6Labeled-Trans. 5Labeled Trans. 4Labeled Trans. 3Labeled-Labeled Trans. 1Trans. 2Default TransitionState 2

State 1

State 0

Figure10:ExamplePackedMemoryMap

Table1:CompressionResults

RuleSetimapftpnetbiosnntpexploitNo.ofRules467663313122BaselineSize(kbits)16,526.811,448.42,146.77,820.855,270.1DPICOUnpackedSize(kbits)698.4522.265.0322.67,182.9DPICOMin.Size(kbits)557.8408.753.1262.54884.0Trans.Ratio(r)0.0180.0170.0110.0170.046Pct.Savings96.596.497.596.691.2Figure9:Memoryorganizationfor{root|rmdir}.ofamemorylocation,L,whichisdeterminedbythesizeofthedefaulttransition,SD,andthelabeledtransition,SL.

Sact=L·NT,L=max(SD,SL)

Fig.9showsthememoryorganizationofthemotivatingexample.Itcanbeseenthatsomeunusedbitsarepresentineverylabeltransition.Thewastedmemoryspace(denotedas“unused”inFig.9)comesmainlyfromthedisparityinthesizeofthedefaultandlabeledtransitionsandtheas-sumptionthateachtransitionwouldoccupyonelocationinmemory.Thelabeledtransitionstendtohavelessinforma-tionandthereforegenerallysmallerthandefaulttransitions.Onemethodtobetterapproachtheminimummemorycalculationistopackthelabeledtransitionsacrossmemorylocationboundaries.Figure10showsanexampleofthepackingstrategyforastatemachinewithvariednumbersoftransitionsperstate.Thelargerdefaulttransitionoccupiesafullmemorylocationandisalwayspositionedonregularmemorylocationboundaries.Labeledtransitionsarepackedbetweendefaulttransitions.Somespacemaybeunusedforastate.Thisunusedspace,however,tendstobesmallerthantheunusedspaceforanunpackeddesign.Thispackingdoesnotincreasethecomplexityandlatencyofselectingthenextstatebecauseitonlychangeswhichbitsarefedtothecomparisonlogicandallbitsarereadsimultaneouslyanyway.Thememorybandwidthissavedeﬀectively.

impactofincorporatingsuchapatternmatchingaccelera-tortoacommodityserverarchitecturesinceourimmediategoalistospeedupthepatternmatchingphaseoftheDPIworkloadwithanaccelerationcard.WethendiscusstheapplicabilityofDPICOason-chipaccelerators.

5.1PerformanceCharacteristicsofDPICO

5.PERFORMANCEEVALUATION

WeevaluateourDPICOengineintwoaspects.Firstwestudytheeﬀectivenessofthedesignandexplorethedesignspace.Particularlyweapplypipeliningtechniquetoopti-mizethedesign.Second,weinvestigatetheperformance

WecharacterizedtheperformanceoftheDPICOdesignintwoways-(1)bymeasuringthecompressionfactoragainstaparticularDPIrulesetand(2)byﬁndingthespeedandsizingcharacteristicsofthedesignwhenimplementedinanFPGA.Thissubsectionsummarizesthoseresults.

Toshowcompressionresults,webeginwithasubsetoftheSnortruleset.Wechooseﬁverepresentativeruleﬁles,namelyimap,ftp,netbios,nntpandexploitbecausetheirsizeisrelativelylargeinthewholeset.TheruleﬁlesareconvertedintoDFAandD2FA,inthememorytableformatofthebaselinedesign.Next,theserulesetswereconvertedintothetableformatoftheDPICOdesign.Thesizingre-sultsarefoundinTable1whichshowsoverallDPICOcanreducememoryusagebyover90%forallrulesetsunderstudy.Datapackingcanreduceabout20-30%ofthestor-age.

TheDPICOhasbeenwritteninVHDLandtargetedforaXilinxVirtex4SX35FPGA.ThedesignwassimulatedusingaVHDLsimulator,synthesizedwithSynplicitySynplifyandimplementedusingtheXilinxISEtoolchain.

ThebaselineDFAdesignwasfoundtooperateatamaxi-mumof267.7MHzusingalmostnologicresourcesincludingLookUpTable(LUT)andRegister(REG).Table2showsthespeedandsizingresultsfromthestoragesavingdesign

200

Table2:SpeedandSizeoftheDPICO

No.ofMemoryBanks248163264128256LUT11418332069816723541765916052REG1292043526421252234648109563fmax(MHz)144.9122.3106.998.784.878.974.068.1Tmax(Mbps)1159.2978.4855.2789.6678.4631.2592.0544.8showsthenumberofDPICOenginesimplementedonthechip,i.e.thenumberofpacketﬂowsthatareprocessedsi-multaneously.Wealsocomparethenon-pipelineddesignwiththepipelineddesign.ItcanbeobservedthatmultipleDPICOengineshavelittlenegativeimpactontheopera-tionfrequencyofeachengine.Thetotalthroughputofthepatternmatchingprocessingisscalabletothenumberofenginesonachip.WithacontemporaryFPGAchip,theperformancecanreach17.7Gbps.Again,fourbanksgivethebestoverallperformanceduetothecomplexityofcompar-isonlogics.TheresultsalsoimplythepotentialofDPICOonmoreadvancedFPGAchips.

Table3:SpeedandSizeofPipelinedDPICO

No.ofMemoryBanks24816AddedPipelineStages

1223LUTREGfmax(MHz)247.8307.4251.9271.9Tmax(Mbps)1982.42459.22015.22175.2Non-pipelinedTmax(Mbps)1159.2978.4855.2789.65.2

PerformanceImprovementtoaDPIServerArchitecture

102189336596149253418889whenthedataisunpackedandtheautomatonisaDFA.Itisassumedthatthenumberofparallelreadsisgreaterthanorequaltothemaximumnumberoftransitions,bothdefaultandlabeled,inanystate.Thethroughputgivenassumes8-bitinputcharactersareused.Theresultsshowthattheclockfrequencydecreasesasthenumberofbanksincrease.ThisisbecausethepropagationdelayofthelogicencompassedintheSelectLabeledTransitionBlockandtheSelectDefaultTransitionBlockmustincreaseasthenum-berofbanksincreases.Theycontainmultiplexersthatmustbesizedappropriatelytoselectdatafromtheappropriatenumberofbanks.Largermultiplexershavelongerpropaga-tiondelays,subsequentlydecreasingtheclockfrequencyatwhichthedesigncanoperate.

WethenaddpipelinestagestotheDPICOdesigntomin-imizethepropagationdelaybetweenclockedelements(reg-istersorblockmemory),thusincreasingthemaximumclockfrequency.Whenpipelinestagesareadded,thestatema-chinecanprocessmultipletime-multiplexedstreamsofdata.Thenumberoftimeslotsisequaltothenumberofpipelinestages,andthebandwidthofeachindividualdatastreamisthetotalbandwidthofthestatemachinedividedbythenumberofclockedstages.Thepipelineddesignpresentsatradeoﬀbetweenmaximumprocessingbandwidthfortheengine,whichincreaseswhenthenumberofpipelinestagesincreases,versustheprocessingbandwidthforeachstream,whichdecreasesasthenumberofpipelinestagesincreases.Table3showstheresultsforvariedpipelineddesigns.Eachrowofthetablehasdiﬀerentnumberofmemorybanksandpipelinestages.Itcanbeseenthatpipelineddesignbringssigniﬁcantimprovementovernon-pipelineddesign.There-ducedclockfrequencywithmorememorybanksaﬀecttheoverallperformanceofpipelineddesigns:4banksofmemoryoutperformtheotherconﬁgurations.

DPICOenginescanprovidescalableperformancebyin-corporatingmultipleenginesononeFPGAchiptoexploitpacketlevelparallelism.Weimplementsuchamulti-enginedesigninwhicheachenginehandlesapacketﬂowindepen-dently.DetailedperformanceresultsareshowninTable4.Theﬁrstcolumnisthenumberofmemorybanksundertestincluding2,4,8and16.Thesecondcolumnofthetable

OurimmediategoalistoincorporateDPICOaccelera-tioncardintoacommodityserverarchitecturewithx86coresandPCIExpressbuses.TheDPICOenginesitsintheFPGAonthePCIeacceleratorandDFAsarepreloaded.ADPIsystemsuchasSnortandBrorunsonx86cores,whichsendpacketstreamstotheacceleratorforpatternmatch-ing.ThepackettransferisoverthePCIebususingDMAoperations.TheDPICOenginesearchthepayloadagainstDFAsandsendstheresultsofpatternmatchingbacktothex86cores.We’dliketostudythepotentialofperformanceimprovementofsuchaconﬁguration.

Wesetuptheexperimentasfollows.WeuserealisticrulesfromSnortrulesetastheregularexpressionstobematched.WecapturepackettracesatthenetworklinkconnectingourcampustotheISP.Thepackettraceﬁlecontains406KTCPandUDPpacketswithnonzeropayloadsize.Thisre-alistictraceﬁleisusedastheinputpacketstreamtotheDPIsystem.Wecreateacompilertoparseregularexpres-sionsandgeneratecorrespondingDFAs.WeruntheregularexpressionmatchingworkloadonaserverwithDualIntelXeonprocessors(4MBL2cache)and1GBmemory.Weinstrumenttheworkloadwithgprof[1]torecordthetimespentonthematchingprocess(excludingtheDFAcreation,bookkeeping,etc.)Then,weloadtheDFAsintoDPICOengineandprocessthesamepackettraceontheaccelera-torcard.Wecompareprocessingtimeinthetwoscenar-ios,takingintoaccounttheoverheadsuchaspassingpacketpayloadoverthePCIebus.Intheexperiment,thepacketsaretransferredoverPICebusinsequentialorder,notbeingcombinedinbatchestomaximizethebandwidthutilization.PayloadtransferandthepatternmatchingonFPGAarenotpipelined,whichcanbeoptimizedlater.TheseconservativesettingsimplytheworstperformanceofDPICObasedac-celerator.Weleavetheseoptimizationsforfuturework.Fig.11showstheresultsoftheexperiments.Theyaxisoftheplotisthetimespentonpatternmatching.ThelefttwobarsrepresenttheCPUprocessingtimeonthetwoSnortruleﬁles,namelyexploit.rulesandweb-cgi.rules.TheprocessingtimeisthetimespentonsearchingtheDFAs.Therigh-mostﬁvebarsdepictthetimeusedbytheDPICOengines(inconﬁgurationsof1-4and8engines)togetherwiththeoverhead.InfactthemajorityoftheoverheadisthepayloadtransferoverPCIebus(16xspeedat8GB/s).ThedelayofDPICOenginesisdeterministicasexactlyonebyteisconsumedinonecycle,regardlessofrulesets.ThisﬁgureclearlyshowsthattheperformanceimprovementofaDPICOacceleratoronanx86/PCIeserverarchitecturecanreachupto10times.

201

Table4:SpeedofDPICOwithMultipleEngines

Non-pipelinedPipelinedNo.ofNo.offmaxTmaxPipe-fmaxTmaxMemoryDPICO(MHz)(Mbps)line(MHz)(Mbps)BanksEnginesStages21150.61204.82269.82158.422137.42198.42269.84316.823150.23604.82269.86475.224133.54272.02248.37945.628137.88819.22242.815539.241129.91039.23287.12296.842129.82076.83287.14593.643124.42985.63254.26100.844120.53856.03277.08864.048121.37763.23277.317747.281106.58524212.91703.28298.61577.64215.63449.68398.62366.44192.14610.48498.63155.24190.36089.688101.16470.44212.313587.216197.57804202.41619.216295.11521.64192.43078.416397.52340.04202.44857.616494.93036.84187.35993.616892.65926.44187.311987.2

5.3Discussions

2,000 1,800 1,600 1,400Time (ms)DPICOOverheadCPU 1,200 1,000 800 600 400 200exploitweb−cgiDPICO(1)DPICO(2)DPICO(3)DPICO(4)DPICO(8)TheDPICOdesigntakesadvantagesofsomefeaturesofmodernFPGAchips,butitsapplicabilityisnotlimitedtoFPGAs.WearguethatDPICOcanbeintegratedasanon-chipacceleratorofageneralpurposeCPU,althoughitisnotthefocusofthispaper.ThecontrollogicofDPICOissimplethusdoesnotdemandsigniﬁcantsiliconarea.On-chipDFAstoragecaneliminatetheoverheadoftransferingpayloadsoverperipheralbuses,however,on-chipmemorybanksingeneralpurposeCPUsareexpensiveinareaandpowercon-sumption.Correlationsamongmemorysize,performanceandpowerisworthystudying.Aspatternmatchingrulesetsexpand,thelimitedon-chipmemorymaygetfullyuti-lizedthusrequiringswappingunusedDFAstomainmemory.So,itisalsointerestingtoinvestigatethetrade-oﬀsbetweenadedicatedon-chipDFAmemoryandasharedDFA/cachememory,andrelatedDFAreplacementpolicies.

06.CONCLUSION

Figure11:PerformanceImprovementtox86/PCIeServerArchitecture

InthispaperwepresentahighspeedDPIengine,DPICO,whichisshowntobemosteﬀectivewhentheaveragetran-sitionratioofaDFAissigniﬁcantlylessthan0.5.Thepro-posedtechniquescanreducethememoryusageofDFAsby90%.PerformanceevaluationresultsshowthatapipelinedimplementationswithmultipleDPICOenginescanreachto-talathroughputupto17.7GbpsincontemporaryFPGAde-vices.Experimentdataalsoshowupto10foldimprovementonpatternmatchingwhenincorporatinganDPICObasedacceleratortoanx86/PCIeserverarchitecturerunningDPIworkload.ThehighspeedandscalabilityofDPICOmakeitapromisingcandidateforawiderangeofDPIapplicationssuchasnetworkintrusiondetectionandspamﬁltering.

202

Acknowledgment

ThisworkissupportedinpartbytheNationalScienceFoun-dationundergrantNo.CNS0709001andagrantfromIntelResearchCouncil.

7.REFERENCES

[1]Gnugprof.FreeSoftwareFoundation.

[2]TRE:POSIXCompliantRegularExpression

MatchingLibrary.http://laurikari.net/tre/.

[3]Virtex4familyoverview,January2007.Xilinx,Inc.

http://direct.xilinx.com/bvdocs/publications/ds112.pdf.[4]A.V.AhoandM.J.Corasick.Eﬃcientstring

matching:anaidtobibliographicsearch.

CommunicationsoftheACM,18(6):333–340,1975.[5]F.Anjum,D.Subhadrabandhu,andS.Sarkar.

Signature-basedintrusiondetectionforwirelessad-hocnetworks:Acomparativestudyofvariousroutingprotocols.InIEEEVehicularTechnologyConference,October2003.

[6]M.BecchiandS.Cadambi.Memory-eﬃcientregular

expressionsearchusingstatemerging.INFOCOM2007,pagespp.1064–1072,May2007.

[7]JoaoBispo,IoannisSourdis,JoaoM.P.Cardoso,and

StamatisVassiliadis.Synthesisofregularexpressionstargetingfpgas:Currentstatusandopenissues.InInt.WorkshoponAppliedReconﬁgurableComputing(ARC2007),pages179–190,Mangaratiba,Brazil,March2007.

[8]B.H.Bloom.Space-timetrade-oﬀsinhashcoding

withallowableerrors.CommunicationsoftheACM,13(7):422–426,1970.

[9]R.S.BoyerandJ.S.Moore.Afaststringsearching

algorithm.CommunicationsoftheACM,20(10):762–772,1977.

[10]B.C.Brodie,R.K.Cytron,andD.E.Taylor.A

ScalableArchitectureforHigh-Throughput

Regular-ExpressionPatternMatching.InISCA,Boston,MA,June2006.

[11]S.DharmapurikarandJ.Lockwood.Fastandscalable

patternmatchingfornetworkintrusiondetectionsystems.IEEEJournalonSelectedAreasin

Communications,24(10):1781–1792,October2006.[12]S.Kumar,S.Dharmapurikar,P.Crowley,J.Turner,

andF.Yu.Algorithmstoacceleratemultipleregularexpressionmatchingfordeeppacketinspection.InSIGCOMM,Pisa,Italy,September2006.

[13]Cheng-HungLin,Chih-TsunHuang,Chang-Ping

Jiang,andShih-ChiehChang.Optimizationofregularexpressionpatternmatchingcircuitsonfpga.InDATE2006,Munich,Germany,2006.

[14]R.-T.Liu,N.-F.Huang,C.-H.Chen,andC.-N.Kao.

Afaststring-matchingalgorithmfornetwork

processor-basedintrusiondetectionsystems.ACMTransitionsonEmbeddedComputing,xx(12):614–633,2004.

[15]V.Paxson.Asystemfordetectingnetworkintruders

inreal-time.ComputerNetworks,

31(23-24):2435–2463–1792,December1999.[16]V.Paxson,K.Asanovic,S.Dharmapurikar,

J.Lockwood,R.Pang,R.Sommer,andN.Weaver.Rethinkinghardwaresupportfornetworkanalysisandintrusionprevention.InProc.USENIXHotSecurity,August2006.

[17]P.PiyachonandY.Luo.Eﬃcientmemoryutilization

onnetworkprocessorsfordeeppacketinspection.InACMSymposiumonArchitectureforNetworkandCommunicationSystems,SanJose,CA,December2006.

[18]L.Schaelicke,B.MooreT.Slabach,andC.Freeland.

Characterizingtheperformanceofnetworkintrusiondetectionsensors.InProceedingsoftheSixthInternationalSymposiumonRecentAdvancesinIntrusionDetection(RAID2003),LNCS,Springer-Verlag,September2003.[19]Snort.http://www.snort.org/,2003.

[20]L.TanandT.Sherwood.ArchitecturesforBit-Split

StringScanninginIntrusionDetection.IEEEMicro:Micro’sTopPicksfromComputerArchitectureConferences,January-February2006.

[21]N.Weaver,V.Paxson,andJ.M.Gonzalez.The

shunt:Anfpga-basedacceleratorfornetworkintrusionprevention.InProceedingsofthe2007ACM/SIGDA15thinternationalsymposiumonFieldprogrammablegatearrays,Monterey,CA,2007.

[22]S.WuandU.Manber.Fasttextsearching:Allowing

errors.CommunicationsoftheACM,35(10):83–91,1992.

[23]F.Yu,Z.Chen,Y.Diao,T.V.Lakshman,andR.H.

Katz.Fastandmemory-eﬃcientregularexpressionmatchingfordeeppacketinspection.InACMSymposiumonArchitectureforNetworkand

CommunicationSystems,SanJose,CA,December2006.

203

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文