
"""Tokenization class for model MyT5."""

import json
import os
import warnings
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, Union

from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "byte_maps.json"}


class ByteRewriter:
    """
    Byte rewriter class for MyT5 tokenizer.
    This class is used to rewrite bytes using a hash tree. The hash tree is constructed from a set of rewriting rules.

    Args:
        rewriting_rules (`str` or `Dict[str, str]`):
            A path to a json file containing the rewriting rules or a dictionary containing the rewriting rules.
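
    Example (an illustrative rule, not taken from a real byte map; rules map
    space-separated two-character hex byte sequences to one another):

    ```python
    >>> rewriter = ByteRewriter({"61 62": "ff"})  # rewrite the byte pair 0x61 0x62 to 0xff
    >>> rewriter.rewrite_bytes(["61", "62", "63"])
    ['ff', '63']
    >>> rewriter.rewrite_bytes(["ff", "63"], reverse=True)
    ['61', '62', '63']
    ```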

    z[LEAF]rewriting_rulesc                    t        |t              r+t        |d      5 }t        j                  |      }d d d        n't        |t
              st        dt        |             | j                  |      | _	        |j                         D ci c]  \  }}||
 }}}| j                  |      | _        y # 1 sw Y   YxY wc c}}w )NrzDrewriting_rules should be either a path to json file or a dict, got )
isinstancestropenjsonloaddict
ValueErrortypeconstruct_hash_tree	hash_treeitemsreverse_hash_tree)selfr   fkvreverse_rewriting_ruless         ]/var/www/html/venv/lib/python3.12/site-packages/transformers/models/myt5/tokenization_myt5.py__init__zByteRewriter.__init__.   s    os+os+ /q"&))A,/ /OT2VW[\kWlVmn  11/B4C4I4I4K"LDAq1a4"L"L!%!9!9:Q!R/ / #Ms   B3B?3B<r   byte_in_sequencebyte_out_sequencec                     |j                  d      }|j                  d      }|}|D ]  }||vri ||<   ||   } ||| j                  <   y)zL
        Add a leaf with the output byte sequence to the hash tree.
        """
        byte_in_list = byte_in_sequence.split(" ")
        byte_out_list = byte_out_sequence.split(" ")

        tree_pointer = hash_tree
        for b in byte_in_list:
            if b not in tree_pointer:
                tree_pointer[b] = {}
            tree_pointer = tree_pointer[b]

        tree_pointer[self.LEAF] = byte_out_list

    def construct_hash_tree(self, rewriting_rules: Dict[str, str]) -> Dict[str, Union[dict, List[str]]]:
        """
        Construct a hash tree for rewritten byte sequences.
        """
        hash_tree = defaultdict(dict)
        # Every single byte rewrites to itself unless a rule overrides it.
        for b in (f"{x:02x}" for x in range(256)):
            hash_tree[b][self.LEAF] = [b]

        for in_sequence, out_sequence in rewriting_rules.items():
            self.add_leaf(hash_tree, in_sequence, out_sequence)

        return hash_tree

    def search_hash_tree(self, byte_sequence: List[str]) -> Union[None, List[str]]:
        """
        Search the hash tree and return the rewritten byte sequence if found.
        """
        tree_pointer = self.hash_tree
        for b in byte_sequence:
            if b in tree_pointer:
                tree_pointer = tree_pointer[b]
            else:
                return None

        return tree_pointer[self.LEAF]

    def rewrite_bytes(self, in_bytes: List[str], reverse=False) -> List[str]:
        """
        Rewrite a sequence of bytes using the hash tree.

        Args:
            in_bytes (`List[str]`): A list of bytes to be rewritten.
            reverse (`bool`): If True, decoding is performed with the reverse hash tree.
        Returns:
            `List[str]`: The rewritten byte sequence.
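
        Example of the greedy longest-match behavior (hypothetical rules):

        ```python
        >>> rw = ByteRewriter({"61": "aa", "61 62": "ff"})
        >>> rw.rewrite_bytes(["61", "63"])  # only the one-byte rule matches
        ['aa', '63']
        >>> rw.rewrite_bytes(["61", "62"])  # the two-byte rule wins over "61" -> "aa"
        ['ff']
        ```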
        """
        out_bytes = []
        b_start = 0
        b_end = 0

        while b_start < len(in_bytes):
            tree_pointer = self.hash_tree if not reverse else self.reverse_hash_tree
            for j in range(b_start, len(in_bytes)):
                b = in_bytes[j]
                if b in tree_pointer:
                    tree_pointer = tree_pointer[b]
                elif j == b_start:
                    # No rule starts with this byte; emit it unchanged.
                    cur_leaf = [b]
                    b_end = j
                    break
                else:
                    break
                if self.LEAF in tree_pointer:
                    # A full rule matched up to position j; remember its output (longest match wins).
                    cur_leaf = tree_pointer[self.LEAF]
                    b_end = j
            out_bytes.extend(cur_leaf)
            b_start = b_end + 1

        return out_bytes


class MyT5Tokenizer(PreTrainedTokenizer):
    """
    Construct a MyT5 tokenizer.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.
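
    Example (a sketch, assuming a checkpoint that ships the `byte_maps.json` rules file,
    such as `Tomlim/myt5-base` on the Hub):

    ```python
    >>> from transformers import MyT5Tokenizer

    >>> tokenizer = MyT5Tokenizer.from_pretrained("Tomlim/myt5-base")
    >>> input_ids = tokenizer("Life is like a box of chocolates.").input_ids
    >>> text = tokenizer.decode(input_ids)  # round-trips, with a trailing "</s>"
    ```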

    Args:
        vocab_file (`str`): The file containing the byte rewriting rules.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        extra_ids (`int`, *optional*, defaults to 125):
            Add a number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are
            accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1. Extra tokens are
            indexed from the end of the vocabulary up to beginning ("<extra_id_0>" is the last token in the vocabulary
            like in ByT5 preprocessing, see
            [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)).
        additional_special_tokens (`List[str]`, *optional*):
            Additional special tokens used by the tokenizer.
    """

    model_input_names = ["input_ids", "attention_mask"]
    vocab_files_names = VOCAB_FILES_NAMES

    def __init__(
        self,
        vocab_file,
        eos_token="</s>",
        unk_token="<unk>",
        pad_token="<pad>",
        extra_ids=125,
        additional_special_tokens=None,
        **kwargs,
    ) -> None:
        # Add the extra_ids to the list of additional special tokens.
        if extra_ids > 0 and additional_special_tokens is None:
            additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
        elif extra_ids > 0 and additional_special_tokens is not None and len(additional_special_tokens) > 0:
            # Check that we have the right number of extra_id special tokens.
            extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
            if extra_tokens != extra_ids:
                raise ValueError(
                    f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are"
                    " provided to MyT5Tokenizer. In this case the additional_special_tokens must include the"
                    " extra_ids tokens"
                )

        # Force left/right stripping of the special tokens (matches ByT5 behavior).
        pad_token = AddedToken(pad_token, lstrip=True, rstrip=True) if isinstance(pad_token, str) else pad_token
        eos_token = AddedToken(eos_token, lstrip=True, rstrip=True) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=True, rstrip=True) if isinstance(unk_token, str) else unk_token

        self._added_tokens_decoder = {0: pad_token, 1: eos_token, 2: unk_token}
        self.offset = len(self._added_tokens_decoder)
        self._utf_vocab_size = 2**8  # utf is 8 bits

        # Load the decompose/merge byte maps that drive the morphological rewriting.
        with open(vocab_file, "r") as f:
            self.byte_maps = json.load(f)

        self.decompose_rewriter = ByteRewriter(self.byte_maps["decompose_map"])
        self.merge_rewriter = ByteRewriter(self.byte_maps["merge_map"])

        super().__init__(
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            extra_ids=0,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )

    @property
    def vocab_size(self):
        return self._utf_vocab_size

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size + self.offset)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
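
        Example (arbitrary ids, single sequence without special tokens, assuming the
        tokenizer from the class-level example):

        ```python
        >>> tokenizer.get_special_tokens_mask([5, 6, 7])
        [0, 0, 0, 1]
        ```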
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        # Normal case: one EOS token is appended after each sequence.
        if token_ids_1 is None:
            return ([0] * len(token_ids_0)) + [1]
        return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]

    def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
        """Do not add eos again if user already added it."""
        if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
            warnings.warn(
                f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated"
                " eos tokens being added."
            )
            return token_ids
        else:
            return token_ids + [self.eos_token_id]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. MyT5 does not
        make use of token type ids, therefore a list of zeros is returned.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of zeros.
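
        Example (one zero per input id plus one per appended EOS, assuming the tokenizer
        from the class-level example):

        ```python
        >>> tokenizer.create_token_type_ids_from_sequences([5, 6])
        [0, 0, 0]
        ```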
        """
        eos = [self.eos_token_id]

        if token_ids_1 is None:
            return len(token_ids_0 + eos) * [0]
        return len(token_ids_0 + eos + token_ids_1 + eos) * [0]

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A sequence has the following format:

        - single sequence: `X </s>`
        - pair of sequences: `A </s> B </s>`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
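
        Example (`eos_token_id` is 1 for this tokenizer, assuming the tokenizer from the
        class-level example):

        ```python
        >>> tokenizer.build_inputs_with_special_tokens([5, 6])
        [5, 6, 1]
        ```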
        """
        token_ids_0 = self._add_eos_if_not_present(token_ids_0)
        if token_ids_1 is None:
            return token_ids_0
        else:
            token_ids_1 = self._add_eos_if_not_present(token_ids_1)
            return token_ids_0 + token_ids_1

    def _tokenize(self, text: str, **kwargs) -> List[str]:
        """Take as input a string and return a list of strings (tokens) for words/sub-words.
        Represents tokens in two character hex format."""

        tokens = [f"{i:02x}" for i in text.encode("utf-8")]
        tokens = self.morphological_encode(tokens)
        return tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""

        if len(token) != 2:
            token_id = None
        else:
            # Hex byte tokens occupy ids [offset, offset + 255]; ids below offset are special tokens.
            token_id = int(token, 16) + self.offset

        return token_id

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = f"{index - self.offset:02x}"
        return token

    def morphological_encode(self, indices: List[str]) -> List[str]:
        # Decompose and then merge the byte sequence according to the loaded byte maps.
        indices = self.decompose_rewriter.rewrite_bytes(indices, reverse=False)
        indices = self.merge_rewriter.rewrite_bytes(indices, reverse=False)
        return indices

    def morphological_decode(self, indices: List[str]) -> List[str]:
        # Invert the merge step first, then the decompose step.
        indices = self.merge_rewriter.rewrite_bytes(indices, reverse=True)
        indices = self.decompose_rewriter.rewrite_bytes(indices, reverse=True)
        return indices

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        bstring = b""

        # Collect tokens, resolving added-token entries to their string form.
        out_tokens = []
        for token in tokens:
            if token in self.added_tokens_decoder:
                out_tokens.append(self.added_tokens_decoder[token])
            elif token in self.added_tokens_encoder:
                out_tokens.append(token)
            else:
                out_tokens.append(token)

        out_tokens = self.morphological_decode(out_tokens)
        _added_tokens = set(self.added_tokens_decoder.values()) | set(self.added_tokens_encoder)
        for token in out_tokens:
            if token in _added_tokens:
                bstring += bytes(token, "utf-8")
            else:
                bstring += bytes.fromhex(token)
        string = bstring.decode("utf-8", errors="ignore")
        return string

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
            )
        else:
            vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory

        with open(vocab_file, "w", encoding="utf-8") as writer:
            writer.write(json.dumps(self.byte_maps, indent=2, ensure_ascii=False))

        return (vocab_file,)
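

# A hedged end-to-end sketch of the byte pipeline (assumes a checkpoint that ships
# byte_maps.json, e.g. "Tomlim/myt5-base"; `_tokenize` and `_convert_token_to_id` are
# internal helpers and may change):
#
#     tok = MyT5Tokenizer.from_pretrained("Tomlim/myt5-base")
#     hex_tokens = tok._tokenize("hi")  # UTF-8 bytes as 2-char hex, morphologically rewritten
#     ids = [tok._convert_token_to_id(t) for t in hex_tokens]  # hex value + offset of 3 special tokens
#     text = tok.decode(tok("hi").input_ids)  # decodes back to "hi</s>"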