How to define tokenizing rules
I want to tokenize strings like:
'my name.is(johnny ,knoxville):'
into:
['my', 'name', '.', 'is', '(johnny ,knoxville)', ':']
As you can see, whitespace separates tokens, non-alphanumeric chars are
not grouped with alphanumeric chars, and there is one exception:
everything enclosed in parentheses is taken as a single token.
I'm not sure whether I should use Python's `re` module, some other Python
module I don't know about, or an external library like pyparsing.
Any ideas?
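One workable approach, sketched here under the assumption that parentheses are never nested, is a single regex with `re.findall`: try the parenthesized-group pattern first, then runs of alphanumerics, then single punctuation characters. Whitespace between tokens is skipped automatically because it matches none of the alternatives.

```python
import re

def tokenize(s):
    # Alternatives are tried left to right, so order encodes priority:
    #   \([^)]*\)  -> a whole parenthesized group (assumes no nesting)
    #   \w+        -> a run of alphanumeric/underscore characters
    #   [^\w\s]    -> any single non-alphanumeric, non-whitespace char
    pattern = r'\([^)]*\)|\w+|[^\w\s]'
    return re.findall(pattern, s)

print(tokenize('my name.is(johnny ,knoxville):'))
# ['my', 'name', '.', 'is', '(johnny ,knoxville)', ':']
```

If you do need nested parentheses, a plain regex won't suffice and a real parser (e.g. pyparsing's `nestedExpr`) becomes the better fit.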