Tuesday, July 27, 2004
In a recent post, Jason Bell asks for help with a regex for validating email addresses. A commenter suggested this:
^[a-zA-Z0-9][_\\-\\.\\w]*@[\\w\\.\\-]+\\.[a-zA-Z]{2,4}$
To make sense of it, I'm going to break it down into its pieces:
1) [a-zA-Z0-9] - Match 1 letter or number
2) [_\\-\\.\\w]* - Match 0 or more of: _ - . or a letter or number
3) @ - The @ character
4) [\\w\\.\\-]+ - Match 1 or more of: - . _ or a letter or number
5) \\. - The . character
6) [a-zA-Z]{2,4} - Match between 2 and 4 of any letter
#2 matches valid username characters and #1 ensures they only start with letters or numbers. This is OK, except you will match usernames like 'a----.....___'.
In the hostname, #4 suffers from the same problem as #2 (your hostname can begin with '....-----'). #5 requires that a hostname contains a '.' character. Didn't your original question ask about not requiring this? And finally #6 ensures your hostname ends in 2 to 4 letters? I think #5 and #6 were supposed to have ()'s around them to group them together. It would be nice to standardize on either [a-zA-Z0-9] or \w too.
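As a quick sanity check of the points above, here is a minimal Java sketch that runs the commenter's pattern against a few hand-picked inputs. The class name, the helper method, and the sample addresses are my own placeholders and not anything from Jason's post.

```java
import java.util.regex.Pattern;

// Illustration only: the commenter's pattern, written as a Java string
// literal (hence the doubled backslashes), tried on a few sample inputs.
final class OriginalPatternDemo {
    static final Pattern ORIGINAL = Pattern.compile(
            "^[a-zA-Z0-9][_\\-\\.\\w]*@[\\w\\.\\-]+\\.[a-zA-Z]{2,4}$");

    // True when the whole candidate string matches the pattern.
    static boolean matches(String candidate) {
        return ORIGINAL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "jason@example.com",          // ordinary address
            "a----.....___@example.com",  // odd username allowed by #2
            "jason@....-----.com",        // odd hostname allowed by #4
            "jason@localhost"             // no '.' in the hostname, so #5 fails
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + matches(sample));
        }
    }
}
```

The first three samples are accepted and only the dotless hostname is rejected, which is the behavior described in the breakdown.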
To fix things up, I'd trim #1 to just \w, since it's nearly equivalent (\w also matches an underscore), change #4 to [\w-]*, and combine #5 and #6 into (\.[\w-]*){0,3}.
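Putting those changes together gives something like the sketch below. The assembled pattern is my own literal reading of the suggestions above, with #2 left unchanged, and the class name and sample addresses are placeholders. Treat it as a way to poke at the behavior and not as a finished validator.

```java
import java.util.regex.Pattern;

// Illustration only: the revised pieces assembled into one pattern.
// #1 becomes \w, #2 stays as it was, #4 becomes [\w-]*,
// and #5 and #6 are merged into (\.[\w-]*){0,3}.
final class RevisedPatternDemo {
    static final Pattern REVISED = Pattern.compile(
            "^\\w[_\\-\\.\\w]*@[\\w-]*(\\.[\\w-]*){0,3}$");

    // True when the whole candidate string matches the pattern.
    static boolean matches(String candidate) {
        return REVISED.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "jason@localhost",           // the '.' is no longer required
            "jason@mail.example.co.uk",  // three dotted groups are allowed
            "jason@a.b.c.d.e",           // four dotted groups are too many
            "_jason@example.com",        // \w lets a username start with '_'
            "jason@"                     // [\w-]* also allows an empty hostname
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + matches(sample));
        }
    }
}
```

Everything except "jason@a.b.c.d.e" is accepted. That includes the address with an empty hostname, so switching #4 from + to * trades one problem for another.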
If you've read this far, you've come to the point. Although I love regular expressions, I've been feeling more and more lately that maybe they are getting overused. Unless you have a solid understanding of finite state automata, you're just programming them by trial and error. Tools and web pages like this can help, but I'm still left concerned about the maintenance of regex-related code by non-regex-loving peers. It's similar to my thoughts on AOP. The average team has average programmers on it. If Java programmers that maintain their own blogs are having trouble with regex, where does that leave average and below average programmers?

6 Comments:
"Some people, when confronted with a problem, think ``I know, I'll use regular expressions.'' Now they have two problems." --jwz
That said, I still think that regexes are the most appropriate tool for problems like email-address validation.
7/28/2004 4:39 AM
Have a look at http://philip.greenspun.com/panda/case-studies, search for the phrase "I was feeling pretty good about the code above". Apparently you need a 3-page regexp to do a proper validation, so at this point you could just try sending a test email to each address and delete the bouncers.
Sebastiano Pilla
7/28/2004 5:17 AM
I definitely don't have a solid understanding of finite state automata :), but I still feel like regex is a skill that Java programmers should acquire. I run across way too much code that uses nested StringTokenizers to break up a String. A very simple regex with nested groupings can greatly simplify this type of common code.
On a side note, IDEA has a regex plugin that is great for testing and running your regexes.
7/28/2004 8:07 AM
Yeah, I agree StringTokenizer, indexOf() and friends are definitely NOT the answer. But something more maintainable than regexes is out there. One thing that can help in this situation is to use a package like commons-validator that already handles email address validation. It may use regexes under the covers, but that code is maintained by a different group of people, not your team.
7/28/2004 10:21 AM
"If Java programmers that maintain their own blogs are having trouble with regex, where does that leave average and below average programmers?"
Which implies that Java programmers who bitch and moan in blogs all the time are above average? ;-)
7/28/2004 12:05 PM
Present company excluded ;)
7/28/2004 12:58 PM