</head><body>
6 comments | Tuesday, July 27, 2004

In a recent post, Jason Bell asks for some help with a regex for validating email addresses. A commenter suggested this:


^[a-zA-Z0-9][_\\-\\.\\w]*@[\\w\\.\\-]+\\.[a-zA-Z]{2,4}$


To see whether it makes sense, I'm going to break it down into its pieces:
1) [a-zA-Z0-9] - Match 1 letter or number
2) [_\\-\\.\\w]* - Match 0 or more of: _ - . or a letter or number
3) @ - The @ character
4) [\\w\\.\\-]+ - Match 1 or more of: _ - . or a letter or number
5) \\. - The . character
6) [a-zA-Z]{2,4} - Match between 2 and 4 of any letter
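
As a rough sketch (not from the original thread), the six pieces can be glued back together in Java to confirm that they add up to the suggested pattern. The class name, constant names, and sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

class EmailRegexPieces {
    // The six pieces from the list above, as Java string literals
    // (hence the doubled backslashes).
    static final String FIRST_CHAR = "[a-zA-Z0-9]";    // 1
    static final String USER_REST  = "[_\\-\\.\\w]*";  // 2
    static final String AT         = "@";              // 3
    static final String HOST       = "[\\w\\.\\-]+";   // 4
    static final String DOT        = "\\.";            // 5
    static final String TLD        = "[a-zA-Z]{2,4}";  // 6

    // Anchored with ^ and $ exactly as in the commenter's suggestion.
    static final Pattern EMAIL = Pattern.compile(
        "^" + FIRST_CHAR + USER_REST + AT + HOST + DOT + TLD + "$");

    static boolean looksLikeEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com"));       // true
        System.out.println(looksLikeEmail("no-at-sign.example.com")); // false
    }
}
```

Naming the pieces costs a few lines, but it gives the next reader something to hang on to besides a wall of backslashes.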

#2 matches valid username characters and #1 ensures they only start with letters or numbers. This is OK, except that it will also match usernames like 'a----.....___'.

In the hostname, #4 suffers from the same problem as #2 (your hostname can begin with '....-----'). #5 requires that a hostname contain a '.' character. Didn't your original question ask about not requiring this? And finally, #6 ensures your hostname ends in 2 to 4 letters? I think #5 and #6 were supposed to have ()'s around them to group them together. It would be nice to standardize on either [a-zA-Z0-9] or \w too.
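
To make these complaints concrete, here is a small Java sketch that runs the suggested pattern against the problem cases. The class name, method name, and sample addresses are hypothetical and not part of the original discussion.

```java
import java.util.regex.Pattern;

class EmailRegexWeakSpots {
    // The commenter's suggested pattern, unchanged.
    static final Pattern SUGGESTED = Pattern.compile(
        "^[a-zA-Z0-9][_\\-\\.\\w]*@[\\w\\.\\-]+\\.[a-zA-Z]{2,4}$");

    static boolean accepts(String s) {
        return SUGGESTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // #2 lets the username trail off into punctuation.
        System.out.println(accepts("a----.....___@example.com")); // true
        // #4 lets the hostname start with dots and dashes.
        System.out.println(accepts("bob@....-----.com"));         // true
        // #5 insists on a '.', so a bare hostname is rejected.
        System.out.println(accepts("bob@localhost"));             // false
    }
}
```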

To fix things up, I'd trim #1 to just \w since it's nearly equivalent (\w also accepts an underscore). Change #4 to [\w-]* and combine #5 and #6 into (\.[\w-]*){0,3}.
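
Putting those changes together gives something like the sketch below. This is one reading of the fix-up, with #2 and #3 left as they were. The class name, method name, and sample addresses are made up for illustration.

```java
import java.util.regex.Pattern;

class RevisedEmailRegex {
    // #1 -> \w, #4 -> [\w-]*, #5 and #6 -> (\.[\w-]*){0,3}
    static final Pattern REVISED = Pattern.compile(
        "^\\w[_\\-\\.\\w]*@[\\w-]*(\\.[\\w-]*){0,3}$");

    static boolean accepts(String s) {
        return REVISED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A hostname without a '.' is now allowed.
        System.out.println(accepts("bob@localhost"));          // true
        // Up to three dotted groups after the first label.
        System.out.println(accepts("bob@mail.example.co.uk")); // true
        // Five dots would need five groups, so this no longer passes.
        System.out.println(accepts("bob@....-----.com"));      // false
        // \w includes '_', so a leading underscore gets through.
        System.out.println(accepts("_bob@example.com"));       // true
        // Every hostname piece is optional, so an empty host passes too.
        System.out.println(accepts("bob@"));                   // true
    }
}
```

The last two cases are side effects that may or may not be wanted, which is rather the point: each tweak fixes one thing and quietly changes another.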

If you've read this far, you've come to the point. Although I love regular expressions, I've been feeling more and more lately that maybe they are getting overused. Unless you have a solid understanding of finite state automata, you're just programming them by trial and error. Tools and web pages like this can help, but I'm still left concerned about the maintenance of regex-related code by non-regex-loving peers. It's similar to my thoughts on AOP. The average team has average programmers on it. If Java programmers that maintain their own blogs are having trouble with regex, where does that leave average and below average programmers?

6 Comments:

Anonymous said...

"Some people, when confronted with a problem, think ``I know, I'll use regular expressions.'' Now they have two problems." --jwz

That said, I still think that regexes are the most appropriate tool for problems like email-address validation.

7/28/2004 4:39 AM

 
Anonymous said...

Have a look at http://philip.greenspun.com/panda/case-studies and search for the phrase "I was feeling pretty good about the code above". Apparently you need a 3-page regexp to do a proper validation, so at this point you could just try sending a test email to each address and delete the bouncers.
Sebastiano Pilla

7/28/2004 5:17 AM

 
Blogger wlyvers said...

I definitely don't have a solid understanding of finite state automata :), but I still feel like regex is a skill that Java programmers should obtain. I run across way too much code that uses nested StringTokenizers to break up a String. A very simple regex with nested groupings can greatly simplify this type of common code.
On a side note, IDEA has a regex plugin that is great for testing and running your regexes.
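
For illustration only, the StringTokenizer point might look like this in Java: the same key=value parsing done with nested tokenizers and then with a single pattern and capturing groups. The input format, class name, and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerVsRegex {
    // Nested-tokenizer style: split on ';', then split each piece on '='.
    static Map<String, String> withTokenizers(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        StringTokenizer pairs = new StringTokenizer(input, ";");
        while (pairs.hasMoreTokens()) {
            StringTokenizer kv = new StringTokenizer(pairs.nextToken(), "=");
            if (kv.countTokens() == 2) {
                result.put(kv.nextToken(), kv.nextToken());
            }
        }
        return result;
    }

    // Regex style: group 1 is the key, group 2 is the value.
    static final Pattern PAIR = Pattern.compile("(\\w+)=(\\w+)");

    static Map<String, String> withRegex(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "user=paul;lang=java";
        System.out.println(withTokenizers(input).equals(withRegex(input))); // true
    }
}
```

The two versions agree on well-formed input like this; on messier input they can diverge, so treat it as a sketch of the idea, not a drop-in replacement.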

7/28/2004 8:07 AM

 
Blogger Paul said...

Yeah, I agree StringTokenizer, indexOf() and friends are definitely NOT the answer. But something more maintainable than regexes is out there. One thing that can help in this situation is to use a package like commons-validator that already handles email address validation. It may use regexes under the covers, but that code is maintained by a different group of people, not your team.

7/28/2004 10:21 AM

 
Anonymous said...

"If Java programmers that maintain their own blogs are having trouble with regex, where does that leave average and below average programmers?"

Which implies that Java programmers who bitch and moan in blogs all the time are above average? ;-)

7/28/2004 12:05 PM

 
Blogger Paul said...

Present company excluded ;)

7/28/2004 12:58 PM

 
