</head><body>
Tuesday, July 27, 2004

In a recent post, Jason Bell asks for some help with a regex for validating email addresses. A commenter suggested this:


^[a-zA-Z0-9][_\\-\\.\\w]*@[\\w\\.\\-]+\\.[a-zA-Z]{2,4}$


To make sense of it, I'm going to break it down into its pieces:
1) [a-zA-Z0-9] - Match 1 letter or number
2) [_\\-\\.\\w]* - Match 0 or more of: _ - . or a word character (\w is a letter, number, or _)
3) @ - The @ character
4) [\\w\\.\\-]+ - Match 1 or more of: - . or a word character (letter, number, or _)
5) \\. - The . character
6) [a-zA-Z]{2,4} - Match between 2 and 4 of any letter
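The same breakdown can be written out in Java. This is only a rough sketch: the class name EmailRegexPieces, the constant names, and the looksLikeEmail helper are made up for illustration, and the pieces are simply the commenter's regex split at the points listed above.

```java
import java.util.regex.Pattern;

class EmailRegexPieces {
    // Java string literals double every regex backslash,
    // which is why the commenter's version is full of "\\".
    static final String FIRST_CHAR   = "[a-zA-Z0-9]";    // 1) one letter or number
    static final String REST_OF_USER = "[_\\-\\.\\w]*";  // 2) zero or more of _ - . or \w
    static final String AT           = "@";              // 3) the @ character
    static final String HOST         = "[\\w\\.\\-]+";   // 4) one or more of - . or \w
    static final String DOT          = "\\.";            // 5) a literal .
    static final String SUFFIX       = "[a-zA-Z]{2,4}";  // 6) two to four letters

    // Glue the pieces back together between the ^ and $ anchors.
    static final Pattern EMAIL = Pattern.compile(
        "^" + FIRST_CHAR + REST_OF_USER + AT + HOST + DOT + SUFFIX + "$");

    // Hypothetical helper: true when the whole string matches the pattern.
    static boolean looksLikeEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(EMAIL.pattern());
        System.out.println(looksLikeEmail("jason@example.com"));
    }
}
```

Naming the pieces costs a few lines, but it gives the next maintainer something to read besides punctuation.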

#2 matches valid username characters and #1 ensures they only start with letters or numbers. This is ok, except you will also match usernames like 'a----.....___'.

In the hostname, #4 suffers from the same problem as #2 (your hostname can begin with '....-----'). #5 requires that a hostname contains a '.' character. Didn't your original question ask about not requiring this? And finally, #6 ensures your hostname ends in 2 to 4 letters? I think #5 and #6 were supposed to have ()'s around them to group them together. It would be nice to standardize on either [a-zA-Z0-9] or \w too.
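A quick way to check these complaints is to throw the odd-looking inputs at the commenter's pattern. The sketch below assumes nothing beyond java.util.regex; the class name EmailRegexProblems, the accepts helper, and the sample addresses are made up for illustration.

```java
import java.util.regex.Pattern;

class EmailRegexProblems {
    // The commenter's regex, unchanged.
    static final Pattern EMAIL = Pattern.compile(
        "^[a-zA-Z0-9][_\\-\\.\\w]*@[\\w\\.\\-]+\\.[a-zA-Z]{2,4}$");

    // Hypothetical helper: true when the whole string matches.
    static boolean accepts(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "a----.....___@example.com", // junk username: accepted by #1 + #2
            "a@....-----.com",           // junk hostname: accepted by #4
            "user@localhost",            // no '.' in the hostname: rejected by #5
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

The first two print true and the last prints false, which is the opposite of what you would want from a validator.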

To fix things up I'd trim #1 to just \w, since it's nearly equivalent (\w also allows a leading '_'). Change #4 to [\w-]* and combine #5 and #6 into (\.[\w-]*){0,3}.
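Put together, those changes might look like the sketch below. The assembled pattern is my reading of the suggested changes, with #2 and #3 left as they were. The class name RevisedEmailRegex and the accepts helper are made up for illustration.

```java
import java.util.regex.Pattern;

class RevisedEmailRegex {
    // #1 -> \w, #2 and #3 unchanged, #4 -> [\w-]*, #5 + #6 -> (\.[\w-]*){0,3}
    static final Pattern EMAIL = Pattern.compile(
        "^\\w[_\\-\\.\\w]*@[\\w-]*(\\.[\\w-]*){0,3}$");

    // Hypothetical helper: true when the whole string matches.
    static boolean accepts(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "user@localhost",        // the '.' is now optional
            "user@mail.example.com", // up to three dotted parts after the first label
            "a@....-----.com",       // more than three dots in the hostname
            "_user@example.com",     // \w lets a username start with '_'
            "user@",                 // [\w-]* also allows an empty hostname
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Only the third sample is rejected. The last two are accepted, so the revision still needs work: the username start is looser than before, and using * instead of + lets the hostname be empty.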

If you've read this far, you've come to the point. Although I love regular expressions, I've been feeling more and more lately that maybe they are getting overused. Unless you have a solid understanding of finite state automata, you're just programming them by trial and error. Tools and web pages like this can help, but I'm still left concerned about the maintenance of regex-related code by non-regex-loving peers. It's similar to my thoughts on AOP. The average team has average programmers on it. If Java programmers who maintain their own blogs are having trouble with regex, where does that leave average and below-average programmers?