Holes in most preg_match() filtersWednesday, April 4. 2007Wednesday, April 4. 2007 During the last week I was performing some audits and like so often it contained preg_match() filters that were not correct. Most PHP developers use ^ and $ within their regular expressions without actually reading the documentation about what they really achieve. You will find a lot of input filters like the following one. <?phpQuite common way to filter incoming data, isn't it? However the problem is, that the author of such a regular expression did not correctly read the documentation and mistakes the $ character for the definitive end of the subject. However the real meaning, as it is even documented in the PHP manual is that $ means the end of the subject OR not the real end but nearly, only followed by a single '\n' linebreak. This means that the following request will also pass the filter. http://server.tld/index.php?var=012345:XYZ%0aIn several circumstances a newline character can be dangerous. For example when you want to stop HTTP Response Splitting or Email Injection attacks. To correct the above regular expression it is necessary to add the D modifier to it that changes the meaning of the $ specifier to really mean the end of the subject. Here is the corrected example. <?phpI hope this tip helps getting rid of all these wrong filters once and for all. People using ext/filter should prepare for a recompile, too.
Comments
Display comments as
(Linear | Threaded)
Thanks for sharing, Stefan. I guess 90% of the people out there are using (p)regexp for validating an integer and therefore could run into this pitfall.
Also you'll see is_numeric() often, too. That's also wrong because it's NOT validating integers but also floats etc. I found one (maybe) forgotten nice and plain function in the core that IMHO does what we want: ctype_digit() checks if each character in a string is numeric. posted on Wednesday, April 4. 2007
From the first 2 comments I see that I have choosen a bad example. I wanted a very simple regular expression just as demonstration.
Of course for plain numbers a cast to int and similiar things are good. I did not take an email address as example, because there are millions of email regular expressions. And one of those email regular expression that misses the D modifier is the validate email filter in PHP's ext/filter. posted on Wednesday, April 4. 2007
Hi Stefan,
you're right, I focussed on the integer Another idea - maybe for suhosin: GET Parameters with newline characters IMHO are suspicious in general. Maybe this could be added to suhosin?! posted on Wednesday, April 4. 2007
==GET Parameters with newline characters IMHO are suspicious in general.
==Maybe this could be added to suhosin?! It can cut last "\n" in value, so "Foo\nBar\nBoo\n" becomes "Foo\nBar\nBoo" Just my 2cents posted on Wednesday, April 4. 2007
Honestly I pretty much doubt that removing \n would be a good idea.
I am not a friend of modifying content the user sent to me. Therefore the comment of guy #6 is something I would never do. One should never ever try to repair user input. Removing a final \n would be exactly this. The only thing I could do would be removing the whole variable if \n is at the end and that could be fatal if some browser has a bug with multiline edit boxes and formular POSTs over GET. posted on Thursday, April 5. 2007
As this problem does not only affect PHP/PCRE I checked some of my other apps some weeks ago as well. One was using Python/SRE which does not have a D modifier.
But instead of a modifier there is a "replacement" for $ which does what most people would expect -- match only the real end of the string: \Z PCRE has this feature as well, but unfortunately another char is used: \z And I think using \z in PCRE is equivalent to $ plus D modifier, e.g. preg_match('/^[0-9]+\z/", $_GET['number']) There is also \a which matches the beginning of the string, but IMO ^ is equivalent to it so I don't see much sense in replacing it as well. posted on Wednesday, April 4. 2007
I believe Christian is spot on here.
After looking through the docs using \z is probably what most people were expecting out of $. Using the \z works in Perl (http://perldoc.perl.org/perlre.html) and PCRE (http://www.pcre.org/pcre.txt) and therefore PHP preg_* functions (http://www.php.net/manual/en/reference.pcre.pattern.syntax.php). Using the /D is also an option for preg_* functions in PHP, but since there is no equivalent modifier in Perl it is probably easier to just remember to use \z if you really want to match the end. Nice catch Stefan. posted on Wednesday, April 4. 2007
ctype_digit()?
I always try to avoid casual regex for simple scalars - give programmers a variable method and they'll inevitably vary off the beaten track... posted on Wednesday, April 4. 2007
Thank you for reading the other comments prior writing your own comment and stating the obvious again.
posted on Wednesday, April 4. 2007
You're more than welcome, Stefan
posted on Wednesday, April 4. 2007
#6
another anonymous bastard
()
Hi,
IMVHO, i always consider using user submited data as a bad practice; let me explain. You demonstrated us the "trick" of the regexp was flawed, but what if the web developper did: Hey, you see? no linebreak (maybe it would have been better using ()?). Why? because now we save in the clean variable what matched, not the full string; well you understand. Okay, that was simplistic, but it was to say the regexp was not the cause, it was the misuse of the result. posted on Wednesday, April 4. 2007
I have no clue what you were trying to achieve, because s9y most probably killed your PHP tags in the comment.
Try again. "Okay, that was simplistic, but it was to say the regexp was not the cause, it was the misuse of the result." Sorry but you are completely missing the point. It was not the misuse of the result that caused the problem, but the fact that the regular expression was not doing what people are told. Again I can only guess what your code looked like but saying "i always consider using user submited data as a bad practice" is nonsense. For example when it comes to storing user submitted data in the database. I have seen a lot of nonsense filters that stripped everything beside a known good whitelist from the input. This is the wrong approach when it comes to arbitrary user supplied data. When the user tells you that his favourite word is {${phpinfo()}}'/* then it most probably is his favourite word and your job is to put this data into the database and not to enforce your own idea what words you allow him to like. Filters and validators can only be used where filtering and validation makes sense. posted on Wednesday, April 4. 2007
#7
Anonymous
()
Stefan - would you provide us with a link to the page in the PHP manual where it talks about about this? I'm just curious to see what it says. Thanks.
posted on Wednesday, April 4. 2007
Documentation of the D modifier is here:
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php posted on Wednesday, April 4. 2007
Great article. I've read the PHP manual's chapter about PCRE countless times and I knew that the "D" modifier stood for "PCRE_DOLLAR_ENDONLY" but I had always assumed that it was on by default (and the only other mode would be PCRE_DOLLAR_ENDONLY). Thanks to that article now I know exactly the difference between no modifier and the D modifier.
A lot of my own scripts are affected, and even though I don't think it can be exploited in any efficient way (in my scripts that is) it should not be overlooked, so thanks for the heads up. posted on Wednesday, April 4. 2007
Stefan,
I understand the problem with the dollar metacharacter, but what about the carot? I checked out the PHP page you referred to in the comments, but I'm not sure I completely understand that either. By default, does the carot match at the absolute beginning of the line? Or is there a way to inject prior to that using a newline or something else? posted on Wednesday, April 4. 2007
Jason: As opposed to the dollar character, there's nothing stated in the PHP manual about any exceptions in the behaviour of the circumflex character (outside a class of course).
Therefore you can be sure that it asserts the very start of a subject only. posted on Thursday, April 5. 2007
Excellent. That saves a little on the re-factoring I need to do now.
Stefan, thanks for the heads up. This is invaluable advice and I truly appreciate it! posted on Thursday, April 5. 2007
hi!
i know what "$" matches, but i did not understand the possible security hole, you were talking about. you wrote: "In several circumstances a newline character can be dangerous. For example when you want to stop HTTP Response Splitting or Email Injection attacks." could you please tell me some dangerous examples for that? of course "foo@example.org\nspam_bar@example.com" can be evil, but that would not be matched by /^[...]+$/. so how can "foo@example.org\n" damage anything? prosit seth posted on Wednesday, April 11. 2007
Seth,
you are missing the point that the world does not only consist of a single variable. And that email addresses are not only given to the mail() function. A single newline at the end can be a problem for all the stuff that follows. Not long ago some guy wanted "us" (those who read his blog) to add debugging information into SQL queries so that you can see who caused the problem. Now imagine the SQL Query looks like // $firstname $lastname SELECT * FROM ... Now guess what happens if $firstname ends in a \n. Suddendly $lastname ends up in the next line and might allow SQL injection. And this is not the only example. Imagine a session ID sent by the browser ends in a \n and this results in PHP code like header("Set-Cookie: sessid=$sessid\nNext-Header: $userinput"); While header() was never intended to be used with multiple HTTP headers, people used it like that. In the previous example $sessid might end on \n which results in the end of the HTTP Headers before Next-Header: and suddenly $userinput is in the body and might contain JavaScript... There are many more examples where this can be dangerous. f.e. many Shells consider \n the separator between two commands... posted on Thursday, April 12. 2007
You're more meek than I am, seth. There could be exploits, but claiming that there are holes in most preg_match filters is FUD. As #15.1 points out, it's trivial compared to the ereg vulnerability.
For one thing, the php header() documentation says this: Note: Since PHP 4.4.2 and PHP 5.1.2 this function prevents more than one header to be sent at once as a protection against header injection attacks. So if you're staying up to date with security patches, HTTP header injection isn't an issue. Mail header injection and line comment injection are possible, but both depend on rather contrived circumstances. (Okay, let's see . . . two variables on one mail header line . . . and the second one has to have a colon in the middle of a bunch of letters . . . ) And anyone who doesn't break into a cold sweat using user input in a system call shouldn't be touching a computer. posted on Thursday, April 26. 2007
There are holes in most preg_match() filters.
Noone says that they all lead to exploitable problems, but there are holes. So there is no FUD at all, boy. posted on Friday, April 27. 2007
#13
Anonymous
()
I have tried to exploit this in my code and I cannot. I think this never happens in my code because regular expressions are greedy and hence will make the $ match the end of the string.
The following regexp: |^['0-9a-zA-Z\\\/\[\]\(\)\\,\'\~\-\^\@\$\%\*\&\?\:\;\_\.\€\+\|\^\@\%\s\"]+$| will always fail if I set the value to (from PHP): "a\r\n\n\r!\"£$%^&*()_+¬" Invalid characters appear at the end, and the preg_match call fails, as it should. Is there anything I am doing wrong?. Thanks! posted on Thursday, April 12. 2007
Anonymous you misunderstood the complete blog posting.
You cannot inject newlines whereever you want. It is only possible to inject A SINGLE newline at the end. posted on Thursday, April 12. 2007
#14
Norbert
()
Is there some kind of program/PHP script available where you can feed/test your pregexp with junk before releasing it to the public?
This shouldn't be the salvation to the problem, but help finding holes before some hacker does on a production system. posted on Sunday, April 15. 2007
#15
Thomas Tallyce
()
Presumably this is *not* an issue with the ereg* family.
http://www.php.net/regex posted on Wednesday, April 18. 2007
#15.1
Anonymous
()
the ereg functions have more things to worry about than the dollar end, namely null byte injections
posted on Thursday, April 19. 2007
Matching before a newline is actually standard behavior for the $ line anchor, since historically it was used by line filters like egrep where it makes sense. In that environment, you want to be able to specify something like /^[a-z0-9]+$/ to match a line with all lower-case alphanumeric characters. It would be less intuitive to do /^[a-z0-9]+\n$/.
posted on Tuesday, May 15. 2007
i have been reading a lot about regular expresions, but I have not fount the meaning of the / ,, why do they always use it after the " ???? i read about the back slash \ that is used for special characters the + the $ and * but I could not find why they use always the / when using the preg_match
posted on Wednesday, May 16. 2007
#17.1
Stephen
()
I was wondering exactly the same thing. See this thread I found which might make things much clearer - http://www.thescripts.com/forum/thread465073.html
posted on Saturday, June 16. 2007
First of all you should not assume that just because YOU trim user input everyone does this.
Secondly the world does not consist of a single data entity. Where one email address/data entitiy is embedded into headers/shell commands/... there is usually another or even more. A newline at the end of one data entity might not look dangerous to you becaue you think in the size of single elements. However when I have a newline in the middle of a shell command for example this is considered the end of the command and start of a new one. (atleast in many unix shells) So something like system("do_a $userinput1 $userinput2"); suddenly becomes dangerous if $userinput1 has a newline at its end. (PS: don't tell me that you should escape the arguments, because that is not the point. If userinput1 already passed a preg_match("/^[0-9]*$/,....); people assume it is save to put it into the shell command.) With email injection it is the same. It always depends on the context. A single additional newline at the end of an email header will f.e. terminate the headers (because now there are 2 of them). Suddenly everything afterwards is considered part of the email body. So suddenly the attacker is able to manipulate the email body. This is also an email injection attack... Use you imagination... In many contexts a single newline at the end can cause problems. posted on Sunday, June 17. 2007
I'm too new at this to appreciate the discussion here. Just wanted to say your posting helped me. I may not have it in the most efficient format yet, but it seems to permit a string of words with white spaces between them and the use of commas. Most important for me, linefeeds etc are denied.
if (!empty($c['phone'])) { //NOTE: this is an optional field if (!preg_match("/^[a-zA-Z0-9\s\,]+$/D", $c['phone'])){ $problem = TRUE; $error_phone = ('^ Use alphanumeric characters only ^'); } } Thanks posted on Thursday, July 19. 2007
I have some problem whith my date check (FR), what's wrong ?
preg_match("/[0-9]{2}//[0-9]{2}//[0-9]{4}/",$cocr_date) posted on Tuesday, August 14. 2007
Add Comment
|
Calendar
Archives Categories Syndicate This Blog |
|||||||||||||||||||||||||||||||||||||||||||||||||


