Let's Parse Things Out

When it comes to hacking, one of the most common things you try to take advantage of is user input. Any time a piece of software takes information directly from a user, there is a chance for exploitation. Instead of going over types of vulnerabilities or attack methods, today I am going to do a deep dive into something a bit different. We are going to reimplement the JSON.parse function. Projects like this are GREAT ways to not only beef up your development skills but also to think like a hacker and learn how the inner workings of things like parsers operate. Any time you are trying to exploit something through user input, that input is going to go through some type of parser. So just like the more you know about the TCP protocol the better equipped you are to take advantage of or manipulate a network, the more you know about things like parsers the better equipped you'll be when it comes time to exploit one.

Parsers are everywhere. We use them all the time and never even notice, but I guarantee that if you've ever had a project where you had to write your own parser, you came out with a newfound appreciation for them. Another one of my goals with this post is to show the importance of breaking down a problem into smaller bits and how useful and powerful that can be. I am also going to try to throw in some cool lesser-used JavaScript techniques.

Let's begin!

First, let's clearly define the problem by asking some important questions. What exactly does JSON.parse do? What does it expect to take in and what do you expect it to output? Some of you may even be asking, "What is JSON to begin with?" If you are reading this I am going to assume you probably know what JSON is, but for those that don't, here is a quick rundown.

JSON is short for JavaScript Object Notation, and is a way to store information in an organized, easy-to-access manner. In a nutshell, it gives us a human-readable collection of data that we can access in a really logical manner.

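Below is an example of the kind of JSON-style data you might get back from the Google Maps API. (Strictly speaking, the new GLatLng(...) values and the trailing comma after the last marker make this a JavaScript object literal rather than valid JSON, so JSON.parse would reject it as written.)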

{
    "markers": [
        {
            "point":new GLatLng(40.266044,-74.718479), 
            "homeTeam":"Lawrence Library",
            "awayTeam":"LUGip",
            "markerImage":"images/red.png",
            "information": "Linux users group meets second Wednesday of each month.",
            "fixture":"Wednesday 7pm",
            "capacity":"",
            "previousScore":""
        },  
        {
            "point":new GLatLng(40.211600,-74.695702),
            "homeTeam":"Hamilton Library",
            "awayTeam":"LUGip HW SIG",
            "markerImage":"images/white.png",
            "information": "Linux users can meet the first Tuesday of the month to work out harward and configuration issues.",
            "fixture":"Tuesday 7pm",
            "capacity":"",
            "tv":""
        },
        {
            "point":new GLatLng(40.294535,-74.682012),
            "homeTeam":"Applebees",
            "awayTeam":"After LUPip Mtg Spot",
            "markerImage":"images/newcastle.png",
            "information": "Some of us go there after the main LUGip meeting, drink brews, and talk.",
            "fixture":"Wednesday whenever",
            "capacity":"2 to 4 pints",
            "tv":""
        },
    ] 
}

So if you wanted to access something in this data, you would reference something like markers[1].homeTeam and get the value "Hamilton Library" returned to you. Here is a link to a great in-depth piece about JSON.
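As a small illustration of that kind of lookup, here is a sketch using a trimmed-down version of the data above. The variable name data is my own choice for the example, and the point fields are left out to keep it short.

```javascript
// A trimmed-down copy of the markers data (hypothetical variable name: data)
var data = {
    "markers": [
        { "homeTeam": "Lawrence Library", "fixture": "Wednesday 7pm" },
        { "homeTeam": "Hamilton Library", "fixture": "Tuesday 7pm" },
        { "homeTeam": "Applebees", "fixture": "Wednesday whenever" }
    ]
};

// Arrays are zero-indexed, so markers[1] is the second marker
var team = data.markers[1].homeTeam;
console.log(team); // "Hamilton Library"

// Bracket notation works too, which is handy when the key is in a variable
var key = 'fixture';
console.log(data.markers[2][key]); // "Wednesday whenever"
```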

Next question. What exactly does JSON.parse do?

JSON Example

Above is a short and simple example of what JSON.parse does and why you use it. The TLDR is that when you transmit data over a network, like for example the internet, it is best transmitted as strings. Now I could go into serious detail as to why that is, but now is not the time or place; here is an article explaining it. Anyways, for our purposes let's just say, "OK, we ship our data over networks as strings." This means that whenever we want to send something like JSON data we first need to "stringify" it. That word is pretty self-explanatory: it takes whatever you give it and turns the entire thing into one giant string. I'll be putting out another blog post on how you would go about writing JSON.stringify, but that's for another time. OK, so we have stringified the data and have sent it over the wire to whoever requested it. Now what?
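To make the stringify step concrete, here is a minimal sketch using the built-in JSON.stringify. The payload object and the variable names are made up for the example.

```javascript
// A hypothetical payload we want to send over the wire
var payload = { name: 'Ada', langs: ['js', 'c'], active: true };

// JSON.stringify turns the whole structure into one string
var wire = JSON.stringify(payload);

console.log(typeof wire); // "string"
console.log(wire);        // {"name":"Ada","langs":["js","c"],"active":true}

// It is just text now, so property access no longer works
console.log(wire.name);   // undefined
```

That last line is the whole reason the receiving side needs a parser.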

Here is where JSON.parse comes into play. The person receiving the data doesn't want or need a giant string representation of the data they asked for. They need the original data back. Enter our parser...

var parseJSON = function(json) {
    // The index of the current character
    var index = 0;

    // The current character
    var ch = ' ';

    // This is for special characters that will need to be escaped
    var escapee = {
        '"': '"',
        '\\': '\\',
        '/': '/',
        b: '\b',
        f: '\f',
        n: '\n',
        r: '\r',
        t: '\t'
    };

    // Move to the next character and check that it is what we expect
    var nextCh = function(c) {
        // If ch is not what we expect, error
        if (c && c !== ch) {
            throw new SyntaxError('Expected "' + c + '" instead of "' + ch + '"');
        }

        ch = json.charAt(index);
        index += 1;

        return ch;
    };

The first thing we do above is declare some simple variables we will need to iterate over the data passed in. index keeps track of our position in the string, ch holds the current char, and then we have an escape reference object that maps the char following a backslash to the special char it stands for.

Following that we declare an internal function called nextCh. This is fairly self-explanatory: if we pass it a char, it first checks that the current char is the one we expect, and if not, it throws an error. Then it moves on to the next char. The check may seem out of place here, but trust me, you will see how it comes in handy later.
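Since nextCh only makes sense inside the parser, here is a standalone sketch of the same idea that you can run on its own. The makeCursor helper and its current/next methods are names I made up for this illustration; the logic mirrors nextCh.

```javascript
// Standalone sketch of the nextCh idea (makeCursor is a made-up helper)
var makeCursor = function(text) {
    var index = 0;
    var ch = ' ';

    return {
        current: function() { return ch; },
        next: function(expected) {
            // Only complain when the caller told us what to expect
            if (expected && expected !== ch) {
                throw new SyntaxError('Expected "' + expected + '" instead of "' + ch + '"');
            }
            // charAt returns '' past the end, which is falsy and ends loops
            ch = text.charAt(index);
            index += 1;
            return ch;
        }
    };
};

var cursor = makeCursor('tru!');
cursor.next();      // load the first character: 't'
cursor.next('t');   // current is 't' as expected, advance to 'r'
cursor.next('r');   // advance to 'u'
cursor.next('u');   // advance to '!'

var message = '';
try {
    cursor.next('e'); // current is '!', not 'e'
} catch (err) {
    message = err.message;
}
console.log(message); // Expected "e" instead of "!"
```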

From here we are going to continue the trend of smaller unique functions that will take care of specific tasks. Below we take care of parsing numbers.

var numberParse = function() {
    var number;
    var string = '';

    if (ch === '-') {
        string = '-';
        nextCh('-');
    }
    while (ch >= '0' && ch <= '9') {
        string += ch;
        nextCh();
    }
    if (ch === '.') {
        string += ch;
        while (nextCh() && ch >= '0' && ch <= '9') {
            string += ch;
        }
    }

    // Scientific notation
    if (ch === 'e' || ch === 'E') {
        string += ch;
        nextCh();
        if (ch === '-' || ch === '+') {
            string += ch;
            nextCh();
        }
        while (ch >= '0' && ch <= '9') {
            string += ch;
            nextCh();
        }
    }

    // parseFloat takes a single argument (there is no radix parameter)
    number = parseFloat(string);

    // Check that number is valid
    if (!isFinite(number)) {
        throw new SyntaxError('Bad number');
    } else {
        return number;
    }
};

At first glance this block of code may seem daunting, but if you give it a once-over you will notice it is pretty straightforward. We use our previously defined function to iterate over the chars, and as long as they fit our definition of a number, we append each one to a string. First we check for a leading minus sign, then digits from 0 to 9, then an optional decimal part, and finally scientific notation. At the very end, we convert the string to a number and check that it is actually valid with a little-known function, isFinite(). This just checks to make sure the value passed is a finite number (not NaN or Infinity) and safe for us to return.
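To see why that final isFinite() check earns its keep, here is a short sketch of the conversion step on its own. The input strings and variable names are just examples I picked.

```javascript
// parseFloat turns the collected characters into a number
var negative = parseFloat('-12.5');
var scientific = parseFloat('6.02e23');
console.log(negative);   // -12.5
console.log(scientific); // 6.02e+23

// A lone minus sign has no digits, so parseFloat gives NaN
var lonelyMinus = parseFloat('-');
console.log(isFinite(lonelyMinus)); // false

// A huge exponent overflows to Infinity, which is also rejected
var tooBig = parseFloat('1e999');
console.log(isFinite(tooBig)); // false
```

Both of the last two cases would make numberParse throw 'Bad number' instead of quietly handing back NaN or Infinity.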

Next up we take care of parsing string values, which is a bit more involved. There is a bit going on in the code below, so let's break it down.

var stringParse = function() {
    var hex, i, unicodeValue;
    var string = '';

    if (ch === '"') {
        while (nextCh()) {
            if (ch === '"') {
                nextCh();
                return string;
            }

            // Parse escaped characters
            if (ch === '\\') {
                nextCh();
                if (ch === 'u') {
                    unicodeValue = 0;
                    for (i = 0; i < 4; i += 1) {
                        hex = parseInt(nextCh(), 16);
                        if (!isFinite(hex)) {
                            break;
                        }
                        unicodeValue = unicodeValue * 16 + hex;
                    }
                    string += String.fromCharCode(unicodeValue);
                } else if (typeof escapee[ch] === 'string') {
                    string += escapee[ch];
                } else {
                    break;
                }
            } else {
                string += ch;
            }
        }
    }
    throw new SyntaxError('Bad string');
};

We start by again defining an internal function, this time stringParse. You may be wondering why we are declaring hex, i and unicodeValue. They are used to decode characters that are sent over as Unicode escapes. Any character with a character code lower than 65536 can be escaped using the four-digit hexadecimal value of its character code, prefixed with \u. Now I could get really in depth, but once again that is for another post. If you are confused by what is going on with the escaping of Unicode characters, follow this link for a more in-depth explanation.
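Here is the hex-to-character step pulled out into a standalone sketch. decodeUnicodeEscape is a made-up helper name, and unlike the loop in stringParse (which breaks out on a bad digit) this version throws, purely to keep the example short.

```javascript
// decodeUnicodeEscape is a made-up helper: it takes the four hex digits
// that follow \u and returns the character they stand for
var decodeUnicodeEscape = function(hexDigits) {
    var value = 0;
    var i, digit;

    for (i = 0; i < 4; i += 1) {
        digit = parseInt(hexDigits.charAt(i), 16);
        if (!isFinite(digit)) {
            throw new SyntaxError('Bad unicode escape');
        }
        // Shift what we have one hex place left, then add the new digit
        value = value * 16 + digit;
    }
    return String.fromCharCode(value);
};

console.log(decodeUnicodeEscape('0041')); // "A"
console.log(decodeUnicodeEscape('00e9')); // "é"
```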

// Remove whitespace between characters
var compressWhitespace = function() {
    while (ch && ch <= ' ') {
        nextCh();
    }
};

var booleanParse = function() {
    switch (ch) {
    case 't':
        nextCh('t');
        nextCh('r');
        nextCh('u');
        nextCh('e');
        return true;
    case 'f':
        nextCh('f');
        nextCh('a');
        nextCh('l');
        nextCh('s');
        nextCh('e');
        return false;
    case 'n':
        nextCh('n');
        nextCh('u');
        nextCh('l');
        nextCh('l');
        return null;
    }
    throw new SyntaxError('Unexpected "' + ch + '"');
};

Next up we take care of the literal values true, false and null. For some of you newer developers, especially those who only know JavaScript, the switch statement might be a bit foreign. I know there has been some back and forth as to whether or not they are a good thing to teach beginners. Personally I think they are GREAT. You just need to make sure you fully understand the rules of how they work and REMEMBER them when implementing. Here is a good resource in case you are unfamiliar. Essentially a switch stands in for a chain of if/else statements: execution jumps to the matching 'case' and, unless you return or break, flows down into the cases below it. When the current char is the first letter of one of the words it is looking for, our function uses nextCh to check each remaining letter in turn and then returns the actual JavaScript value instead of the string. Because 'false' !== false.
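If the switch syntax is new to you, here is a smaller sketch of the same idea that works on whole words instead of one char at a time. toLiteral is a hypothetical helper, not part of the parser.

```javascript
// toLiteral is a hypothetical helper that maps a keyword string to its value
var toLiteral = function(word) {
    switch (word) {
    case 'true':
        return true;
    case 'false':
        return false;
    case 'null':
        return null;
    }
    // No case matched, so we fall out of the switch
    throw new SyntaxError('Unexpected "' + word + '"');
};

console.log('false' === false);            // false: a string is not a boolean
console.log(toLiteral('false') === false); // true
console.log(toLiteral('null'));            // null
```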

var arrayParse = function() {
    var array = [];

    if (ch === '[') {
        nextCh('[');
        compressWhitespace();
        if (ch === ']') {
            nextCh(']');
            // Array is empty
            return array;
        }
        while (ch) {
            array.push(initiateParse());
            compressWhitespace();
            if (ch === ']') {
                nextCh(']');
                return array;
            }
            nextCh(',');
            compressWhitespace();
        }
    }
    throw new SyntaxError('Bad array');
};

Next up...arrays. This should be pretty straightforward. First we check to see if the current char is '[', and if it is AND it is then followed by ']', we simply return the empty array. If, on the other hand, the array actually contains something, we use a while loop. Those of you with a keen eye may have noticed that I make a call to the function initiateParse, which, if you have been paying attention, doesn't exist...(yet). Oh the wonders of JavaScript! This works because the name is only looked up when arrayParse actually runs, and by then initiateParse has been assigned. From there we cycle through the values inside of the array, parsing each one, skipping any whitespace and expecting a comma between them, until we reach the closing ']' and return the parsed version of the array along with all of its contents!
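The "call a function that doesn't exist yet" trick is easier to see in isolation. Here is a tiny sketch with two made-up functions, isEven and isOdd, that call each other the same way arrayParse and initiateParse do.

```javascript
// isEven refers to isOdd before isOdd has been assigned a value
var isEven = function(n) {
    return n === 0 ? true : isOdd(n - 1);
};

// By the time anything calls isEven, this assignment has already run
var isOdd = function(n) {
    return n === 0 ? false : isEven(n - 1);
};

console.log(isEven(4)); // true
console.log(isOdd(4));  // false
```

Had we called isEven(4) between the two definitions, isOdd would still be undefined at that point and the call would throw a TypeError.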

var objectParse = function() {
    var key;
    var object = {};

    if (ch === '{') {
        nextCh('{');
        compressWhitespace();
        if (ch === '}') {
            nextCh('}');
            // Object is empty
            return object;
        }
        while (ch) {
            key = stringParse();
            compressWhitespace();
            nextCh(':');
            if (Object.hasOwnProperty.call(object, key)) {
                throw new SyntaxError('Duplicate key "' + key + '"');
            }
            object[key] = initiateParse();
            compressWhitespace();
            if (ch === '}') {
                nextCh('}');
                return object;
            }
            nextCh(',');
            compressWhitespace();
        }
    }
    throw new SyntaxError('Bad object');
};

When it comes to parsing objects, you will notice this function looks awfully similar to arrayParse...and that's because it is. Same initial check to see if the object actually contains anything, and if it does, we initiate the parsing of each value inside. Now although it is similar, it's not exactly the same. Remember objects are only allowed to have unique keys, meaning once you use, for example, the string "name" as a key, it cannot be used again. So we use the native Object.hasOwnProperty function on the already parsed portion of the object to make sure we aren't going to pass back an object with a silently overwritten value. The hasOwnProperty function is simple. Called this way, it takes the object and the key you are looking for and returns true or false. So for each iteration we check to make sure the key being parsed does not already exist.
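Here is a short sketch of that key check on its own, plus a comparison worth knowing about: the built-in JSON.parse does not reject duplicate keys, it keeps the last value, so our parser is deliberately stricter on this point. The has helper and the sample data are made up for the example.

```javascript
var seen = {};
seen.name = 'first';

// Calling through Object.prototype works even if the object
// has its own property that happens to be called hasOwnProperty
var has = function(object, key) {
    return Object.prototype.hasOwnProperty.call(object, key);
};

console.log(has(seen, 'name'));     // true
console.log(has(seen, 'toString')); // false: inherited, not an own key

// The built-in parser accepts a repeated key and keeps the last value
var builtin = JSON.parse('{"name": "first", "name": "second"}');
console.log(builtin.name); // "second"
```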

// Initiate parsing of JSON string
var initiateParse = function() {
    compressWhitespace();
    switch (ch) {
    case '{':
        return objectParse();
    case '[':
        return arrayParse();
    case '"':
        return stringParse();
    case '-':
        return numberParse();
    default:
        return ch >= '0' && ch <= '9' ? numberParse() : booleanParse();
    }
};

    // Kick off the parse, then make sure only whitespace is left over
    var result = initiateParse();
    compressWhitespace();
    if (ch) {
        throw new SyntaxError('Syntax error');
    }

    return result;
};

And we save the best for last! That super secret function you kept seeing...here it finally is in all of its glory! As you can see, all it does is look at the current char and determine which one of our awesome functions to hand off to next. Again, if you aren't familiar with switch statements, the default case at the end works just like it sounds. It acts as a catch-all for values that don't match the ones we have defined above. Meaning if it comes across something other than '{', '[', '"' or '-', the logic flows down to the default and hits our ternary expression, which sends digits to numberParse and everything else to booleanParse. Once initiateParse returns, we skip any trailing whitespace and throw an error if there are any chars left over. That, folks, is pretty much it!
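To make the routing easy to poke at, here is a standalone sketch of just the dispatch decision. whichParser is a made-up helper that returns the name of the function that would be called, rather than calling it.

```javascript
// whichParser is a made-up helper that names the function the
// dispatcher would hand control to for a given first character
var whichParser = function(ch) {
    switch (ch) {
    case '{':
        return 'objectParse';
    case '[':
        return 'arrayParse';
    case '"':
        return 'stringParse';
    case '-':
        return 'numberParse';
    default:
        return ch >= '0' && ch <= '9' ? 'numberParse' : 'booleanParse';
    }
};

console.log(whichParser('{')); // objectParse
console.log(whichParser('7')); // numberParse
console.log(whichParser('t')); // booleanParse
console.log(whichParser('x')); // booleanParse, which would then throw on 'x'
```

Notice that anything unrecognized lands in booleanParse, so that function doubles as the place where garbage input gets rejected.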

I will post a copy of the complete code at the bottom of the post as well, but this has been a great exercise for many reasons. First, notice how we made sure to make things modular and separated as much of the functionality out into separate pieces as we could. This is absolutely crucial for something like this. When you are going to have a single function that internally is going to be this complex, the last thing you or anyone else wants is spaghetti code. This way, when it comes to debugging or changing any one part of the function, we can isolate the problem or upgrade without the rest of the functionality being disturbed. I am sure you have heard this principle before, and if you have worked on any larger projects you know firsthand how important and powerful this idea is. My favorite explanation of this principle was given by Rich Hickey (creator of Clojure). He has given the talk several times on many different platforms because the core idea can be applied to any area of development in any language. Here is a link to the one I enjoy the most.

OK, we flexed our divide-and-conquer/modularity muscles, but what else did we do? Well hopefully, like I mentioned in the beginning, you now have an appreciation for the work that goes into something as seemingly simple and mundane as a parser function. With that being said, as a hacker this is a great way to start thinking about the core functionality of ANY parser, what similarities parsers might share, and how each would differ depending on the language or the type of data you are trying to parse. You can start to think of ways to trick them, bypass logic and start injecting things you want in the places you want. This is my favorite part of the hacker ethos: the desire, and frankly the obligation if you want to be good, to dig deeper and understand more fully, so that you might be able to see the larger picture, find a flaw and exploit it.

I hope you all have enjoyed this endeavor as much as I have. Until next time remember ###Everything can be Hacked###

My apologies for any strange looking formatting

var parseJSON = function(json) {


    // The index of the current character
    var index = 0;

    // The current character
    var ch = ' ';
    var escapee = {
        '"': '"',
        '\\': '\\',
        '/': '/',
        b: '\b',
        f: '\f',
        n: '\n',
        r: '\r',
        t: '\t'
    };

    // Move to the next character and check that it is what we expect
    var nextCh = function(c) {
        // If ch is not what we expect, error
        if (c && c !== ch) {
            throw new SyntaxError('Expected "' + c + '" instead of "' + ch + '"');
        }

        ch = json.charAt(index);
        index += 1;

        return ch;
    };

    var numberParse = function() {
        var number;
        var string = '';

        if (ch === '-') {
            string = '-';
            nextCh('-');
        }
        while (ch >= '0' && ch <= '9') {
            string += ch;
            nextCh();
        }
        if (ch === '.') {
            string += ch;
            while (nextCh() && ch >= '0' && ch <= '9') {
                string += ch;
            }
        }

        // Scientific notation
        if (ch === 'e' || ch === 'E') {
            string += ch;
            nextCh();
            if (ch === '-' || ch === '+') {
                string += ch;
                nextCh();
            }
            while (ch >= '0' && ch <= '9') {
                string += ch;
                nextCh();
            }
        }

        // parseFloat takes a single argument (there is no radix parameter)
        number = parseFloat(string);

        // Check that number is valid
        if (!isFinite(number)) {
            throw new SyntaxError('Bad number');
        } else {
            return number;
        }
    };

    var stringParse = function() {
        var hex, i, unicodeValue;
        var string = '';

        if (ch === '"') {
            while (nextCh()) {
                if (ch === '"') {
                    nextCh();
                    return string;
                }

                // Parse escaped characters
                if (ch === '\\') {
                    nextCh();
                    if (ch === 'u') {
                        unicodeValue = 0;
                        for (i = 0; i < 4; i += 1) {
                            hex = parseInt(nextCh(), 16);
                            if (!isFinite(hex)) {
                                break;
                            }
                            unicodeValue = unicodeValue * 16 + hex;
                        }
                        string += String.fromCharCode(unicodeValue);
                    } else if (typeof escapee[ch] === 'string') {
                        string += escapee[ch];
                    } else {
                        break;
                    }
                } else {
                    string += ch;
                }
            }
        }
        throw new SyntaxError('Bad string');
    };

    // Remove whitespace between characters
    var compressWhitespace = function() {
        while (ch && ch <= ' ') {
            nextCh();
        }
    };

    var booleanParse = function() {
        switch (ch) {
        case 't':
            nextCh('t');
            nextCh('r');
            nextCh('u');
            nextCh('e');
            return true;
        case 'f':
            nextCh('f');
            nextCh('a');
            nextCh('l');
            nextCh('s');
            nextCh('e');
            return false;
        case 'n':
            nextCh('n');
            nextCh('u');
            nextCh('l');
            nextCh('l');
            return null;
        }
        throw new SyntaxError('Unexpected "' + ch + '"');
    };

    var arrayParse = function() {
        var array = [];

        if (ch === '[') {
            nextCh('[');
            compressWhitespace();
            if (ch === ']') {
                nextCh(']');
                // Array is empty
                return array;
            }
            while (ch) {
                array.push(initiateParse());
                compressWhitespace();
                if (ch === ']') {
                    nextCh(']');
                    return array;
                }
                nextCh(',');
                compressWhitespace();
            }
        }
        throw new SyntaxError('Bad array');
    };

    var objectParse = function() {
        var key;
        var object = {};

        if (ch === '{') {
            nextCh('{');
            compressWhitespace();
            if (ch === '}') {
                nextCh('}');
                // Object is empty
                return object;
            }
            while (ch) {
                key = stringParse();
                compressWhitespace();
                nextCh(':');
                if (Object.hasOwnProperty.call(object, key)) {
                    throw new SyntaxError('Duplicate key "' + key + '"');
                }
                object[key] = initiateParse();
                compressWhitespace();
                if (ch === '}') {
                    nextCh('}');
                    return object;
                }
                nextCh(',');
                compressWhitespace();
            }
        }
        throw new SyntaxError('Bad object');
    };

    // Initiate parsing of JSON string
    var initiateParse = function() {
        compressWhitespace();
        switch (ch) {
        case '{':
            return objectParse();
        case '[':
            return arrayParse();
        case '"':
            return stringParse();
        case '-':
            return numberParse();
        default:
            return ch >= '0' && ch <= '9' ? numberParse() : booleanParse();
        }
    };

    var result = initiateParse();
    compressWhitespace();
    if (ch) {
        throw new SyntaxError('Syntax error');
    }

    return result;
};