barbarian meets coding

WebDev, UX & a Pinch of Fantasy

Data Structures

This article is part of my personal wiki where I write personal notes while I am learning new technologies. You are welcome to use it for your own learning!

Linked Lists

From Wikipedia.

A linked list is a data structure consisting of a group of nodes which together represent a sequence. Under the simplest form, each node is composed of data and a reference (in other words, a link) to the next node in the sequence; more complex variants add additional links. This structure allows for efficient insertion or removal of elements from any position in the sequence.

Linked lists are among the simplest and most common data structures. They can be used to implement several other common abstract data types, including lists (the abstract data type), stacks, queues, associative arrays, and S-expressions, though it is not uncommon to implement the other data structures directly without using a list as the basis of implementation.

The principal benefit of a linked list over a conventional array is that the list elements can easily be inserted or removed without reallocation or reorganization of the entire structure because the data items need not be stored contiguously in memory or on disk, while an array has to be declared in the source code, before compiling and running the program. Linked lists allow insertion and removal of nodes at any point in the list, and can do so with a constant number of operations if the link previous to the link being added or removed is maintained during list traversal.

On the other hand, simple linked lists by themselves do not allow random access to the data, or any form of efficient indexing. Thus, many basic operations may require sequential scanning of most or all of the list elements ( operations such as obtaining the last node of the list – assuming that the last node is not maintained as separate node reference in the list structure –, or finding a node that contains a given datum, or locating the place where a new node should be inserted). The advantages and disadvantages of using linked lists are given below:

  • Advantages
    • Linked lists are a dynamic data structure, allocating the needed memory while the program is running.
    • Insertion and deletion node operations are easily implemented in a linked list.
    • Linear data structures such as stacks and queues are easily executed with a linked list.
    • They can reduce access time and may expand in real time without memory overhead.
  • Disadvantages
    • They have a tendency to use more memory due to pointers requiring extra storage space. (data + pointer)
    • Nodes in a linked list must be read in order from the beginning as linked lists are inherently sequential access.
    • Nodes are stored incontiguously, greatly increasing the time required to access individual elements within the list.
    • Difficulties arise in linked lists when it comes to reverse traversing. For instance, singly linked lists are cumbersome to navigate backwards and while doubly linked lists are somewhat easier to read, memory is wasted in allocating space for a back pointer. (data + pointer + back pointer)

Stacks

A stack or LIFO (last in, first out) is an abstract data type that serves as a collection of elements, with two principal operations: push, which adds an element to the collection, and pop, which removes the last element that was added.

The term LIFO stems from the fact that, using these operations, each element “popped off” a stack in series of pushes and pops is the last (most recent) element that was “pushed into” within the sequence. This is equivalent to the requirement that, considered as a linear data structure, or more abstractly a sequential collection, the push and pop operations occur only at one end of the structure, referred to as the top of the stack. (Additionally, a peek operation may give access to the top.)

A stack may be implemented to have a bounded capacity. If the stack is full and does not contain enough space to accept an entity to be pushed, the stack is then considered to be in an overflow state. The pop operation removes an item from the top of the stack. A pop either reveals previously concealed items or results in an empty stack, but, if the stack is empty, it goes into underflow state, which means no items are present in stack to be removed.

A stack is a restricted data structure, because only a small number of operations are performed on it. The nature of the pop and push operations also means that stack elements have a natural order. Elements are removed from the stack in the reverse order to the order of their addition. Therefore, the lower elements are those that have been on the stack the longest.

Stack Applications

Expression evaluation and syntax parsing

Calculators employing reverse Polish notation use a stack structure to hold values. Expressions can be represented in prefix, postfix or infix notations and conversion from one form to another may be accomplished using a stack. Many compilers use a stack for parsing the syntax of expressions, program blocks etc. before translating into low level code. Most programming languages are context-free languages, allowing them to be parsed with stack based machines.

Backtracking (undo)

Consider a simple example of finding the correct path in a maze. There are a series of points, from the starting point to the destination. We start from one point. To reach the final destination, there are several paths. Suppose we choose a random path. After following a certain path, we realise that the path we have chosen is wrong. So we need to find a way by which we can return to the beginning of that path. This can be done with the use of stacks. With the help of stacks, we remember the point where we have reached. This is done by pushing that point into the stack. In case we end up on the wrong path, we can pop the last point from the stack and thus return to the last point and continue our quest to find the right path. This is called backtracking.

Runtime memory management

A number of programming languages are stack-oriented, meaning they define most basic operations (adding two numbers, printing a character) as taking their arguments from the stack, and placing any return values back on the stack. For example, PostScript has a return stack and an operand stack, and also has a graphics state stack and a dictionary stack. Many virtual machines are also stack-oriented, including the p-code machine and the Java Virtual Machine.

Almost all calling convention – the ways in which subroutines receive their parameters and return results – use a special stack (the “call stack”) to hold information about procedure/function calling and nesting in order to switch to the context of the called function and restore to the caller function when the calling finishes. The functions follow a runtime protocol between caller and callee to save arguments and return value on the stack. Stacks are an important way of supporting nested or recursive function calls. This type of stack is used implicitly by the compiler to support CALL and RETURN statements (or their equivalents) and is not manipulated directly by the programmer.

Some programming languages use the stack to store data that is local to a procedure. Space for local data items is allocated from the stack when the procedure is entered, and is deallocated when the procedure exits. The C programming language is typically implemented in this way. Using the same stack for both data and procedure calls has important security implications (see below) of which a programmer must be aware in order to avoid introducing serious security bugs into a program.

Queues

From Wikipedia

A queue is a particular kind of abstract data type or collection in which the entities in the collection are kept in order and the principal (or only) operations on the collection are the addition of entities to the rear terminal position, known as enqueue, and removal of entities from the front terminal position, known as dequeue. This makes the queue a First-In-First-Out (FIFO) data structure. In a FIFO data structure, the first element added to the queue will be the first one to be removed. This is equivalent to the requirement that once a new element is added, all elements that were added before have to be removed before the new element can be removed. Often a peek or front operation is also entered, returning the value of the front element without dequeuing it. A queue is an example of a linear data structure, or more abstractly a sequential collection.

Queues provide services in computer science, transport, and operations research where various entities such as data, objects, persons, or events are stored and held to be processed later. In these contexts, the queue performs the function of a buffer.

Queues are common in computer programs, where they are implemented as data structures coupled with access routines, as an abstract data structure or in object-oriented languages as classes. Common implementations are circular buffers and linked lists.

Priority Queues

A priority queue is an abstract data type which is like a regular queue or stack data structure, but where additionally each element has a “priority” associated with it. In a priority queue, an element with high priority is served before an element with low priority. If two elements have the same priority, they are served according to their order in the queue.

While priority queues are often implemented with heaps, they are conceptually distinct from heaps. A priority queue is an abstract concept like “a list” or “a map”; just as a list can be implemented with a linked list or an array, a priority queue can be implemented with a heap or a variety of other methods such as an unordered array.

Hash Tables

See Wikipedia (Associative Array) See Wikipedia (Hash Table)

Hash tables are data structures in the family of associative arrays. They store key/value pairs in an array, where the key doesn’t need to be an integer. The keys must be unique. Keys within a hash table are mapped into a specific integer so they can be stored within an array.

1
2
int index = Hash(item.key);
hash[index] = item;

The Hash function returns a fixed size result from an arbitrary input. The hashing algorithm used within the function must have the following characteristics:

  • Deterministic: The same input generates the same output every time
  • Uniform: The hash value should be uniformly distributed through the available space (determine by the fixed size mentioned above)
  • Efficient in time and space
  • Secure: The cost of turning a hash into the original data must be prohibitive

Popular use cases of hash tables are:

* Store key/value pairs :)
* Count items with the same value (use value as key, and store the count as value)

What happens If You Use A Smaller Array to Store Hashed Values?

You can use an array to store the values of a hash table of a finite size smaller than the size of the hash. In that case you need to use the % modulus operator to get the value of the index within the array:

1
2
int hash = Hash(item.key);
int index = hash % arraySize;

How Can You Handle Collisions?

Sooner or later you’re going to have more than one value that result in the same hash which will result in a collision. Collisions must be transparent to the user of the hash table so we need to devise a strategy to handle collisions. Two common strategies are:

  • Open addressing, or moving a specific value to the next index in the hash table.
  • Chaining, which is to use a linked list within within each array position. This allows us to store multiple values that result in the same hash.

The collision likelihood is based on how many positions available in the hash table and how many items are stored within it. The ratio of filled hash slots to empty slots is known as the load factor or fill factor. Oftentimes this load factor will decide whether or not we need to grow the array used to store values in the hash table. Growing the array will consist in allocating a bigger array and copying all items from the former array into the new array.

Trees

Data structure which stores data in a tree structure. A tree is composed of nodes that contain a value and pointers to other nodes. A tree has one unique root node, and each child node can be seen as the root of its own subtree. A node pointing to another node form a parent child relationship. Generic tree nodes can have more than one child.

Binary Tree

Tree in which each node has 0, 1 or 2 children.

Binary Search Tree

Binary tree in which nodes are sorted in the following manner. For each node, all nodes to the left are smaller than the current node, and all nodes to the right are equal or larger than the current node.

One of the great advantages of BST over other data structures is that you only need to look at a subset of the items within a tree to find a specific item. Whereas in a linked list you need to look at all items (particularly in the case where an item is not in the tree/linked-list).

It is very useful for sorting stuff and keeping stuff in a sorted order. Traversing a binary search tree in-order results in all items within the tree being processed in order. Traversing a tree in pre-order and post-order is useful when implementing compilers and interpreters.

Tree Traversal Algorithms

These are some common algorithms to traverse a BST:

  • pre-order: visit(n){process(n.value), visit(n.left), visit(n.right)}
  • in-order: visit(n){ visit(n.left), process(n.value), visit(n.right)}
  • post-order: visit(n){visit(n.left), visit(n.right), process(n.value)}

Unbalanced Binary Trees

In the worst case scenario, an unbalanced tree can become a singly linked list. In this case, we lose the most important features that make using binary search trees attractive. An example of unbalanced tree occurs when we add increasing values to a binary search tree (1, 2, 4, 5, 6), since all of them would be addle as right children. In this case the performance of searching would be the same of a linked list, that is O(n).

Balanced Binary Trees

Balanced trees remain balanced as nodes are inserted or deleted keeping the O(logn) search performance. A tree will be balanced by executing a balancing algorithm whenever the height of the left and right subtrees differ by more a fixed value (for instance 1).

AVL Tree (Adelson-Velsky & Landis)

An AVL is a self-balancing tree. They are similar to binary search trees but they provide different algorithms for insertion and deletion in order to keep the tree balanced.

Most of the balancing algorithms belong to the AVL Node and not te AVL tree itself. An AVL node has, in addition to the common features of a tree node, the following characteristics:

  • It has a reference to the parent node
  • Node height (left and right height)
  • Balance factor or difference between right and left heights. If it is positive it means that the tree is right heavy, if it is negative it means that the tree is left heavy.

Tree Balancing Algorithms

The balancing of a tree is performed via node rotation. The rotation occurs at the insertion or deletion point and is repeated for all parents until the root.

Rotation algorithms:

  • Right Heavy Tree:
    • Right Child is left heavy:
      • Left-Right rotation
    • else:
      • Left rotation
  • Left Heavy Tree:
    • Left Child is right heavy
      • Right-Left rotation
    • else
      • Right rotation
Right Rotation

Algorithm that rotates a node and its children to the right:

  1. Left child becomes the new root
  2. Right child of new root is assigned to left child of old root
  3. Previous root becomes the new root’s right child
Left Rotation

Red-Black Tree

Heaps

From Wikipedia

A heap is a specialized tree-based data structure that satisfies the heap property: If A is a parent node of B then the key of node A is ordered with respect to the key of node B with the same ordering applying across the heap. A heap can be classified further as either a “max heap” or a “min heap”. In a max heap, the keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node. In a min heap, the keys of parent nodes are less than or equal to those of the children and the lowest key is in the root node. Heaps are crucial in several efficient graph algorithms such as Dijkstra’s algorithm, and in the sorting algorithm heapsort. A common implementation of a heap is the binary heap, in which the tree is a complete binary tree.

In a heap, the highest (or lowest) priority element is always stored at the root, hence the name heap. A heap is not a sorted structure and can be regarded as partially ordered. As visible from the heap-diagram, there is no particular relationship among nodes on any given level, even among the siblings. When a heap is a complete binary tree, it has a smallest possible height—a heap with N nodes always has log N height. A heap is a useful data structure when you need to remove the object with the highest (or lowest) priority.

Note that, as shown in the graphic, there is no implied ordering between siblings or cousins and no implied sequence for an in-order traversal (as there would be in, e.g., a binary search tree). The heap relation mentioned above applies only between nodes and their parents, grandparents, etc. The maximum number of children each node can have depends on the type of heap, but in many types it is at most two, which is known as a binary heap.

The heap is one maximally efficient implementation of an abstract data type called a priority queue, and in fact priority queues are often referred to as “heaps”, regardless of how they may be implemented. Note that despite the similarity of the name “heap” to “stack” and “queue”, the latter two are abstract data types, while a heap is a specific data structure, and “priority queue” is the proper term for the abstract data type.

A heap data structure should not be confused with the heap which is a common name for the pool of memory from which dynamically allocated memory is allocated. The term was originally used only for the data structure.

Graphs

Sets

A set is a collection of unique comparable items. Items within a set cannot be duplicated.

Set methods/algorithms

Union

Union compares two sets and returns a new set that contains all unique items in both sets.

1
2
3
4
5
var setA = new HashSet<int>(1, 2, 3);
var setB = new HashSet<int>(1, 4, 5);

var setC = setA.UnionWith(setB);
// setC = [1, 2, 3, 4, 5]

Intersection

Intersection compares two sets and returns a new set that contains all unique items that exist in both sets

1
2
3
4
5
var setA = new HashSet<int>(1, 2, 3);
var setB = new HashSet<int>(1, 4, 5);

var setC = setA.Intersect(setB);
// setC = [1]

Set Difference

Set difference takes two sets and returns all the items in the first set that are not within the second set.

1
2
3
4
5
var setA = new HashSet<int>(1, 2, 3);
var setB = new HashSet<int>(1, 4, 5);

var setC = setA.Except(setB);
// setC = [2, 3]

Symmetric Difference

Symmetric difference takes two sets and returns a new set that contains all the unique items within each set that are not in the other. In summary, it is the set difference of the union and intersection of input sets.

1
2
3
4
5
6
7
8
9
10
var setA = new HashSet<int>(1, 2, 3);
var setB = new HashSet<int>(1, 4, 5);

var setC = setA.SymmetricExceptWith(setB);
// setC = [2, 3, 4, 5]

// also
var union = setA.UnionWith(setB);
var intersection = setA.Intersect(setB);
var symmetricIntersection = union.Except(intersection);

Trie Trees

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are not necessarily associated with every node. Rather, values tend only to be associated with leaves, and with some inner nodes that correspond to keys of interest. For the space-optimized presentation of prefix tree, see compact prefix tree.

In the figure keys are listed in the nodes and values below them. Each complete English word has an arbitrary integer value associated with it. A trie can be seen as a tree-shaped deterministic finite automaton. Each finite language is generated by a trie automaton, and each trie can be compressed into a deterministic acyclic finite state automaton.

Though tries are usually keyed by character strings, they need not be. The same algorithms can be adapted to serve similar functions of ordered lists of any construct, e.g. permutations on a list of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up any fixed-length binary datum, such as an integer or memory address.

Advantages

As discussed below, a trie has a number of advantages over binary search trees.[6] A trie can also be used to replace a hash table, over which it has the following advantages:

  • Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
  • There are no collisions of different keys in a trie.
  • Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
  • There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
  • A trie can provide an alphabetical ordering of the entries by key.

Drawbacks

Tries do have some drawbacks as well:

  • Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.[7]
  • Some keys, such as floating point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless, a bitwise trie can handle standard IEEE single and double format floating point numbers.
  • Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.

Summary

  • Linked lists
  • Stacks
  • Queues
  • Trees
    • Tree: A data structure composed of nodes where each node has 1 or more child nodes and each node can only have 1 parent
      • Search with BFS (Breadth-First Search) or DFS (Depth-First Search)
      • BFS: Moves entire breadth of each level. Requires additional memory to track child nodes for all nodes in a level
      • DFS: Follows one branch of the tree down as many levels as possible. Uses less memory than BFS (track less).
    • Binary Search Tree: A tree with only two child per node
      • Each node has only smaller nodes to the left and bigger or equal nodes to the right
      • Because it is sorted searching takes only O(logn)
      • Good for: Searching O(logn), Adding/Removing O(logn)
    • AVL Tree: Self-balancing tree (it never becomes a linked-list)
      • It works like a normal BST but the insertion/deletion includes a balancing algorithm
      • The height of the tree, its balance factor and whether it is left-heavy or right-heavy is taking into account
      • The height (distance from root node to deepest children) of left and right tree can differ by at most 1
      • The difference of heights is called Balance factor (Right Height – Left Height)
        • It tells you whether a tree is right heavy or left heavy (based on the sign)
      • During insertion, do as BST, if height differs more than 1, run balancing algorithm for each parent
      • During deletion, do as BST, if height differs more than 1, run balancing algorithm for each parent
      • An interesting implementation detail of the AVL tree is that each child node has a reference to the parent
      • Balancing is done through rotation. The rotation at the point of insertion/deletion and up.
      • Rotation changes the physical structure of the tree within the constraints of a BST
      • Algorithm
        • If (RightHeavyTree) { If (RightChildLeftHeavy) LeftRightRotation(); Else LeftRotation();}
        • If (LeftHeavyTree) { If (LeftChildRightHeavy) RightLeftRotation(); Else RightRotation();}
        • It is easier to understand if you visualize it
      • Right rotation: left becomes root, left.right becomes oldRoot.left, oldRoot becomes root.right
      • Left rotation: right becomes root, right.left becomes oldRoot.right, oldRoot becomes root.left
      • Right-Left rotation: left rotate left child, right rotate tree (right rotation of a left-rotated tree)
      • Left-Right rotation: right rotate right child, left rotate tree (left rotation of a right-rotated tree)
    • Red-Black Tree:
    • N-ary or Generic Tree:
    • Trie-tree: Tree used to store a dynamic set or associate array using strings as keys
      • No node in the tree stores a key. It is the position of the node that determines the key.
      • All children of a node share a common prefix. The root is empty string. Leaves usually store keys.
    • Heaps:
      • min-heap: Tree where each child of a node has a value less than or equal to the node’s own value
      • max-heap: Tree where each child of a node has a value larger than or equal to the node’s own value
      • Can find max/min in O(1). Insertion/deletion are O(logn). Lookup is O(n).
      • Often used as priority queues (always sorted in priority order)
  • HashTables
  • Graphs
  • Sets

References

Comments