Steganography in USB mass storage devices

I have always feel a deep interest in steganography, the art of concealing information inside information. Not surprisingly, steganography is probably as old as cryptography with the first recorded use of it done by Demaratus, king of Sparta.

With the advent of the digital era and the massive data sizes we handle, steganography is being a very important field for researchers. Most of the more common digital steganography is done in image files because they are a very good container, being the LSB (Less significant bit) replacement the simplest technique, which allows to hide information inside a image with a minor alteration that can’t be noted by the human eye.

But steganography using digital images is not the only way of hide digital information. There are steganography in, practically, every corner of the digital word, even in TCP/IP packets! I am going to describe some ways to conceal information in USB mass storage devices that use FAT32 (most of them). These techniques allow to hide information that is not going to be shown by regular file explorers (except the last example which is probably not a truly steganography scheme but I thought it was a cool addition) and they are not going to delete any other user data. Although described here in a briefly and basic manner, they can be used in sophisticated schemes to hide data.

1) Information in the Boot sector
The first sector is called Master Boot Record (MBR), it contains information about different partitions and, more importantly for our purposes, executable code to be used when the unit is bootable.

master_boot_record (1)

If the USB drive is not bootable, that part of the first sector is not used, those are 446 bytes to store information that is not going to be readable from a common file explorer.

2) EOF (End of file) occultation
Files are stored in one or more clusters (each of them usually of 4k, but they can be up to 32k or as small as a single sector) conveniently linked in the FAT. If we have, for example, a single file that weighs 1024 bytes, it would be stored in a single cluster with the remaining 3072 bytes not available for any other allocation. The remaining bytes are a good place to hide information.

raw bmp
Example of a bmp file that only requires three sectors of a 8-sector cluster to be stored, leaving most of the last sector used and five more available to store more information.

3) Information encoded in the FAT
As previously mentioned, files are composed by a one or more clusters that are referred by a 32-bit (28 bits used) number. Those clusters can be located through the FAT (File Allocation Table), that contains a linked list of cluster entries for each file.

Simple example of how two files and their clusters could be located in a portion of the FAT.

As shown in the previous figure, cluster entries doesn’t need to be contiguously allocated. Systems tend to allocate those entries contiguously for efficiency but in this case we are going to handle the allocation in a different way. We are going to put every cluster entry in well-thought distance from one of them that will be the reference. This way we can assign values to different distances, coding information on those distances!

A single byte coded in a file of nine clusters

In this simple example a single byte is coded using the Modulo-2 operator in every distance from the first cluster. Giving these results:

Bit7 = Mod2(d1) = 0
Bit6 = Mod2(d2) = 1
Bit5 = Mod2(d3) = 0
Bit4 = Mod2(d4) = 1
Bit3 = Mod2(d5) = 1
Bit2 = Mod2(d6) = 0
Bit1 = Mod2(d7) = 1
Bit0 = Mod2(d8) = 0

Different Modulo operations can be done to code more info, for example Modulo-256, that would code more info in fewer cluster entries but the file will be more fragmented.

Eight bytes coded in a eight file using a Mod-256 operator on the distance between cluster entries. Note that, in this examples, distances are measured from the previous cluster. If one of the distances needed for coding one byte is occupied by other file, the next valid distance will be dn + 255, next dn + 255 * 2 … with a generalized form of dn + 255 * k for k= 1, 2, 3…

Applying those techniques to a real file of 256Mb, which using a cluster size of 4k will use 65536 clusters, will achieve 8191 bytes of hidden information using Mod-2 operator or 65535 bytes using the Mod-256 operator. This technique sacrifices efficiency in order to get better concealment of the information. Besides that, more  schemes can be done like reducing the cluster size, using a reduced alphabet if possible, compressing the concealed information with the Huffman algorithm or even using more advances techniques that I don’t really can grasp 🙂

A potential problem that could arise would be an auto-defragmention of the media by the operating system but usb drives almost always use flash technology which, in the first place, doesn’t gain as much as the old hard drives from defragmentation and also discourages write operations because write cycles are not unlimited. For those reasons is not probable that a modern OS would perform an auto-defragmentation of a USB flash drive device.

4) Using the Short File Name legacy feature
If you were on the computers during the 80’s and early 90’s you will remember that file names in the different FAT systems were limited to the 8.3 filenames format. With the development of the FAT32 systems, support for longer names was added without ignoring retro-compatibility.

The content of a folder with both the long names and the short names listed. When a long name is needed the short name is a really a hash that identifies uniquely the file.

If  we have a file named “preguntas_frecuentes.pdf” windows will generate the short name “PREGUN~1.PDF”. If we create a second file named “preguntas_frecuentes_2.pdf” the short name will be PREGUN~2.PDF “.

During development of Pincho, a USB Mass Storage Library for Android, for the sake of simplicity, I refused the mess of having to generate well-built short filenames (I shouldn’t let pass the chance to present you this amazing article about 8.3 filename and its checksum ) adding a long file entry in every case, even if the file don’t really need it. I found some surprises:

1) Even if a long filename entry is added before the normal file entry Windows would ignore it if the name length is less or equal than eight. A long filename entry with the name “FiLeS” will be ignored and show as “FILE” in the file explorer.
2) Short filenames doesn’t need to be created from the long filename.
3) In order to force Windows to read the long filenames, the short filenames must be interpreted by the OS as truncated.

A folder written by Pincho library. The short names are created randomly except the tilde character in the 7th position and the last number in order to force Windows to treat them as truncated

The first six characters of the short name could be used to store information, creating for example 30 files would allow to store 180 bytes of information. Short filenames are not listed by default but it is possible using the “dir \x” command. In order to use this approach in a useful way cryptography should be used too but many people doesn’t seem aware of this little legacy feature that still survives 🙂

If I come up with other schemes that would be interesting I will post them here.

Happy crafting! 🙂

(1) Adding a cache system in Pincho library. Probably the simplest cache

Usb Mass Storage devices usually use the FAT32 filesystem, which it kind of surprising due to how old FAT32 is but it lasts because FAT32 is Win, OSX and Linux compatible and not very difficult to implement in embedded devices.

FAT32 solved some limitations from previous FAT versions as limited file names and file sizes but still uses, as it names implies, a file allocation table to find the cluster that define a file.

FAT32 uses a linked list to allocate clusters for a given file as shown here. Clusters entries doesn’t need to be adjacent and one entry can point to a previous entry.

That simplicity comes with a cost, a Θ(n) cost to be exact, because you must traverse the FAT to find a empty cluster. For this reason FAT32 and other linked-list filesystems does not scale very well.

I was aware of this while developing Pincho but it shocked me when I started the first tests. While read operations performs pretty decent, write operations using a naive approach are painfully slow and it gets worse when the USB mass storage device gets crowded with files.

A extremely best case scenario of an allocation of three files that require three clusters.Although this portion of the FAT looks empty from the last inserted file that can’t be supposed in a real scenario because fragmentation. Allocation is this scenario will be Θ(3n), note that allocation is linear, as previously shown but a constant appears, that is equal to the number of clusters the file will use. In a fragmented device, allocation will be even slower

The way to fight with this issue is through caching. Caching is a world itself, with a lot of algorithms and approximations. An aggressive caching will use a lot of memory to store the state of most of the FAT to minimize the I/O operations but Android heap is limited, also consistency must be guaranteed between data in the cache and data in the device and that adds some complexity.

Pincho library is still in kind of experimental state so I would like to try different caching approaches. The first I implemented is probably the simplest, not by far the best, but it gets some improvements. It is simple and it has a small memory footprint.

This implementation assumes correctly that what we want to minimize the number of I/O operations. Storage devices are usually accessed in blocks, called sectors, of 512 bytes addressed by a 32-bit direction called Logic block addressing (LBA). Sectors that compose the FAT can store 128 cluster entries. Those sectors with enough free entries (What is enough must be decided, If sectors with just a few free entries are cached it may not be a good speed improvement) are stored in an Array, that guarantees that an I/O operation on the FAT will return at least some of the free clusters needed to store a file, avoiding traversing the filled parts of the FAT.
I have defined three levels, the low cache just cache enough space to store 100 mb of data, the medium cache try to cache good sectors of half of the cache and the high cache good sectors of the whole FAT. The main difference between these modes is the time needed to perform the cache.

It is an improvement but there is a lot of room for more improvements. For the next releases I have planned to add more sophisticated caching systems.

Happy coding!