Reid Main

Verifying the integrity of recovered data

I wrote previously about how I recovered data from my failed RAID but I skimped on the details of how I ensured that my photos and videos were recovered properly.

The first step was to copy the recovered data off of the drive that I had provided to the data recovery service. Let's assume I copied the recovered photos to /Users/reid/Recovered Data/Photos1 and /Users/reid/Recovered Data/Photos2.

Step two was to ensure that the files in the two directories were identical and, if they weren't, to figure out which of the two versions was worth keeping. I wrote a small Ruby script that iterated over every file in the first directory and used the diff utility to compare it against the file at the same path in the second directory.

#!/usr/bin/env ruby

if __FILE__ == $PROGRAM_NAME
    Dir.glob("/Users/reid/Recovered Data/Photos1/**/*", File::FNM_DOTMATCH) do |child|
        if File.file?(child)
            if child.include?(".DS_Store")
                next
            end

            file1 = child
            file2 = child.sub("Photos1", "Photos2")

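            # Pass the paths as separate arguments so that spaces and quotes
            # in file names never reach the shell. diff exits non-zero when
            # the files differ or when file2 is missing.
            identical = system("diff", "-q", file1, file2, out: File::NULL)
            unless identical
                puts "open \"#{file1}\" \"#{file2}\""
            end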
        end
    end
end

For my use case I made the script output open file1 file2 so that I could quickly run the command in the terminal and open up the two files that differed. I was expecting to only need to use this for images so the open command would simply open up both images in Preview.
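If shelling out to diff is not a requirement, the same byte-for-byte comparison can be sketched in pure Ruby with FileUtils.compare_file from the standard library. This is only a sketch under the assumption that both directories share the same layout; the helper name differing_files is hypothetical and not part of the script above.

```ruby
require "fileutils"

# Hypothetical helper: returns [file1, file2] pairs whose contents differ,
# or whose counterpart in dir2 is missing entirely.
def differing_files(dir1, dir2)
    pairs = []
    Dir.glob("**/*", File::FNM_DOTMATCH, base: dir1).sort.each do |relative|
        file1 = File.join(dir1, relative)
        next unless File.file?(file1)
        next if File.basename(file1) == ".DS_Store"

        file2 = File.join(dir2, relative)
        # compare_file reads both files and returns true only when they match.
        same = File.file?(file2) && FileUtils.compare_file(file1, file2)
        pairs << [file1, file2] unless same
    end
    pairs
end

# Example usage, printing the same "open" commands as the script above:
# differing_files("/Users/reid/Recovered Data/Photos1",
#                 "/Users/reid/Recovered Data/Photos2").each do |file1, file2|
#     puts "open \"#{file1}\" \"#{file2}\""
# end
```

Building file2 with File.join on the relative path, rather than substituting "Photos1" in the full path, avoids surprises if that string appears elsewhere in the path.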

The problem with this script is that it only ensures Photos2 is a superset of Photos1. I needed to run the script in the other direction to see if there were any files in Photos2 that did not exist in Photos1 and, lo and behold, there were. This was probably unavoidable because the drives were in such bad shape. They were seven years old for fuck's sake! I am once again so thankful that William at Lazarus Data Recovery did a second pass on my photos directory to maximize my chances of getting all of my data back.
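Rather than running the comparison twice, the files unique to each directory can be found in one pass by comparing sets of relative paths. The sketch below assumes that approach; relative_files and unique_files are hypothetical helper names, and only file names are compared here, not contents.

```ruby
require "set"

# Hypothetical helper: relative paths of every regular file under dir,
# ignoring .DS_Store files.
def relative_files(dir)
    Dir.glob("**/*", File::FNM_DOTMATCH, base: dir)
       .select { |relative| File.file?(File.join(dir, relative)) }
       .reject { |relative| File.basename(relative) == ".DS_Store" }
       .to_set
end

# Returns the files unique to each side as [only_in_dir1, only_in_dir2].
def unique_files(dir1, dir2)
    files1 = relative_files(dir1)
    files2 = relative_files(dir2)
    [(files1 - files2).sort, (files2 - files1).sort]
end

# Example usage:
# only1, only2 = unique_files("/Users/reid/Recovered Data/Photos1",
#                             "/Users/reid/Recovered Data/Photos2")
# only2.each { |relative| puts "Only in Photos2: #{relative}" }
```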

I merged the contents of /Users/reid/Recovered Data/Photos1 and /Users/reid/Recovered Data/Photos2 into a new directory called /Users/reid/Recovered Data/Photos. Now I had a directory where all of the "good" files should be and could move on to step three. What if a file was corrupted in both of the directories? I had over 9000 photos and videos, so it would be impossible to check them all by hand. I wrote another Ruby script that iterates over all of the contents of a directory and, based on the extension of each file, runs some logic to verify the integrity of the file.

#!/usr/bin/env ruby

require "set"

if __FILE__ == $PROGRAM_NAME
    image_types = Set[".jpg", ".png"]
    video_types = Set[".mov", ".avi", ".mp4"]

    Dir.glob("/Users/reid/Recovered Data/Photos/**/*") do |child|
        if File.file?(child)
            # Downcase so that ".JPG" and ".jpg" are treated the same.
            filetype = File.extname(child).downcase

            if filetype.empty?
                next
            elsif filetype == ".jpg"
                if `jpeginfo -c \"#{child}\"`.include?("[OK]") == false
                    puts "Corrupted jpg: #{child}"
                end
            elsif video_types.include?(filetype)
                if `ffprobe -v quiet -show_error -i \"#{child}\"`.empty? == false
                    puts "Corrupted video: #{child}"
                end
            elsif image_types.include?(filetype)
                if `identify -verbose \"#{child}\" 2>&1 >/dev/null`.empty? == false
                    puts "Corrupted image: #{child}"
                end
            else
                puts "Untested file: #{child}"
            end
        end
    end
end

The first edge case is for files with no extension. Examples would be dotfiles or Apple metadata files. These are files that we don't care about verifying and can simply skip.

The first filetype to verify is jpg. I discovered a tool called Jpeginfo that does exactly this. The -c parameter checks the file for errors, and if the output does not contain "[OK]" the file is treated as corrupted.

The next filetypes to verify are video files. mov, avi and mp4 are the ones that I encountered. I knew about FFmpeg, and after reading through its documentation I came across the ffprobe tool. It attempts to gather information about video files, and after some quick tests I realized that it was successfully identifying corrupted videos. The -v quiet parameter first silences all output and then the addition of the -show_error parameter ensures only error information is output.
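One caveat with backticks is that the command goes through the shell, so a file name containing a double quote or a $ can break the quoting. A minimal sketch of the same ffprobe check that avoids the shell, using Open3.capture2 from the standard library, might look like this. The helper names stdout_of and video_corrupted? are hypothetical, and ffprobe is assumed to be on the PATH.

```ruby
require "open3"

# Hypothetical helper: runs a command without going through the shell and
# returns whatever it wrote to stdout. Arguments are passed as separate
# strings, so spaces or quotes in a file name need no escaping.
def stdout_of(*command)
    stdout, _status = Open3.capture2(*command)
    stdout
end

# Mirrors the check in the script above: any stdout output from
# `ffprobe -v quiet -show_error` is treated as a sign of corruption.
def video_corrupted?(path)
    !stdout_of("ffprobe", "-v", "quiet", "-show_error", "-i", path).empty?
end

# Example usage:
# puts "Corrupted video: clip.mov" if video_corrupted?("clip.mov")
```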

The final filetypes to verify are image files. jpg and png are the only two I had. ImageMagick is one of the perennial open source image manipulation programs and its identify tool allows you to check the integrity of image files. I was already checking jpgs with Jpeginfo, so in the script identify only ever runs on pngs, but in theory this tool works for jpgs as well. You need to use the -verbose parameter to ensure all possible data is read and output (including error data). The redirection 2>&1 >/dev/null first points stderr at the output that the script captures and then discards stdout, so only the stderr text comes back.
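The order of those two redirections is easy to get wrong, so here is a sketch of an alternative that keeps the streams separate in Ruby instead. Open3.capture3 returns stdout and stderr as two strings, and only the second one is inspected. The helper names stderr_of and image_corrupted? are hypothetical, and identify is assumed to be on the PATH.

```ruby
require "open3"

# Hypothetical helper: returns only what the command wrote to stderr,
# which is what `2>&1 >/dev/null` achieves in the shell.
def stderr_of(*command)
    _stdout, stderr, _status = Open3.capture3(*command)
    stderr
end

# Mirrors the check in the script above: any stderr output from
# `identify -verbose` is treated as a sign of corruption.
def image_corrupted?(path)
    !stderr_of("identify", "-verbose", path).empty?
end

# Example usage:
# puts "Corrupted image: scan.png" if image_corrupted?("scan.png")
```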

The final edge case is any other filetype that was not checked. A warning is simply printed to the console indicating that the file was not checked and you probably should find a tool that would let you verify it.
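The branching described above can also be expressed as a lookup table, which makes it easier to see at a glance which extensions are covered and to add new ones. This is a sketch only: CHECK_FOR_EXTENSION and check_for are hypothetical names, and downcasing the extension is an assumption so that files like IMG_0001.JPG are not reported as untested.

```ruby
# Hypothetical lookup table: which check applies to which extension.
CHECK_FOR_EXTENSION = {
    ".jpg" => :jpeginfo,
    ".png" => :identify,
    ".mov" => :ffprobe,
    ".avi" => :ffprobe,
    ".mp4" => :ffprobe
}.freeze

# Returns :skip for files without an extension, the name of the check for
# known types, and :untested for everything else.
def check_for(path)
    extension = File.extname(path).downcase
    return :skip if extension.empty?

    CHECK_FOR_EXTENSION.fetch(extension, :untested)
end

# Example usage:
# Dir.glob("/Users/reid/Recovered Data/Photos/**/*") do |child|
#     puts "Untested file: #{child}" if File.file?(child) && check_for(child) == :untested
# end
```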

All of the tools I mentioned were installed using Homebrew. They are all available on Linux as well so you could easily write these scripts and run them on your RAID to ensure that none of your photos and videos are being corrupted.

And that is it. I lucked out and none of my photos or videos were corrupted in both of the recovered directories. There were a couple of differences that did indicate corruption, but luckily the sister file in the other directory was fine. Photos2 had more files than Photos1 and luckily they were not corrupted, but unfortunately this leads me to believe that not all of my photos and videos were recovered. That is a likely outcome when attempting to recover over 9000 photos and videos totaling over 200 GB. However, all of the important photos and videos that I wanted were recovered, and if something was lost that I can't remember, does it really matter that I lost it? I'm going to go with the "ignorance is bliss" approach and just be thankful that every important picture and video of my grandfather's 80th birthday was recovered.