Handling duplicate data posts faster

There are many ways to try to prevent data from being sent more than once from a form; typically we use cookies, sessions, and redirects. And somehow we still end up with duplicate data in the database. So we add extra measures on the server side, run extra queries against the database to see if the record already exists before inserting, and so on.

And we end up cluttering the original code with more double-checking logic and running extra queries against the database. If the server side is a web service where performance is important, we have just added more bottlenecks.

In this example we will consider a website that posts XML data to another server. The receiving server processes the XML file and creates a record in the database. Key values like "first name" and "last name" can occur in multiple records, just not as a complete duplicate record.

Concept

Since performance is important we need to determine early on whether the incoming data is a duplicate, and if so drop it and free up resources to handle the next request. Ideally we would do this even before loading the database layer and other components that would normally process the incoming data.

figure.1

Using MD5 and empty files

MD5 is a good way to compute a unique fingerprint of data, and for this example we assume that the payloads being received are not huge files. You could replace MD5 with something like CRC32 or another algorithm that might be faster to calculate.
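
PHP's built-in hashing functions make it easy to try both; a quick sketch (whether CRC32 is actually faster depends on your data and platform, so measure before switching):

   $data = file_get_contents($_FILES["file"]["tmp_name"]);

   // MD5 gives a 32-character fingerprint with a very low collision risk.
   $fingerprint = md5($data);

   // CRC32 gives only 8 characters and is cheaper to compute, but it
   // collides far more often, so weigh speed against false duplicates.
   $fingerprint = hash("crc32b", $data);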

Since we do not want to touch the database yet, we will use the file system instead, and simply create empty files where the filename is the MD5 sum of data we have already processed. Now we only need to check whether such a file exists to determine if the record is a duplicate, and we run a separate process through cron/scheduled tasks to do maintenance on the MD5 files.

figure.2

For the example below we create a folder called /last24h/ where we will store the MD5 sum files; this just keeps things clean and easy to maintain.
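
Creating the folder up front (or on first use) is a one-liner; a minimal sketch, assuming the web server user has write access to the parent directory:

   // Create the fingerprint folder if it does not exist yet.
   if (!is_dir("/last24h"))
   {
      mkdir("/last24h", 0700, true);
   }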

Example

The simplified PHP example below illustrates how MD5 is used to check whether an incoming data file has already been processed within the last 24 hours (this example uses a file upload for simplicity; if the posted data were not a "file", there would be no disk reads).

class WebService
{
   public function __construct($httpmethod)
   { }

   public function get($arguments)
   { }

   public function post($arguments)
   {
      // A data file has been uploaded..
      $data=file_get_contents($_FILES["file"]["tmp_name"]);
      $md5sum=md5($data);

      if(file_exists("/last24h/".$md5sum))
      {
         // A file with the same name as the MD5 sum exists,
         // which means this job has been processed before:
         // drop it without touching the database.
         return false;
      }

      // Load components and do the processing here..

      // ..and as the final step, create an empty file using
      // the MD5 sum as the filename to mark the job as done.
      $status=file_put_contents("/last24h/".$md5sum,"");
      return true;
   }

   public function put($arguments)
   { }

   public function delete($arguments)
   { }
}
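
To tie it together, a dispatcher along these lines could route requests to the class (this wiring is an assumption for illustration; the article does not show how WebService is invoked):

   $method=strtolower($_SERVER["REQUEST_METHOD"]);
   $service=new WebService($method);

   switch($method)
   {
      case "get":
         $service->get($_GET);
         break;
      case "post":
         $service->post($_POST);
         break;
      // put/delete would be handled the same way..
   }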

Cleaning up

This is a task that should be started by cron on Linux, or a scheduled task on Windows, and is responsible for deleting MD5 sum files that have expired. For this example we used 24 hours as the maximum age, but this, and how frequently the cleanup runs, depends on how much traffic the service receives. A sample crontab entry is shown after the script below.

(Notice that before deleting a file we check that the length of its name is 32 characters; if it is not, it cannot be an MD5 sum.)

Example in PHP: scheduled cleanup script (using the built-in glob() in place of a custom directory-listing helper):

   // Files older than this cutoff have expired.
   $maxtime=strtotime("-1 day");

   foreach(glob("/last24h/*") as $filepath)
   {
      $filename=basename($filepath);
      $filetime=filemtime($filepath);

      // Only delete files that look like MD5 sums (32 characters)
      // and whose modification time is older than the cutoff.
      $flag_delete=true;
      if(strlen($filename)!=32) $flag_delete=false;
      if($filetime>$maxtime) $flag_delete=false;

      if($flag_delete===true)
      {
         $status=unlink($filepath);
      }
   }
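
Assuming the script is saved as, say, /usr/local/bin/cleanup_md5.php (a hypothetical path), an hourly crontab entry could look like this:

   0 * * * * php /usr/local/bin/cleanup_md5.php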

Final Notes

If the directory is accessed frequently, one would assume it gets cached by the operating system. Determining whether a file exists is a simple lookup in the directory structure, and no file data is actually read; this should leave us with pretty good performance!

We could of course take this up a notch if we were really hardcore about performance, and simply keep all the MD5 sum files on a RAM disk. If we are talking about an extremely high number of requests, this is probably the way to go!
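
On Linux this could be as simple as mounting a tmpfs over the folder (the size here is illustrative, not a recommendation):

   mount -t tmpfs -o size=64m tmpfs /last24h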

(This technique was used on a system that gathered leads submitted from other sites. The MD5 sum is really a pretty good guarantee that no duplicates slip through. A couple of times I had complaints about duplicates in the system, which shouldn't happen! It turned out that people had submitted twice because they had typed their email or phone number wrong the first time, and, not being able to edit, they just submitted another one with a minor difference!)
