Mechanical Turk for Data Cleaning
When faced with a large volume of data that needs to be checked, reformatted, compared, transformed, or otherwise modified, a company may consider using Amazon’s Mechanical Turk (MTurk) system. MTurk data cleaning projects, when executed carefully, can save a company tremendous time, capital, and human resources. What follows is a general description of the strategies for designing a good data cleaning project with Mechanical Turk.
Mechanical Turk is a large marketplace for human intelligence, or in a manner of speaking, artificial artificial intelligence. Individuals or businesses (known as Requesters) submit work requests (known as Human Intelligence Tasks, or HITs) for human workers (known formally as Providers and informally as Turkers) to complete. These HITs are typically broken down into short, single-record operations, such as checking one value against a reference, reformatting a field, or comparing two versions of a record.
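Programmatically, a Requester submits each HIT through the MTurk API, most commonly via boto3's `create_hit` call. The sketch below assembles the request for one single-record task; the title, reward, timeouts, and form contents are illustrative placeholders, not values from any real project.

```python
# Sketch of building one single-record HIT request for boto3's MTurk client.
# All titles, rewards, and form contents below are illustrative placeholders.

def build_hit_request(title, description, reward, form_html, frame_height=450):
    """Assemble the keyword arguments for mturk.create_hit()."""
    # MTurk accepts the worker-facing form as an HTMLQuestion XML envelope.
    question_xml = (
        '<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">'
        f"<HTMLContent><![CDATA[{form_html}]]></HTMLContent>"
        f"<FrameHeight>{frame_height}</FrameHeight>"
        "</HTMLQuestion>"
    )
    return {
        "Title": title,
        "Description": description,
        "Reward": reward,                    # a string, in US dollars
        "MaxAssignments": 1,                 # one worker per record
        "AssignmentDurationInSeconds": 600,  # time allowed to finish one HIT
        "LifetimeInSeconds": 86400,          # how long the HIT stays listed
        "Question": question_xml,
    }

params = build_hit_request(
    title="Reformat one customer record",
    description="Rewrite the address field in a standard format.",
    reward="0.05",
    form_html="<p>Record: 123 main st springfeild</p><input name='fixed'/>",
)
# To submit for real (requires AWS credentials and the boto3 package):
#   import boto3
#   mturk = boto3.client("mturk", region_name="us-east-1")
#   mturk.create_hit(**params)
```

In practice you would loop over your data set, rendering one record into `form_html` per HIT.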
Turkers are paid per HIT through a marketplace where Turkers can choose HITs to work on from a variety of Requesters. Turkers tend to choose HITs that pay a good rate for a task’s perceived complexity and completion time.
A good MTurk project starts with a lot of planning, including prototypes, revisions, and a solid understanding of the project at hand and its required resources. The project designers should consider every modification that could apply to a single record of data, along with all of that record’s dependencies and any metadata relevant to the transformation. Designers should also put themselves in the Turker’s shoes and consider which data contains technical or industry jargon, abbreviations, or other coded information, especially when that information itself needs to be transformed.
All of these requirements should be drafted into a document of instructions that any layperson can follow to perform the modification. Test the instructions by applying them to a sample of records from the data set to see how well they work. Like any large information technology project, this initial draft of the instructions should be presented to the stakeholders responsible for the success of the project. Encourage these stakeholders to perform the instructions on the same sample set so you can compare results and gather valuable feedback. Amend the instructions as needed and retest. Also like any large information technology project, expect requirements to change or grow in this process. Keep retesting until the requirements stabilize.
Testing against MTurk
Once the requirements are reasonably well understood, it’s time to test your sample against MTurk itself with a small campaign.
Use the same sample set to generate comparable results, and make special note of how well or poorly Turkers understood your instructions, how they handled jargon within the data, and their overall task performance. It’s generally good practice not to penalize Turkers within the system if they did not perform to your expectations (more on this later). If Turkers don’t complete your HITs within a few hours, increase your fees or consider whether a single HIT is too complex to finish quickly.
From these MTurk-generated results, designers should have a much better idea of what to expect in terms of quality, cost, and speed. Result quality depends mainly on the quality of the instructions and the complexity of the task. As before, further modification of the instructions and plenty of retesting may be required to get them just right. Though large projects may attract additional attention, the labor pool, while enormous, is also finite, and it fluctuates with the usual business hours (though plenty of night owls complete tasks at night). In general, you can expect thousands of records to be completed per day.
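A back-of-the-envelope estimate of cost and calendar time can be sketched as follows. The 20% platform fee and the daily throughput figure here are assumptions to verify against current MTurk pricing and your own pilot results, not fixed numbers.

```python
import math

def estimate_cost(n_records, reward_per_hit, assignments_per_record=1,
                  fee_rate=0.20):
    """Total spend: worker pay plus Amazon's platform fee.

    fee_rate of 20% is an assumption; check MTurk's current pricing
    (the fee is higher for HITs with many assignments).
    """
    worker_pay = n_records * assignments_per_record * reward_per_hit
    return round(worker_pay * (1 + fee_rate), 2)

def estimate_days(n_records, records_per_day=2000):
    """Rough calendar time, assuming a few thousand completions per day."""
    return math.ceil(n_records / records_per_day)

print(estimate_cost(10_000, 0.05))  # 600.0 dollars
print(estimate_days(10_000))        # 5 days
```

Running the pilot campaign first gives you a measured `records_per_day` to plug in instead of the guess.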
It’s important to have a good understanding of Turkers and their expectations, community, and culture. Turkers need a good rating to continue working on many projects, and a rejection or bad rating can seriously penalize and set back someone who’s trying their best. If your instructions are poor, talk to your Turkers through email and use their feedback to improve them.
Besides being bad manners, using Amazon’s rejection tool as a second-pass data-cleaning method will turn Turkers away from your work and earn you bad ratings on third-party sites like Turkopticon, or even rants on Reddit. You should approve the vast majority (generally more than 95%) of HITs and reject only Turkers who are very clearly abusing the system and putting in no effort at all.
If data quality is important to your project, and you have the budget for the additional linear costs, you can use MTurk again to drive quality higher. The simplest approach is to run more campaigns and generate multiple results for the same data set; if your team has the time, it can then select the best result for each record. Alternatively, if you have the money, you can design a second campaign that feeds these results back into MTurk and lets other Turkers audit the work of their peers. In a similar fashion, you can have Turkers simply apply a 5-star rating or a pass/fail evaluation to a single set of results.
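One common way to combine the redundant results from multiple campaigns is a per-record majority vote, with ties routed to manual review or an audit campaign. This is a minimal sketch; the normalization step (trimming and lowercasing) is an assumption that would depend on your data.

```python
from collections import Counter

def consensus(answers):
    """Pick the majority answer from redundant Turker submissions.

    Returns None when there is no clear winner, so the record can be
    routed to manual review or to a follow-up audit campaign.
    """
    if not answers:
        return None
    # Normalize lightly so trivial variations still count as agreement.
    counts = Counter(a.strip().lower() for a in answers)
    top = counts.most_common(2)
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]
    return None  # tie between the leading answers

print(consensus(["Acme Corp", "acme corp ", "ACME Inc."]))  # acme corp
print(consensus(["yes", "no"]))                             # None
```

The same shape works for the pass/fail audit pass: feed each record's ratings in and keep only the records whose consensus is "pass".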
Although your data quality can get quite good using these methods, never expect it to be perfect, or to exceed the quality of your instructions or overall project design.
It’s clear that it takes a fair bit of preparation and work to get an MTurk campaign right, but the investment pays dividends in project timeliness, reduced costs, and repeatability. Data cleaning should be a standing business process rather than a painful chore in any organization. If you’re interested in starting a data cleaning project, we can help. Contact OpEx Digital Consulting today to discuss your options and begin the planning process.