In large data warehouses, data duplication is inevitable, as millions of records are gathered at very short intervals.
Goals may be defined at all levels of the enterprise, and doing so may aid acceptance of the processes by those who will use them. Some goals include:
- Increasing consistency and confidence in decision making
- Decreasing the risk of regulatory fines
- Improving data security, including defining and verifying the requirements for data distribution policies
- Maximizing the income generation potential of data
- Designating accountability for information quality
- Enabling better planning by supervisory staff
- Minimizing or eliminating re-work
- Optimizing staff effectiveness
- Establishing process performance baselines to enable improvement efforts
- Acknowledging and holding all gains
A customer with an inventory of more than 17 million items, around 2 million of which are active per day, was collecting data from different business units and different software programs. Some of this data was written manually in data sheets or entered from legacy software. After being collected into one formatted table and before being imported into the main supply chain system, some of the data was frequently rejected. As a result, the business lost control over the total value of the inventory and over the items available for business consumption.
Some of the data problems were:
– Items with null values in mandatory fields were rejected by the import process.
– Items that went above or below the Min/Max limits were rejected, due to manual entry errors.
– Wrong data types were converted to zeros or nulls, e.g. wrong date formats or decimal values entered into integer fields.
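The three problem types above can be caught before import with a small pre-check. The following is a minimal sketch, not the customer's actual import process: the field names (`item_code`, `description`, `quantity`, `received_on`), the quantity limits, and the date format are all illustrative assumptions.

```python
from datetime import datetime

# Hypothetical schema: field names, limits and date format are assumptions.
MANDATORY_FIELDS = ("item_code", "description", "quantity")
QUANTITY_LIMITS = (0, 100_000)  # assumed Min/Max for the quantity field
DATE_FORMAT = "%Y-%m-%d"        # assumed expected date format


def validate_item(record):
    """Return a list of problems found in one item record (empty if clean)."""
    problems = []

    # Mandatory fields must be present and non-empty.
    for field in MANDATORY_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing mandatory field: {field}")

    # Quantity must be a whole number inside the Min/Max limits.
    quantity = record.get("quantity")
    if quantity is not None and str(quantity).strip() != "":
        try:
            number = float(quantity)
        except (TypeError, ValueError):
            problems.append("quantity is not numeric")
        else:
            if not number.is_integer():
                problems.append("quantity must be a whole number")
            elif not QUANTITY_LIMITS[0] <= number <= QUANTITY_LIMITS[1]:
                problems.append("quantity outside Min/Max limits")

    # Optional date field must match the expected format.
    received = record.get("received_on")
    if received:
        try:
            datetime.strptime(str(received), DATE_FORMAT)
        except ValueError:
            problems.append("received_on has wrong date format")

    return problems
```

Reporting the problem instead of silently converting a bad value to zero or null keeps the original entry available for correction.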
Supply chain systems are efficient only when they can provide the right information about the best available resources at the lowest price. As inventory item groups grow to 100K+ items, data duplication becomes inevitable. It creates redundant records for the same item with different costing. It can even produce wrong reports, showing an item as missing while it is in the store. One of our customers, with a massive number of items in inventory, reported that items were being defined under many formats and names, which forced them to buy an item frequently while it was already in stock. A very apparent example was the truck tyres required for their large goods transportation fleet. Tyres had been named and entered many times for the same brand and model, which pushed the overhead cost to an unreasonable level.
Examples of data duplication were:
Tyre 16″, Tubeless 16″, Truck Tyre 16 inches, Rubber Track 16″ Tire. All of these represented one item but existed in the inventory under multiple names and quantities.
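One possible way to surface such duplicates is to reduce each free-text name to a comparison key and group names that share a key. The sketch below is an assumption-laden illustration rather than the method used for this customer: the synonym table `CATEGORY_KEYWORDS` and the size pattern are hypothetical and would need to be built from the real item catalogue.

```python
import re
from collections import defaultdict

# Hypothetical synonym table mapping words seen in item names to one category.
CATEGORY_KEYWORDS = {"tyre": "tyre", "tire": "tyre", "tubeless": "tyre"}

# A size is a number followed by an inch marker (″, " or the word inch/inches).
SIZE_PATTERN = re.compile(r'(\d+)\s*(?:″|"|inch(?:es)?)', re.IGNORECASE)


def duplicate_key(name):
    """Reduce a free-text item name to (category, size), or None if unknown."""
    words = re.findall(r"[a-z]+", name.lower())
    category = next(
        (CATEGORY_KEYWORDS[w] for w in words if w in CATEGORY_KEYWORDS), None
    )
    size = SIZE_PATTERN.search(name)
    if category is None or size is None:
        return None
    return (category, int(size.group(1)))


def group_duplicates(names):
    """Group names sharing a key; keep only groups with more than one name."""
    groups = defaultdict(list)
    for name in names:
        key = duplicate_key(name)
        if key is not None:
            groups[key].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

With the four names listed above, `group_duplicates` places them all under the single key `("tyre", 16)`. The resulting groups are candidates for manual review, not confirmed duplicates, since a keyword match alone cannot tell brands or models apart.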
ERP and data entry software are essential assets for running the business. Legacy software responds very slowly to new business demands for data validation. Changing the software may also require a long cycle of analysis and purchasing. That leads to logical data corruption, which affects total cost and the correctness of asset records. Khwarizm has helped customers implement new business rules on their existing legacy software. New validation rules can be run online at data entry or in regular batches against resident data. Filtering can isolate suspect data into separate areas for manual validation. One of our customers reported that he can pick out wrong or manipulated data within just a few minutes of data entry and act on such incidents promptly.
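The batch mode described above, where suspect rows are set aside for manual validation, can be sketched as follows. This is a generic illustration and not Khwarizm's implementation: `partition_batch`, the `RULES` table, and the field names `item_code` and `unit_cost` are hypothetical.

```python
def partition_batch(records, rules):
    """Split records into accepted rows and quarantined (row, failed rules) pairs."""
    accepted, quarantined = [], []
    for record in records:
        # Collect the name of every rule this record fails.
        failed = [name for name, check in rules.items() if not check(record)]
        if failed:
            quarantined.append((record, failed))
        else:
            accepted.append(record)
    return accepted, quarantined


# Hypothetical business rules: each maps a rule name to a predicate on a record.
RULES = {
    "unit_cost_is_positive": lambda r: isinstance(r.get("unit_cost"), (int, float))
    and r["unit_cost"] > 0,
    "item_code_present": lambda r: bool(r.get("item_code")),
}
```

Keeping the names of the failed rules next to each quarantined row tells the reviewer why the row was held back, and new rules can be added to the table without touching the legacy application.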
In the data warehousing community, the task of finding duplicated records within large databases has long been a persistent problem and has become an area of active research. Many research undertakings have addressed the problems caused by duplicate contamination of data.
Some extracts are from:
** (Extracted from http://www.learn.geekinterview.com/data-warehouse/dw-basics/what-is-data-duplication.html)