Novel methodology to automatically filter relevant parts from documents

December 13, 2016

Research Matters

Photo: Siddharth Kankaria/ Research Matters

Product designers are responsible for ensuring that the products they design go into production without issues. Snippets of “knowledge” are available in historic production documents, shop floor records, case studies and similar sources, both offline and online, and can give early insight into potential issues. However, a major drawback is that this knowledge is hard to identify because it is scattered across many fragmented sources. Now, researchers at the Centre for Product Design and Manufacturing, Indian Institute of Science, Bangalore, Mr. N. Madhusudanan, Prof. Amaresh Chakrabarti and Prof. B. Gurumoorthy, have developed a method for automatically recovering relevant information from document collections. They validated this methodology in the context of aircraft assembly.

Knowledge about problems in assembly or manufacturing that is generated at different stages of a product lifecycle can be used in the design or planning phase of other products. Documents that record such problems help designers foresee potential issues in later stages of the product lifecycle and seek possible solutions. However, extracting relevant information from these documents can be extremely tedious and time-consuming, as most are structured as case studies and not as knowledge bases. Also, not all information in a given document may be relevant to the topic of interest.

Previous methods that help designers sort documents typically do not “understand” the text, or they require large amounts of earlier data that has been labelled in advance, which is often not available. “These assume that the documents are tagged with keywords at the time of writing. Unfortunately, such kind of tagged collection is not available for specific scenarios like aircraft assembly”, says Mr. Madhusudanan. The methodology proposed by the researchers at IISc is based on “understanding” the document contents, rather than just parsing the text. This enables automatic extraction of the sections of a document that are relevant.

The proposed solution works on a collection of well-written, rigorously reviewed documents that represent the collective knowledge of a number of experts. Most of these documents are treated as one-way communication, or discourse, from the author to the reader, following a hierarchical, organised structure. There are two steps in the solution: segregation and classification. Segregation identifies chunks of text that talk about the same topic. Closely related collections of sentences are called coherent chunks. The method scans these sentences and forms a list of the entities discussed in them. A lexical database for English, called WordNet, was used here. In the next step, classification, the list of entities is checked to see whether they are related to the domain of interest. The chunk is classified as “related” if the similarity between the entities in the chunk and the domain of interest is above a certain threshold; otherwise it is “unrelated”. The similarity is computed on the basis of one or more domain ontologies.
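The classification step can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the toy ontology, its scores, the averaging rule, the function names and the threshold of 0.5 are all assumptions made for illustration, standing in for the domain ontologies and similarity measure the article mentions.

```python
# Hypothetical toy "ontology": each known domain term carries a relevance
# score in [0, 1]. A real system would consult one or more domain ontologies.
TOY_ONTOLOGY = {
    "rivet": 1.0,
    "fuselage": 1.0,
    "wing": 0.9,
    "fastener": 0.9,
    "drill": 0.7,
    "panel": 0.6,
}


def chunk_similarity(entities, ontology):
    """Average ontology score of a chunk's entities (unknown entities score 0)."""
    if not entities:
        return 0.0
    total = sum(ontology.get(entity.lower(), 0.0) for entity in entities)
    return total / len(entities)


def classify_chunk(entities, ontology=TOY_ONTOLOGY, threshold=0.5):
    """Label a chunk "related" if its similarity is above the threshold."""
    if chunk_similarity(entities, ontology) > threshold:
        return "related"
    return "unrelated"


# Entities that would come out of the segregation step (made-up examples).
print(classify_chunk(["rivet", "fuselage", "drill"]))
print(classify_chunk(["budget", "meeting", "panel"]))
```

With these made-up scores, the first chunk averages 0.9 and is labelled “related”, while the second averages 0.2 and is labelled “unrelated”. Where the threshold is set decides how strict the filter is.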

A major challenge during the segregation process is the frequent use of pronouns to avoid repetition. The researchers used existing software that reads through the text, identifies the word to which each pronoun refers, and replaces the pronoun with that word. This is called anaphora resolution. The method then analyses whether adjacent sentences form a chunk, based on the words they contain. If the words are closely related, the sentences belong to the same chunk. The researchers compared the results of this automated segregation with those of manual segregation and found that the automated segregation had an accuracy of 75%. Validated against manual classification, the automated classification had an accuracy of 85%.
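The idea of grouping adjacent sentences into chunks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hand-made groups of related words stand in for a lexical resource such as WordNet, the stopword list and function names are hypothetical, and the input sentences are assumed to have already gone through anaphora resolution.

```python
import re

# Hypothetical groups of closely related words, standing in for WordNet.
RELATED_GROUPS = [
    {"rivet", "rivets", "fastener", "fasteners", "hole", "holes"},
    {"invoice", "payment", "budget", "cost"},
]

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "be", "must",
             "of", "to", "in", "and", "it"}


def content_words(sentence):
    """Lower-cased words of a sentence, without stopwords."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {word for word in words if word not in STOPWORDS}


def are_related(words_a, words_b):
    """True if the sentences share a word or both touch the same word group."""
    if words_a & words_b:
        return True
    return any(bool(group & words_a) and bool(group & words_b)
               for group in RELATED_GROUPS)


def segregate(sentences):
    """Group adjacent sentences into chunks when their words are related."""
    chunks = []
    for sentence in sentences:
        previous = chunks[-1][-1] if chunks else None
        if previous is not None and are_related(content_words(previous),
                                                content_words(sentence)):
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


# Made-up sentences, with pronouns already replaced by their referents.
SENTENCES = [
    "The rivets were misaligned.",
    "The holes must be drilled again.",
    "The budget was approved.",
    "The payment is due in March.",
]

for chunk in segregate(SENTENCES):
    print(chunk)
```

On these four sentences the sketch yields two chunks of two sentences each: one about rivets and holes, one about budget and payment. Comparing each sentence only with its immediate predecessor is a simplification chosen to keep the example short.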

Two key challenges affect the accuracy of the solution: the lack of domain-specific words in general English lexicons, and the ambiguity caused by the same words being used in different contexts. Further methods for disambiguation, and greater availability of domain-specific resources, might help mitigate them. This research is a pioneering step towards improving the design process. “The use of expert knowledge from one phase of a product's lifecycle in another phase is expected to prevent/reduce the occurrence of the same problems. Hopefully this will make assembly less difficult, less dangerous, and more efficient”, concludes Mr. Madhusudanan.