Literature Review for OER Retrieval and Classification Systems
Interoperability is the promise of exchange, and it requires agreement on the particular methods of exchange. Without such agreement, interoperability is difficult to achieve. In the context of Open Educational Resources (OER), interoperability largely comes down to metadata schemas and taxonomies, for which there is no single standard or agreed-upon guideline.
Statistical inferences derived from machine learning algorithms can help bridge the gap between the untenable goal of adherence to one standard and the reality of multiple metadata schemas and multiple taxonomies. Aggregating and displaying similar OER from disparate repositories, despite different metadata schemas and taxonomies, would achieve the goal of interoperability without requiring strict compliance with a universally accepted standard. The goal of this brief review is both to survey previous work on OER retrieval systems and to determine an appropriate data-mining technique for building a web-based OER recommender system that classifies OER into relevant taxonomies using metadata exposed via RESTful APIs from different institutional repositories.
The review examines strategies that improved the accuracy of data-mining results, including which classification algorithms were used and which text-processing methods were most effective.
Keywords – Support Vector Machine, K-Nearest Neighbor, Naive Bayes, OERs, Retrieval of OERs, Text-Mining, Algorithms, Data-Mining
OER describes a type of copyrightable work that is licensed with a Creative Commons license. Similar to open source software, the licensing creates a legal framework that promotes retaining, reusing, revising, remixing and redistributing content. Precisely because sharing, copying and redistributing are easy and encouraged, OER can be found in many places. As a consequence, a central challenge is one of access: OER are hard to find unless one already knows where to look.
Many OER are currently hosted in institutional repositories which rely on metadata for both interoperability and discovery [1]. Since there is no agreed-upon universal schema, the promise of interoperability is limited to the chance that two or more institutions happen to share the same one. Even if there were a universal metadata schema, multiple ontologies would still exist, further complicating efforts to bring similar resources together. Inconsistencies between schemas and ontologies compound the very real problem of finding relevant OER. The question of how to amalgamate resources with heterogeneous metadata schemas across distributed repositories looks to data-mining for potential solutions.
Overview of challenges
The inability to find relevant, quality OER ranks as one of the most significant barriers to increasing the adoption of OER [2]. Though generic search engines are frequently used by academics to look for OER, they are ineffective at delivering results that could be useful for educational purposes [3]. To aid the discovery of OER, which are often hosted in institutional repositories, attention has been paid to extending and applying metadata schemas including Dublin Core, LOM, LRE and others [4]. Metadata schemas can be used to express rich taxonomies, which can then be used to group resources, making them easier to find so long as the user knows which repositories to visit.
Some solutions to finding OER under these constraints implement metadata mapping, though problems arise when a metadata field in one schema has no corresponding entry in another. This problem is referred to as the one-to-none matching situation. The problem compounds when inconsistencies exist between taxonomies or ontologies. Resources relevant to the category ‘Computer Science’ can also reasonably be categorized under ‘Information Systems’, ‘Computer and Information Systems’, or ‘Data Science’, to name a few. When each repository uses a different taxonomy, it becomes challenging to present relevant search results without manually mapping one ontology to another.
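The one-to-none situation can be illustrated with a minimal sketch. The crosswalk below, its field names and the helper map_record are hypothetical and chosen only for illustration; they are not taken from any of the reviewed systems or from a published schema crosswalk.

```python
# Hypothetical crosswalk from Dublin Core-style field names to LOM-style
# field names. The entries are illustrative assumptions, not a published
# crosswalk; None marks a field with no counterpart (one-to-none).
DC_TO_LOM = {
    "title": "general.title",
    "description": "general.description",
    "subject": "general.keyword",
    "rights": "rights.description",
    "coverage": None,
}


def map_record(record, crosswalk):
    """Translate a record's fields and collect those that cannot be mapped."""
    mapped, unmapped = {}, []
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            # Either the field is unknown or it has no counterpart.
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, sorted(unmapped)


record = {
    "title": "Intro to Algorithms",
    "subject": "Computer Science",
    "coverage": "Undergraduate",
}
mapped, unmapped = map_record(record, DC_TO_LOM)
print(mapped)
print(unmapped)
```

Any field that ends up in the unmapped list is lost to the aggregator unless someone extends the crosswalk by hand, which is the maintenance cost that motivates the classification approaches reviewed in the following section.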
To overcome the challenge of ontological inconsistency, some approaches look to Linked Open Data (LOD) [5]. The principles behind LOD are aligned with the activities of searching, organizing and retrieving OER. Notwithstanding the limitations and difficulties described by Vallejo-Figueroa et al., an LOD approach could supplement a text-mining application and may inform a future direction. Other studies that attempted to overcome category heterogeneity used clustering to extract a list of automatically generated keywords.
Review of past approaches
Predicting which category a resource belongs to, independent of how it was manually categorized in a given repository, can be achieved with a machine learning classifier. Mouriño-García et al. looked at previous work and approaches to classifying resources with different taxonomies and noted the limitations of taxonomy mapping. Some of the research cited in that paper makes an assertion about the quality of various classification algorithms in the context of automatic text categorization [6]. Based on the findings presented by Rigutini et al., the most successful algorithms for text categorization are the Support Vector Machine (SVM) and Naive Bayes.
Mouriño-García et al. use an SVM algorithm in building the Cross Repository Open Educational Resources Aggregator (CROERA) system, with some degree of success. The web-based CROERA system aggregates resources from three different OER repositories, each with a different metadata schema. Using a programming language specifically designed for web applications, CROERA scrapes the title, keywords, description and category from the metadata of a resource and turns them into a bag-of-words (BoW). Different weights are then applied to each metadata item to help improve the accuracy of the classifier; however, the values of the weights need to be adjusted depending on the repository. The size of these repositories made it difficult to find enough computational resources to train the classifier on all of the documents, which led to sub-optimal classification. Because training took a long time, classification results are stored until the metadata is updated, which reduces the cost of re-classification.
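The idea of a field-weighted BoW can be sketched as follows. This is not the CROERA implementation: the weight values, the tokenizer and the function names are assumptions made for illustration, and the weights in particular are placeholders of the kind that would need tuning per repository.

```python
import re
from collections import Counter

# Hypothetical per-field weights (placeholders, not values from CROERA).
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "description": 1.0, "category": 2.0}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weighted_bow(metadata, weights):
    """Build a bag-of-words in which every occurrence of a term adds the
    weight of the metadata field it came from."""
    bow = Counter()
    for field, weight in weights.items():
        # Missing fields contribute nothing.
        for token in tokenize(metadata.get(field, "")):
            bow[token] += weight
    return bow


resource = {
    "title": "Linear Algebra Basics",
    "keywords": "matrices, vectors",
    "description": "An introduction to vectors and matrices in linear algebra.",
}
bow = weighted_bow(resource, FIELD_WEIGHTS)
print(bow["linear"], bow["vectors"], bow["introduction"])
```

In this sketch a term that appears in the title counts three times as much as the same term in the description, so a short, curated field can outweigh a long free-text one.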
A comparative analysis of classification techniques looked at the performance of four algorithms: Naive Bayes, K-Nearest Neighbor (KNN), Decision Tree and Neural Networks [7]. Naive Bayes achieved the highest accuracy and the lowest error rate, and was therefore judged to perform best when matching relevant OER with specific learning objectives.
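For readers unfamiliar with Naive Bayes, the following is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is a generic textbook formulation written for this review, not the implementation used in the cited study; the class name and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.term_counts = defaultdict(Counter)   # term counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-made training data (already tokenized).
train_docs = [
    ["sorting", "algorithms", "python"],
    ["databases", "sql", "queries"],
    ["cells", "genetics", "evolution"],
    ["ecology", "cells", "organisms"],
]
train_labels = ["Computer Science", "Computer Science", "Biology", "Biology"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["genetics", "cells"]))
print(model.predict(["sql", "python"]))
```

Because the model only needs term counts per category, it is cheap to train, which is one practical reason it is often used as a baseline for text categorization.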
Technology frameworks have been proposed as solutions to the problem of organizing OER, using clustering techniques to produce a list of automatically generated keywords. Though classification was not the primary focus of that research, its approach to text-mining is worth noting: formatting and punctuation were removed prior to extracting the individual words, the text was tokenized, term frequency-inverse document frequency (TF-IDF) was employed to penalize overly frequent words, and a stop-word list was used. Abeywardena et al. incorporated a novel approach to improving the relevance (and therefore usefulness) of a searchable item by considering, among other things, the level of openness inherent in the type of CC license attached to a resource. Understanding relevance in a context greater than content similarity is helpful in informing future challenges.
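The pre-processing steps described above can be sketched in a few lines. This is a minimal illustration using one common TF-IDF formulation (term frequency normalized by document length, and idf = log(N / document frequency)); the stop-word list, the function names and the sample corpus are assumptions, and the reviewed systems may have used different variants.

```python
import math
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for"}


def preprocess(text):
    """Strip punctuation, lowercase, tokenize and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(corpus):
    """Return one {term: weight} dict per document."""
    docs = [preprocess(text) for text in corpus]
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    "Introduction to the course: sorting algorithms.",
    "Introduction to the course: cell biology!",
    "Introduction to the course: organic chemistry",
]
weights = tf_idf(corpus)
print(weights[0])
```

In this corpus the words "introduction" and "course" occur in every document, so their weight is zero, while terms specific to one document, such as "sorting", receive a positive weight.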
In order to assess the performance of classification algorithms, metrics such as Precision, Recall and F1 Score can be used. Precision measures the fraction of retrieved documents that are relevant (that is, correctly predicted), while Recall measures the fraction of relevant documents that were retrieved. The ideal value for both Precision and Recall is 100%. F1 Score combines Precision and Recall as their harmonic mean, giving a more global indication of performance.
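A small worked example may make these definitions concrete. The function and the document identifiers below are hypothetical; the formulas are the standard set-based definitions of the three metrics.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute Precision, Recall and F1 from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of Precision and Recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


retrieved = ["oer1", "oer2", "oer3", "oer4"]          # returned by the system
relevant = ["oer1", "oer2", "oer3", "oer5", "oer6"]   # truly relevant

p, r, f1 = precision_recall_f1(retrieved, relevant)
print(round(p, 3), round(r, 3), round(f1, 3))  # prints: 0.75 0.6 0.667
```

Here three of the four retrieved resources are relevant (Precision 75%), but only three of the five relevant resources were found (Recall 60%), and the F1 Score sits between the two.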
Summary of the status of the field
The classification algorithms that have been most successful for OER in the reviewed literature are SVM and Naive Bayes. Employing TF-IDF after cleaning and tokenizing the corpus into individual words is a common approach to improving classification accuracy. Precision, Recall and F1 Score are recognized as reliable metrics for measuring the performance of a classifier. Scraping metadata fields to form a BoW can be an effective way to mine resources for relevance. As with the CROERA system, using a programming language specifically designed for web applications makes sense when both the resources and the aggregator system live on the web.
Description of future challenges
Relevance is subjective and, as a measure of appropriateness, takes on vastly different meanings in different contexts. Content similarity is one way to understand relevance and is the metric by which text-mining applications assess appropriateness. However, if more can be determined about the context of a search, such as who is searching and what they want to do with the resource, attributes such as specific CC licenses could also inform a notion of relevance and be implemented through the assignment of weights. Other measures of relevance, such as the educational context or instructional need, could also play a role in improving appropriateness.
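One way such weighting could work is sketched below. Everything in it is an assumption made for illustration: the openness scores, the linear blending formula and the function names are hypothetical and are not taken from the reviewed literature.

```python
# Hypothetical openness scores (1.0 = most permissive). The values and their
# ordering are illustrative assumptions, not a published ranking.
LICENSE_OPENNESS = {
    "CC BY": 1.0,
    "CC BY-SA": 0.8,
    "CC BY-NC": 0.6,
    "CC BY-NC-SA": 0.5,
    "CC BY-ND": 0.3,
    "CC BY-NC-ND": 0.2,
}


def relevance(similarity, license_name, license_weight=0.3):
    """Blend content similarity with license openness (both in [0, 1])."""
    openness = LICENSE_OPENNESS.get(license_name, 0.0)
    return (1 - license_weight) * similarity + license_weight * openness


def rank(results, license_weight=0.3):
    """Sort search results by blended relevance, highest first."""
    return sorted(
        results,
        key=lambda r: relevance(r["similarity"], r["license"], license_weight),
        reverse=True,
    )


results = [
    {"id": "a", "similarity": 0.80, "license": "CC BY-NC-ND"},
    {"id": "b", "similarity": 0.75, "license": "CC BY"},
]
print([r["id"] for r in rank(results)])
print([r["id"] for r in rank(results, license_weight=0.0)])
```

With the license weight set to zero the ranking falls back to content similarity alone; with a non-zero weight, a slightly less similar but more openly licensed resource can rank first, which may suit a user who intends to remix the material.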
Linked Open Data also presents some interesting opportunities to participate in and contribute to a web architecture designed explicitly for sharing, organizing and linking similar entities.
Previous approaches to text-mining OER have attempted to determine which classification algorithm is the most accurate, how best to pre-process text and which quantification techniques are the most effective. An ideal OER retrieval system presents relevant resources despite differences in their taxonomies or metadata schemas. Data-mining, and in particular text-mining using an SVM algorithm, has been shown to be a viable alternative to metadata mapping or to relying on a universal metadata standard. Both SVM and Naive Bayes are classification algorithms that have demonstrated promising levels of accuracy in previous OER retrieval applications. Text pre-processing methods such as removing punctuation and other noise, such as formatting, are straightforward steps that can improve the accuracy of the results. Using a stop-word list and TF-IDF can also improve results.
[1] G. Santos-Hermosa, N. Ferran-Ferrer, and E. Abadal, “Repositories of Open Educational Resources: An Assessment of Reuse and Educational Aspects,” Int. Rev. Res. Open Distance Learn., vol. 18, no. 5, pp. 84–120, Aug. 2017.
[2] D. Prasad and T. Usagawa, “Towards development of OER derived custom-built open textbooks: A baseline survey of university teachers at the University of the South Pacific,” Int. Rev. Res. Open Distrib. Learn., vol. 15, no. 4, Aug. 2014.
[3] I. S. Abeywardena, C. S. Chan, and C. Y. Tham, “OERScout Technology Framework: A Novel Approach to Open Educational Resources Search,” Int. Rev. Res. Open Distance Learn., vol. 14, no. 4, pp. 214–237, Oct. 2013.
[4] M. Mouriño-García, R. Pérez-Rodríguez, L. Anido-Rifón, M. J. Fernández-Iglesias, and V. M. Darriba-Bilbao, “Cross-repository aggregation of educational resources,” Comput. Educ., vol. 117, pp. 31–49, Feb. 2018.
[5] S. Vallejo-Figueroa, M. Rodríguez-Artacho, M. Castro-Gil, and E. S. Cristóbal, “Using text mining and linked open data to assist the mashup of educational resources,” in 2018 IEEE Global Engineering Education Conference (EDUCON), 2018, pp. 1606–1611.
[6] L. Rigutini et al., “An EM based training algorithm for cross-language text categorization,” in Proc. 2005 IEEE/WIC/ACM Int. Conf. Web Intell. (WI'05), 2005, p. 529.
[7] A. S. Sabitha, D. Mehrotra, A. Bansal, and B. K. Sharma, “A naive bayes approach for converging learning objects with open educational resources,” Educ. Inf. Technol., vol. 21, no. 6, pp. 1753–1767, Nov. 2016.