This file is raw output from pdftotext and may not be ideal for distribution. If you are a maintainer for Hackipedia, please sit down when you have time and clean this text version up. Source PDF: /mnt/fw-js/docs/Algorithms/Artificial Intelligence/A Genetic Algorithm for the Induction of Natural Language Grammars.pdf Like all conversions the text below should be fully readable as UTF-8 unicode text. --------------------------------------------------------------- A Genetic Algorithm for the Induction of Natural Language Grammars Tony C. Smith and Ian H. Witten Department of Computer Science, University of Waikato, Hamilton, New Zealand Email tcs@waikato.ac.NZ; phone: +64 (7) 838–4453; fax: +64 (7) 838–4155 Abstract known to be impossible to infer a correct grammar Strict pattern-based methods of grammar induction from positive instances only (Gold 1967). are often frustrated by the apparently inexhaustible Probabilistic approaches offer a solution to this variety of novel word combinations in large corpora. problem by allowing frequent, well-formed expres- Statistical methods offer a possible solution by al- sions to statistically overwhelm the infrequent un- lowing frequent well-formed expressions to over- whelm the infrequent ungrammatical ones. They grammatical ones. More precisely, probabilities also have the desirable property of being able to for word combinations (n-grams) are derived in ac- construct robust grammars from positive instances cordance with their observed frequencies. Parsing alone. Unfortunately, the “zero-frequency” prob- from models constructed in this way can thus be lem entails assigning a small probability to all pos- made more efficient than standard Phrase Structure sible word patterns, thus ungrammatical n-grams Grammars (PSGs), which must treat all possible become as probable as unseen grammatical ones. strings as equiprobable. Examples of frequency- Further, such grammars are unable to take advan- based grammars include Probabilistic Context-Free tage of inherent lexical properties that should allow Grammars (PCFGs) (Charniak 1993), which assign infrequent words to inherit the syntactic properties a probability to each string such that the prob- of the class to which they belong. abilities of all strings sum to one, and Hidden This paper describes a genetic algorithm (GA) that Markov Models (HMMs) (Rabiner & Juang 1993; adapts a population of hypothesis grammars to- wards a more effective model of language struc- Kupiec 1989), which assign a probability to each ture. The GA is statistically sensitive in that the string such that the probabilities for all strings of a utility of frequent patterns is reinforced by the per- given length sum to one. sistence of efficient substructures. It also supports Inductive methods for constructing such gram- the view of language learning as a “bootstrapping mars derive probabilities for each of their transitions problem,” a learning domain where it appears nec- by recording frequencies for n-grams garnered from essary to simultaneously discover a set of categories the training set. As the training set will not contain and a set of rules defined over them. Results from all possible n-grams of the language, some means a number of tests indicate that the GA is a robust, for assigning probabilities to unobserved transitions fault-tolerant method for inferring grammars from must be incorporated. Solutions to this so-called positive examples of natural language. "zero-frequency problem" usually entail assigning keywords: grammar induction, genetic algorithms, sta- a small but equal probability to all unseen patterns. tistical methods, context-free grammars. The result, however, is that ungrammatical n-grams become as probable as unseen grammatical ones. Introduction A further difficulty with such models is that they Grammatical inference is the gradual construction will assign lower probabilities to transitions involv- of a correct grammar based on a finite set of sample ing infrequent words than to those involving more expressions. In general, the training set may con- common words of the same syntactic category. For tain both positive and negative examples from the this reason they are unable to take advantage of any language under study. However, in the case of nat- lexical properties known to be characteristic of the ural languages, it is widely held that children learn language. a grammar from positive instances alone, and that The extent of these limitations can be reduced parents tend not to offer overt correction during the by applying frequency analysis to the lexical cat- learning process. This proves problematic as it is egories of a tagged-text. However, recent work indicates that grammar induction is an instance of Grammar induction using genetic a "bootstrapping problem" (Finch & Chater 1992; algorithms Smith & Witten 1995), a learning domain where it appears necessary to simultaneously discover a set Genetic algorithms are an analogue of natural se- of categories and a set of rules defined over them. lection. They adapt principles of reproduction, mu- That is, lexical tagging for the purpose of inferring tation and “survival of the fittest” to evolve suc- syntactic structure should not be constrained by ex- cessively better generations of computer programs isting notions of syntactic categories—the rules and for a particular task. For grammar induction, the categories must be derived together. “chromosome” is usually a context-free grammar whose fitness is measured by its ability to cover This paper outlines a grammar induction method (i.e. parse, accept, generate, etc.) a training set that incorporates the frequency-based conditioning of sample strings. Reproduction is simulated by aspects of probabilistic techniques while supporting dissecting two disparate grammars at suitable, ran- the notion of language learning as a bootstrapping domly chosen points and joining the first half of problem. The approach uses a genetic algorithm one grammar with the second half of the other, and (GA) to adapt hypothesis grammars toward more similarly joining the remaining halves. Mutation effective models of the language under study—in occurs by changing any single randomly chosen this case, English. symbol of a grammar with another symbol taken randomly from the set of symbols for the language. Following the approach of evolutionary program- Some effort has been directed towards apply- ming (Wijkman 1994), a population of grammars is ing GAs to the inference of context-free gram- allowed to evolve for a fixed number of generations mars. Koza (Koza 1992) outlined a GA for de- before any culling decisions are made. In this way, tecting exons of length 5 within DNA. After 35 apparent shortcomings that arise from a single adap- generations, the resulting 61-node tree was 100% tation are given a chance to reveal possible longer successful at identifying all exons in a segment of term benefits. Grammars are represented as LISP 1000 neucleoitide bases. Wyard (Wyard 1991) de- AND-OR s-expressions, and evolution is achieved vised a genetic algorithm for the language of cor- using standard GA techniques for tree transforma- rectly balanced nested parentheses which success- tions (Bickel & Bickel 1987). Individual grammars fully inferred a concise correct grammar from 20 are selected for reproduction and mutation in pro- test strings in just 3 generations. Wyard also con- portion to their size. The strength of the population structed a GA for inferring the more complex lan- as a whole is measured by its ability to parse the guage of all strings containing equal numbers of training set. a’s and b’s, but was unable to produce a successful grammar due to the model becoming stuck in local The GA is statistically sensitive in that the utility maximum. of frequent patterns is reinforced during the random In developing a genetic algorithm for inferring a selection of tree nodes. The more often a given node context-free grammar description for a more com- can account for substrings within training expres- plex language such as English, aspects of prior suc- sions, the more likely it will persist as a recurring cessful GA models were retained while those of feature in subsequent generations of grammars. Its failed attempts were avoided. Koza’s successful presence in a large number of candidate grammars exon identifier represented a grammar as a logi- thereafter gives it a higher probability of continued cal s-expression using AND, OR and NOT, while selection. Wyard’s failed “a-b language” recogniser used the more traditional Chomsky Normal Form (CNF). The view that language learning is a bootstrap- Many CNF grammars trivially translate into log- ping process is supported because lexical categories ical s-expressions. For example, the CNF grammar emerge as nodes most efficient at capturing regu- ) ) S NP VP larities in the training set. For example, an inferred ) NP Det N structure in which an OR-node of nouns is followed ) VP V NP ) by an OR-node of verbs is a more efficient general- Det a ) isation of the combinatoric capacity of these words Det the ) than is an OR-node whose arguments are all pos- N dog ) sible sequences combining one noun and one verb. N cat ) By using grammar size as a measure of fitness, the V saw V bit most simple representation has the best chance of being carried forward into the next generation. expands to OR AND the AND OR AND dog AND AND saw OR dog saw AND AND AND the cat a cat a the AND cat OR saw NULL dog Figure 1: Two sample grammars for “the dog saw a cat”. ) ) S Det N V Det N bedding (e.g. “the cheese that the mouse that the ) Det a cat ... chased ate smelled.”) or merely embedding ) Det the to a fixed depth indicate that the issue of a recursive ) N dog natural language grammar has not yet been fully ) N cat resolved. ) V saw V bit In a further step towards simplification, we limit the set of operators to just AND and OR. It is worth which has the following equivalent s-expression noting that NOT is seldom included in CFG formu- (AND (OR a the)(OR dog cat)(OR saw bit) (OR lations anyway. In addition, n-ary AND and OR a the)(OR dog cat)) expressions are rendered in their binary equivalents to simplify many of the GA functions. This representation is not a true s-expression in that the AND symbol is used to indicate catenation, and thus imposes a strict ordering on its arguments. The The evolutionary process OR symbol indicates that any one of its operands The GA proceeds from the creation of a random may substitute for the complete node, and thus population of diverse grammars based on the first imposes no ordering. This representation is used sample string. The vocabulary of the expression throughout the rest of the paper. is added to an initially empty lexicon of terminal Recursive CFGs cannot be translated to s- symbols, and these are combined with randomly expressions in this way, thus our decision to use chosen operators in the construction of a candidate s-expressions as the representation has certain lim- grammar. A NULL symbol is also included in the itations. However, it is quite difficult to develop lexicon to support the possibilityof optionallyomit- an efficient evaluation function for recursive gram- ted constructs within a grammar. If the candidate mars. First, a breadth-first parser would be needed grammar can parse the first string, it is admitted into for testing grammars, adding a great deal of addi- the initial population. Figure 1 shows two possible tional computation to the evaluation process. S- grammars for “the dog saw a cat”. expressions are easily tested bottom-up, allowing Early experiments in which initial grammars more time for the evolutionary search. Second, were constructed in a completely random manner simplistic evolution on recursive CFGs permits the often took a considerable amount of time to produce ) constant threat of infinite looping in the form of, say, even a single grammar capable of parsing the first S S , and thus requires additional constraints in expression. Constraints, such as a maximum depth testing for fitness. Because the training set consists for nesting of randomly constructed nodes, were of finite length strings, the resulting s-expressions subsequently placed on the grammar-generator to are a clean, simple intermediate representation from guide it towards suitable grammars more quickly. which questions about recursive substructure can be Additional simplification steps include the substitu- addressed at a later time. Furthermore, questions tion of (OR x x) and (AND x NULL) with just x to such as whether English allows infinite center em- preempt unnecessary steps during parsing. Also, in parent−1 parent−2 AND AND the AND AND AND OR AND the OR saw AND OR dog saw AND a dog OR AND NULL saw a cat the NULL a cat AND AND the AND AND AND OR AND a cat saw AND OR dog saw AND OR AND NULL saw the OR the NULL a cat offspring−a a dog offspring−b Figure 2: Reproduction via node exchange. keeping with Mauldin’s (Mauldin 1984) suggestion parents also persist. Figure 2 shows parents and for maintaining diversity, duplicate grammars are offspring from one reproductive event. always removed from the population. Mutation Reproduction Whenever two grammars are chosen for reproduc- At each presentation of a new sample string, the GA tion, another is also randomly selected for muta- adds any new vocabulary items to the set of terminal tion. Selection here is done in direct proportion to symbols. Two grammars from the existing popu- a grammar’s size (i.e. a larger grammar is more lation are randomly chosen for “sexual” reproduc- likely to be selected than a smaller one) under the tion. Selection is carried out in inverse proportion assumption that a larger grammar is not adding to to a grammar’s size. That is, grammars with fewer the fitness of the population as a whole. Moreover, nodes are given a higher probability than those with a larger size indicates the presence of extraneous a larger number of nodes under the assumption that nodes which further implies a greater capacity to smaller grammars are more effective. benefit from mutation. A single node is chosen ran- A single, randomly selected node from one par- domly from the grammar. If it is a leaf node, it is ent is exchanged with one from the other parent to replaced with a randomly selected terminal symbol produce two offspring. If an offspring is able to (possibly the NULL terminal) from the lexicon. If parse at least one string from the training set then it is an internal node, it is replaced with the com- it is added to the existing population, otherwise it plement operator. The mutated grammar is added is discarded. The process is non-destructive so the to the existing population if it can parse at least one AND AND AND OR a cat AND OR dog bit saw a the Figure 3: Lexical categories emerge in OR nodes. of the test strings. Its original form is also allowed test set, then they alone survive into the next gener- to persist. ation. At this point a new string is added to the test GAs traditionally assign probabilities to both re- set and a new evolutionary cycle is initiated. The production and mutation. Mutation is given a low shortest grammar in the population is output as the probability (presumably) to allow the adaptation “best-of-generation” after each cycle. procedures to play the greater part in dictating the This cyclic approach to population adaptation is direction of the evolutionary process. The primary in accordance with principles of evolutionary pro- function of mutation is to reduce the possibility of gramming. It has the benefit of allowing seemingly becoming stuck in local maximum, though this is poor changes that occur through reproduction or not always successful if culling decisions are made mutation to reveal their possible advantages at a too early. We needed a method of adding newly later stage of development. For example, the off- acquired lexical items into the population of gram- spring of a reproductive event might not be able to mars. One option was to simply add a conjunctive parse as well as its parents. It may be extraordinar- representation of the new sentence as a disjunctive ily large and complicated. However, it could also option at the root of one or all grammars. However, be that a simple, random change in one of its op- this was seen as too strong an intervention in the erators (brought about through mutation) renders adaptative process. We decided on the more pas- it a powerful grammar able to parse many more sive approach of allowing new words to be swapped well-formed expressions than any of its ancestors. in during a mutation. Because new words would be given the same chance of selection as all other vocabulary items, mutation was allowed to occur Results frequently so that the population did not flounder A series of experiments were devised to examine too long in its search for a correct grammar. the behaviour of the GA. The first test sought to compare the GA’s response to a small subset of Fitness English. A population of 10 randomly constructed Adaptation occurs in cycles of a fixed number of grammars was generated for the sentence “the dog reproductions and mutations. After each cycle, all saw a cat”, the best (i.e. smallest) of which was the grammars are tested for their ability to parse the following 13 node s-expression. existing test set. Parsing is carried out bottom-up, (AND (AND (AND the dog)(AND saw (OR (OR and grammars able to parse the greatest number of cat NULL) a))) cat) expressions in the test set are allowed to persist into the next cycle. Thus if no grammars have evolved The decision to start with 10 grammars was arbi- to parse the complete test set, the grammars that trary, intended merely to avail the algorithm of some could parse all but the latest addition will still be initial diversity. present in the population. Any new grammars that The sentence “a dog saw a cat” was added to the parse an equal number of expressions are also al- training set and the initial population was allowed lowed to persist, and the population is subjected to to evolve for one cycle of 25 generations. The another cycle of adaptation. Thus the population cycle length was arrived at through trial and error. could conceivably continue to grow indefinitely. If, Successful grammars were often found in one or however, grammars have evolved to cover the entire two generations, but a larger period allowed a wider AND AND AND OR a cat AND OR saw OR OR bit AND a the bit OR cat dog a dog Figure 4: The grammar after four sample strings. exploration of the search space and a better chance has successfully collapsed the nouns of the subject for the emergence of more efficient grammars. nounphrase under a single node. The corresponding Five correct grammars resulted from the first cy- tree for this grammar is shown in Figure 4. From cle, the best of which was the four training expressions, four additional well- formed sentences have been inferred. Even so, two (AND (AND (AND (OR a the) dog) (AND dog aspects have emerged that weaken the grammar’s saw)) (AND (OR a the)a cat)) overall appeal. First, the word “bit” has been erro- This grammar has grouped the determiners into a neously incorporated into the set of subject nouns single disjunctive node—a desirable result from a (the fact that consequent sentences might be argued grammar induction system. It has also fortuitously as grammatical is simply an accident of the vocab- copied the determiner node into the direct object ulary used). This extraneous node adds nothing to nounphrase, and thus has inferred a new string. the grammar’s ability to cover the training set, and The sentence “the dog bit a cat” was added to would be removed if enough time were allowed the training set to see if a similar grouping effect because the result would be a more concise gram- could be achieved for regular verbs. The population mar and therefore more likely to persist into future obtained from the previous cycle was evolved for generations. another 2 cycles (50 generations) before it devel- Second, the direct object nounphrases “a dog” oped any grammars that could parse all three test and “a cat” have been distributed between two dis- strings. The best of the two successful grammars sociated nodes. This also is the result of an inade- was this 13-node grammar quate amount of evolutionary time, as the seemingly obvious advantage of (AND (AND (AND (OR a the) dog)(OR bit (AND a (OR dog cat)) saw))(AND a cat)) over and its corresponding tree is shown in Figure 3. As (OR (AND a dog)(AND a cat)) was hoped, the verbs have emerged within a single node, though the serendipitous determiner node in has not yet emerged. To increase the combinatoric the direct object nounphrase has been lost in the cost of failing to generalise the object nouns, we process. added the strings “the dog saw a mouse” and “a cat Finally, the sentence “the cat saw a cat” was chased the mouse” to the training set in two suc- added to the training set to see if nouns will also cessive cycles. The best grammar obtained from group together. After 3 cycles (75 generations), the additional 100 generations is shown in Figure 5 22 grammars evolved that could parse the complete and, as expected, the additional complexity of the test set. The shortest of these, training set has led the GA toward a more effective ) generalisation of syntactic components—including (AND (AND (AND (OR a the) (OR bit (OR cat the emergence of the S NP VP general form of- dog)))(OR (OR bit (AND a dog)) saw)) (AND a ten used in stochastic context-free grammars for cat)) English. AND AND AND OR OR OR AND a the OR dog bit OR OR OR bit OR the a OR dog cat AND chased saw bit a cat AND a mouse Figure 5: A S )NP VP structure emerges after 6 sample strings. Earlier, we mentioned that many CNF grammars grammar, as translate into s-expressions. The reverse is also ... the very elegance and simplicity of [an possible. If we perform this transformation on inferred grammar] is rather more evidence the grammar shown in Figure 5, using nontermi- against, than evidence for, it being the gram- nal symbols commonly found in natural language mar our brain is built to use ... [since] adapta- PSRs to stand in for some of the expanded nodes, tions are typically not maximally efficient en- the following rules are obtained. ) gineering solutions to the problems they solve. ) S NP1 VP (Devitt & Sterelny 1987, 145–146) ) NP1 Det N1 That slight perturbations are so often present in ) VP V NP2 inferred grammars, irrespective of the approach ) NP2 Det N2 adopted for their construction, is the consequence ) Det a of building a description based on facts rather than ) Det the ) N cat speculation or intuition (Stich 1980). Grammar in- ) N dog duction as a whole seems to be bringing an end to ) N1 N the view that a good grammar will generate all and ) N1 bit only the grammatical strings of a language (M. & ) N2 N K. 1987). ) N2 a mouse ) V bit Discussion ) V chased ) V saw Recent years have seen a great deal of interest and V bit a activity in developing new approaches to learning This description differs in detail only slightly for natural language processing. A variety of con- from the grammar presented as an example of CNF nectionist models have been devised that are ap- earlier in the paper. The differences in its two noun- parently able to learn the underlying structure of a phrase rules would probably be eliminated if suf- subset of English expressions (Rager & Berg 1990). ficient time was given to discover the weaknesses While it is often difficult to assess the quality of a of their errant nodes. Furthermore, the verbphrase- model within the resulting multi-dimensional maps, like structure would very likely adapt to one more progress towards representation in classical terms closely resembling those found in standard CFGs is being made (Moisl 1992), and the full benefit if enough time were allotted to eradicate its “bit a” of connectionist techniques may gradually become option. more apparent. The seemingly persistent occurrence of extrane- Symbolic approaches have long been the princi- ous nodes may be an inherent penalty paid for the pal methodology for grammar induction. In gen- over-use of mutation. However, minor discrepan- eral, they seek to characterize language structure cies such as these do not necessarily invalidate the through basic pattern recognition techniques, and have proven quite useful when applied to artificial Such grammars can in principle be incorporated languages, such as the L-strings of graphics pro- into any existing processor that uses CNF gram- gramming (Nevill-Manning, Witten, & Maulsby mars. The GA we have described is an effective, 1994). However, repeating patterns of more than fault-tolerant method for inferring practical natural four or five words are uncommon in samples of nat- language grammars. ural language, thus models produced in this way tend to be insufficient for processing novel sen- References tences even when very large corpora are used for training. In addition, unlike connectionist mod- Bickel, A. S., and Bickel, R. W. 1987. Tree els, frequent patterns do not reinforce the model structured rules in genetic algorithms. In Davis, L., any more than infrequent ones. Once a concept is ed., Genetic Algorithms and Simulated Annealing. learned, it tends to persist unchanged throughout Pittman. the training period. Charniak, E. 1993. Statistical language learning. Statistical models attempt to circumvent prob- Massachusetts: MIT Press. lems arising from novel word patterns by assigning Finch, S., and Chater, N. 1992. Bootstrapping small probabilities to all unseen n-grams. But un- syntactic categories using statistical methods. In grammatical patterns are consequently viewed as Daelemans, W. & Powers, D., ed., Background being just as likely as unseen grammatical ones and Experiments in Machine Learning of Natural and, because any possible word combination must Language, 229–236. Tilburg, NL.: ITK. be assigned some probability, the model eventu- ally degenerates to a complete graph. Even so, this Gold, E. M. 1967. Language identification in the comprehensive nature of statistical models makes limit. Information Control 10:447–474. them extremely robust when processing novel ex- Koza, J. R. 1992. Genetic Programming. MIT pressions. In addition, frequently observed n-grams Press. reinforce learned concepts, and the resulting model Kupiec, J. M. 1989. Augmenting a hidden markov can greatly assist processing tasks, such as parsing, model for phrase-dependent word tagging. In Pro- by directing searches to more likely solutions first. ceedings of the 1989 DARPA Speech and Natural The GA described in this paper is a compromise Language Workshop, 92–98. Philadelphia: Mor- between these three approaches. Like connection- gan Kaufmann. ism, it is an adaptive process whereby the model M., D., and K., S. 1987. Language and Reality: is gradually conditioned by the training set. Recur- An Introduction to the Philosphy of Language. Ox- ring patterns help to reinforce partial inferences, but ford: Blackwell. intermediate states of the model may include incor- rect generalisations that can only be eradicated by Mauldin, M. L. 1984. Maintaining diversity in continued evolution. This is not unlike the devel- genetic search. In Proceedings of the National oping grammar of a child which includes mistakes Conference on AI, 247–250. AAAI. and overgeneralisations that are slowly eliminated Moisl, H. 1992. Connectionist finite state nat- as their weaknesses are made apparent by increas- ural langugae processing. Connection Science ing positive evidence. 4(2):67–91. Like statistical methods, a GA is sensitive to pat- Nevill-Manning, C. G.; Witten, I. H.; and tern frequency. The more often a learned concept Maulsby, D. L. 1994. Compression by induction of can account for substrings of the sample expres- hierarchical grammars. In Storer, J. A., and Cohen, sions, the more likely it will persist as a recur- M., eds., Proceedings of the Data Compression ring feature in subsequent generations of the model. Conference, 244–253. Los Alamitos, California: From an evolutionary programming perspective, the IEEE Press. presence of a concept in a proportionately large number of models in the population also gives it a Rabiner, L., and Juang, B. H. 1993. Fundamentals of speech recognition. Prentice Hall. higher probability of continued survival. Processors using the grammar produced by this Rager, J., and Berg, G. 1990. A connectionist GA do not have explicit access to the frequencies model of motion and government in Chomsky’s used in its construction and thus are not able to government-binding theory. Connection Science take advantage of this information for processing 2(1 & 2):35–52. tasks. The output of the GA is instead a grammar Smith, T. C., and Witten, I. H. 1995. Probability- that, as has been shown, is easily translated into the driven lexical classification: A corpus-based ap- CNF representation typical of symbolic learners. proach. In Proceedings of PACLING-95. accepted. Stich, S. 1980. Grammar, psychology, and indeter- minacy. In Block, N., ed., Readings in Philosophy of Psychology, Volume 2. London: Methuen and Co. 208–222. reprint. Wijkman, P. A. I. 1994. A framework for evolu- tionary computation. In The Third Turkish Sym- posium on Artificial Intelligence and Neural Net- works. Wyard, P. 1991. Context free grammar induction using genetic algorithms. In Proceedings of the 4th International Conference on Genetic Algorithms, 514–518.