Nominal Classification
Here we present gold-standard lists for noun classification in Spanish and English, together with their sizes and the results we obtained with them. If you run classification experiments using these gold standards, we will be very happy to publish your results here, so let us know!
For more information on how we developed the gold standards and on the experiments we performed, see this document.
Gold Standards
English
For English, we have three gold-standard lists. The abstract nouns were extracted from Altarriba et al. (1999), the human nouns were translated from the Spanish gold standard, and the non-deverbal eventive nouns were developed by Bel et al. (2010).
| Class | Nouns in class | Nouns out of class | Total nouns |
|---|---|---|---|
| Abstract | 126 | 99 | 225 |
| Human | 283 | 270 | 553 |
| Non-deverbal eventive nouns | 74 | 93 | 167 |
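As a rough reference point for reading the accuracy figures, the in-class and out-class sizes above give a majority-class baseline, that is, the accuracy of always predicting the larger of the two groups. The sketch below is not part of the original experiments; the function name `majority_baseline` is made up for illustration, and it assumes that every noun in a list is actually evaluated, which may not hold if some nouns do not occur in the corpus.

```python
# Gold-standard sizes for English, copied from the table above:
# class name -> (nouns in class, nouns out of class)
ENGLISH_GOLD = {
    "Abstract": (126, 99),
    "Human": (283, 270),
    "Non-deverbal eventive nouns": (74, 93),
}


def majority_baseline(in_class, out_class):
    """Accuracy (%) of always predicting the larger group (hypothetical helper)."""
    total = in_class + out_class
    return round(100 * max(in_class, out_class) / total, 2)


baselines = {
    name: majority_baseline(n_in, n_out)
    for name, (n_in, n_out) in ENGLISH_GOLD.items()
}

for name, value in baselines.items():
    print(f"{name}: {value:.2f}% majority-class baseline")
```

The same helper can be applied to any other in-class/out-class pair of sizes.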
Spanish
We used the Spanish working lexicon of the Incyta Machine Translation system (Alonso and Bocsák, 2005) to create the concrete, human and semiotic gold-standard lists. The gold standard for non-deverbal eventive nouns was developed for the experiments of Bel et al. (2010). The following table summarizes the number of nouns in each gold standard:
| Class | Nouns in class | Nouns out of class | Total nouns |
|---|---|---|---|
| Concrete | 3,806 | 3,800 | 7,606 |
| Human | 3,258 | 3,258 | 6,516 |
| Semiotic | 311 | 329 | 640 |
| Non-deverbal eventive nouns | 100 | 100 | 200 |
Results obtained with these gold-standards
Here we summarize the results obtained on the noun classification task with these gold standards. We ran the experiments using Decision Trees (see Bel et al., 2010) and extracted noun occurrences from the Corpus Tècnic de l’IULA (Cabré et al., 2006). This corpus contains written texts from the fields of Law, Economy, Genomics, Medicine, and Environment, as well as a contrastive corpus of newspaper texts. For Spanish, we used the 21M-token newspaper corpus and the Economy corpus (1 million words) to study domain-specific and sparse-data problems. For English, we used about 3.2M tokens of text from different domains: Economy, Medicine, Computer Science, and Environmental issues.
Results for English
| Class | Acc. % | FP % | FN % |
|---|---|---|---|
| Abstract | 71.36 | 9.86 | 18.78 |
| Human | 80.48 | 5.16 | 14.36 |
| Non-deverbal eventive nouns | 80.24 | 1.80 | 17.96 |
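In the table above, the three percentages in each row add up to 100, which suggests that FP and FN are reported as shares of all classified nouns rather than as per-class error rates. Under that reading, the figures can be computed as in the sketch below. The function name `result_row` is hypothetical, and the raw counts are illustrative: they were chosen because they reproduce the Abstract row, but the actual counts are not reported here.

```python
def result_row(correct, false_pos, false_neg):
    """Return (Acc. %, FP %, FN %) as shares of all classified nouns.

    Assumption: every classified noun is either correct, a false
    positive, or a false negative, so the three shares sum to 100.
    """
    total = correct + false_pos + false_neg
    return tuple(round(100 * n / total, 2) for n in (correct, false_pos, false_neg))


# Illustrative counts only (not taken from the source).
acc, fp, fn = result_row(correct=152, false_pos=21, false_neg=40)
print(f"Acc. {acc:.2f}%  FP {fp:.2f}%  FN {fn:.2f}%")
```

If you report your own results for these gold standards, stating which denominator you used makes the figures directly comparable.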
Results for Spanish
| Class | General Corpus Acc. % | General Corpus FP % | General Corpus FN % | Economy Corpus Acc. % | Economy Corpus FP % | Economy Corpus FN % |
|---|---|---|---|---|---|---|
| Human | 78.14 | 9.13 | 12.74 | 71.70 | 5.25 | 23.05 |
| Semiotic | 71.46 | 7.94 | 20.60 | 65.54 | 10.46 | 24.00 |
| Non-deverbal eventive nouns | 86.43 | 3.02 | 10.55 | 68.55 | 18.87 | 12.58 |
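To make the domain effect in the Spanish results easier to see, the accuracy drop from the general corpus to the Economy corpus can be computed per class. This is a small arithmetic sketch over the figures reported above; the helper name `accuracy_drop` is made up for illustration.

```python
# Accuracy (%) per class, copied from the Spanish results above.
SPANISH_ACCURACY = {
    "Human": {"general": 78.14, "economy": 71.70},
    "Semiotic": {"general": 71.46, "economy": 65.54},
    "Non-deverbal eventive nouns": {"general": 86.43, "economy": 68.55},
}


def accuracy_drop(general, economy):
    """Drop in percentage points from the general to the Economy corpus."""
    return round(general - economy, 2)


drops = {
    name: accuracy_drop(acc["general"], acc["economy"])
    for name, acc in SPANISH_ACCURACY.items()
}

for name, drop in drops.items():
    print(f"{name}: {drop:.2f} points lower on the Economy corpus")
```

On these figures, the largest drop is for non-deverbal eventive nouns.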
References
J. A. Alonso and A. Bocsák. 2005. Machine Translation for Catalan-Spanish: The Real Case for Productive MT. In Proceedings of the Tenth Conference of the European Association for Machine Translation (EAMT 2005), Budapest, Hungary.
J. Altarriba, L. M. Bauer and C. Benvenuto. 1999. Concreteness, context availability, and imageability ratings and word associations for abstract, concrete, and emotion words. Behavior Research Methods, Instruments, & Computers, 31.
N. Bel, M. Coll, and G. Resnik. 2010. Automatic Detection of Non-deverbal Event Nouns for Quick Lexicon Production. In Chu-Ren Huang and Dan Jurafsky (eds.), Proceedings of the 23rd International Conference on Computational Linguistics (COLING-2010). Beijing, China. pp. 46-52. ISBN 978-7-900-268-00
M. T. Cabré, C. Bach and J. Vivaldi. 2006. 10 anys del Corpus de l’IULA. Barcelona: Institut Universitari de Lingüística Aplicada. Universitat Pompeu Fabra.








