
Nominal Classification

Here we present gold-standard lists for noun classification in Spanish and English, together with their sizes and the results we obtained with them. If you run classification experiments with these gold standards, let us know: we will be very happy to publish your results here.

For more information on how we developed the gold-standards and about the experiments we performed, see this document.

Gold Standards

English

For English, we have three gold-standard lists. The abstract nouns were extracted from Altarriba et al. (1999), the human nouns were translated from the Spanish gold standard, and the list of non-deverbal eventive nouns was developed by Bel et al. (2010).

Class                         Number of nouns
                              In class   Out class   Total
Abstract                      126        99          225
Human                         283        270         553
Non-deverbal eventive nouns   74         93          167


Spanish

We used the Spanish working lexicon of the Incyta Machine Translation system (Alonso and Bocsák, 2005) to create the concrete, human and semiotic gold-standard lists. The gold standard for non-deverbal eventive nouns was developed for the experiments of Bel et al. (2010). The following table summarizes the number of nouns in each gold standard:

Class                         Number of nouns
                              In class   Out class   Total
Concrete                      3,806      3,800       7,606
Human                         3,258      3,258       6,516
Semiotic                      311        329         640
Non-deverbal eventive nouns   100        100         200
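
The sizes above also show how balanced each list is, which helps when reading accuracy figures. The following is a minimal sketch, not part of the original experiments: it copies the counts from the English and Spanish tables and computes, for each list, the accuracy of a hypothetical baseline that always predicts the larger of the two classes. The names `GOLD_STANDARDS` and `majority_baseline` are ours, introduced only for illustration.

```python
# Gold-standard sizes copied from the tables above: (in class, out class).
GOLD_STANDARDS = {
    ("English", "Abstract"): (126, 99),
    ("English", "Human"): (283, 270),
    ("English", "Non-deverbal eventive"): (74, 93),
    ("Spanish", "Concrete"): (3806, 3800),
    ("Spanish", "Human"): (3258, 3258),
    ("Spanish", "Semiotic"): (311, 329),
    ("Spanish", "Non-deverbal eventive"): (100, 100),
}


def majority_baseline(in_class: int, out_class: int) -> float:
    """Accuracy (%) of always predicting the larger of the two classes.

    This baseline is an illustrative assumption, not a figure from the source.
    """
    total = in_class + out_class
    return 100.0 * max(in_class, out_class) / total


for (language, noun_class), (n_in, n_out) in GOLD_STANDARDS.items():
    print(f"{language:8} {noun_class:22} total={n_in + n_out:5d} "
          f"baseline={majority_baseline(n_in, n_out):.2f}%")
```

Most lists are close to a 50/50 split, so a classifier has to score well above roughly 50-56% to be doing more than guessing the larger class.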


Results obtained with these gold-standards

Here we summarize the results obtained on the noun classification task with these gold standards. We ran the experiments using Decision Trees (see Bel et al., 2010) and extracted noun occurrences from the Corpus Tècnic de l’IULA (Cabré et al., 2006). This corpus contains written texts from the fields of Law, Economy, Genomics, Medicine, and Environment, as well as a contrastive corpus of newspaper texts. For Spanish, we used the newspaper corpus (21M tokens) and the Economy corpus (1 million words) to study domain-specific and sparse-data problems. For English, we used texts from different domains (Economy, Medicine, Computer Science, and Environmental issues), totalling about 3.2M tokens.

Beat me! If you run classification experiments with these gold standards, let us know and we will be very happy to publish your results here.

Results for English


Class                         Acc. %   FP %    FN %
Abstract                      71.36    9.86    18.78
Human                         80.48    5.16    14.36
Non-deverbal eventive nouns   80.24    1.80    17.96
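
In each row above, Acc., FP and FN sum to 100%, which suggests that false positives and false negatives are reported as shares of all classified nouns rather than as per-class error rates. The sketch below shows how such figures could be computed under that assumption. The function `acc_fp_fn` is a hypothetical helper, and the prediction vector is a reconstruction of ours that happens to match the non-deverbal eventive row (44 of the 74 in-class nouns and 90 of the 93 out-class nouns classified correctly); the source does not give the underlying counts.

```python
from typing import Sequence, Tuple


def acc_fp_fn(gold: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (accuracy %, FP %, FN %), each as a share of all nouns.

    True means "in class". Assumes FP and FN are measured over the whole
    list, which is our reading of the tables, not a stated definition.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    total = len(gold)
    false_pos = sum(1 for g, p in zip(gold, predicted) if p and not g)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)
    correct = total - false_pos - false_neg
    return (100.0 * correct / total,
            100.0 * false_pos / total,
            100.0 * false_neg / total)


# Hypothetical reconstruction for the English non-deverbal eventive list:
# 74 in-class and 93 out-class nouns, with 30 misses and 3 false alarms.
gold = [True] * 74 + [False] * 93
predicted = [True] * 44 + [False] * 30 + [True] * 3 + [False] * 90

acc, fp, fn = acc_fp_fn(gold, predicted)
print(f"Acc. {acc:.2f}%  FP {fp:.2f}%  FN {fn:.2f}%")
# prints: Acc. 80.24%  FP 1.80%  FN 17.96%
```

Under this reading, a low FP share combined with a higher FN share means the classifier is conservative: it rarely assigns a noun to the class wrongly, but misses a number of true members.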


Results for Spanish


                              General Corpus             Economy Corpus
Class                         Acc. %   FP %    FN %      Acc. %   FP %    FN %
Human                         78.14    9.13    12.74     71.70    5.25    23.05
Semiotics                     71.46    7.94    20.60     65.54    10.46   24.00
Non-deverbal eventive nouns   86.43    3.02    10.55     68.55    18.87   12.58


References

J. A. Alonso and A. Bocsák. 2005. Machine Translation for Catalan-Spanish. The Real Case for Productive MT. In Proceedings of the Tenth Conference of the European Association for Machine Translation (EAMT 2005), Budapest, Hungary.

J. Altarriba, L. M. Bauer and C. Benvenuto. 1999. Concreteness, context availability, and imageability ratings and word associations for abstract, concrete, and emotion words. Behavior Research Methods, Instruments, & Computers, 31.

N. Bel, M. Coll, and G. Resnik. 2010. Automatic Detection of Non-deverbal Event Nouns for Quick Lexicon Production. In Chu-Ren Huang and Dan Jurafsky (eds.), Proceedings of the 23rd International Conference on Computational Linguistics (COLING-2010). Beijing, China. Pp. 46-52. ISBN 978-7-900-268-00

M. T. Cabré, C. Bach and J. Vivaldi. 2006. 10 anys del Corpus de l’IULA. Barcelona: Institut Universitari de Lingüística Aplicada. Universitat Pompeu Fabra.