ABSTRACT |
Corpora are an important resource for both teaching and
research. Arabic lacks sufficient resources in this field, so a research
project has been designed to compile a corpus that reflects both the
current state of the Arabic language and the needs of end-users.
This paper presents the results of a survey of the needs of teachers of
Arabic as a foreign language (TAFL) and language engineers. A
quantitative analysis of the results shows that a number of text types
should have priority in the corpus. However, even the less useful
categories were judged “useful” by some of the respondents, so we should
not exclude these entirely.
Overall, our survey confirms our view that existing corpora are too
narrowly limited in source-type and genre, and that there is a need for
a freely-accessible corpus of contemporary Arabic covering a broad range
of text-types. Our survey also showed support for the inclusion of
parallel English-Arabic samples. Supplementary questions showed support
for potential use in a wide range of Language Engineering applications,
and indicated that teachers of Arabic as a foreign language already make
significant use of computers in teaching, and want to include
contemporary, authentic examples.
The project began with obtaining copyright clearance from the owners of
the source materials, which are mainly online general and specialized
magazines and general websites. We corresponded extensively with these
owners; on the whole the response was positive, and we obtained
permission to use several valuable resources. Some materials that
should be included in the corpus would first need to be scanned;
owing to a lack of time and equipment, we have not been able to include
them. We then collected the texts and converted them to electronic files.
We invested considerable time and effort in annotating every text with
a header that provides internal and external information. The
header largely follows accepted standards,
but we included only the minimum required information, such as authorship,
publication details, and the type and domain of each text. The
minimal components are: File description, Encoding description, and
Profile description. In addition, we marked up the paragraph
boundaries in each text. To process the files, we
created a header template and edited each file in the Unicode editor
UNIRED. Both the collection of the texts and their annotation were
done manually.
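The header layout described above can be sketched in Python as follows. This is only an illustration, not the procedure used in the project, where headers were filled in by hand from a template. The element names (`header`, `fileDesc`, `encodingDesc`, `profileDesc`, `textType`, `domain`, `charset`) are assumptions modelled loosely on TEI-style headers, as are the UTF-8 encoding value and the rule that a blank line separates paragraphs.

```python
import xml.etree.ElementTree as ET


def build_header(title, author, source, text_type, domain):
    """Build a minimal header with the three components named in the text.

    All element names here are hypothetical, chosen for illustration.
    """
    header = ET.Element("header")

    # File description: authorship and publication details.
    file_desc = ET.SubElement(header, "fileDesc")
    ET.SubElement(file_desc, "title").text = title
    ET.SubElement(file_desc, "author").text = author
    ET.SubElement(file_desc, "source").text = source

    # Encoding description: how the file is encoded (UTF-8 is an assumption).
    encoding_desc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding_desc, "charset").text = "UTF-8"

    # Profile description: text type and domain.
    profile_desc = ET.SubElement(header, "profileDesc")
    ET.SubElement(profile_desc, "textType").text = text_type
    ET.SubElement(profile_desc, "domain").text = domain
    return header


def annotate(raw_text, header):
    """Attach the header and wrap each blank-line-separated paragraph in <p>."""
    doc = ET.Element("document")
    doc.append(header)
    body = ET.SubElement(doc, "body")
    for para in raw_text.split("\n\n"):
        para = para.strip()
        if para:  # skip empty chunks left by trailing blank lines
            ET.SubElement(body, "p").text = para
    return ET.tostring(doc, encoding="unicode")
```

A call such as `annotate(text, build_header("...", "...", "...", "short stories", "fiction"))` would return one annotated document as a string; the argument values shown are placeholders, not metadata from the corpus.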
The final result is a corpus of around 1M words
covering some of the categories we decided to collect on the basis of
the questionnaire results: short stories, radio, newspapers,
children’s stories, health and medicine, autobiography, magazines, and
economics. To maximize its use, the corpus will be freely
available on the WWW. However, our earlier investigation showed that
it is still difficult to use corpus analysis tools such as concordancers
to handle Arabic text unless they are run under Arabic Windows, and even
then the results are not as tidy as for languages written in Roman
script. Since our corpus will be available on the internet, we hope it
will present an interesting challenge for software engineers to develop
suitable analysis tools.
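To make the concordancing task concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not one of the tools evaluated in the project. The function name `kwic`, the `width` parameter, and the simplifications (whitespace tokenization, exact string matching, no handling of diacritics, clitics, or punctuation) are all assumptions made for illustration.

```python
def kwic(text, keyword, width=3):
    """Return (left, keyword, right) tuples for every exact match of keyword.

    Tokens are split on whitespace only, an assumption that ignores
    punctuation and Arabic clitics. Context is kept in logical (reading)
    order; right-to-left display is left to the rendering environment.
    """
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Usage: find each occurrence of one word with one word of context per side.
sample = "ذهب الولد إلى المدرسة ثم عاد الولد إلى البيت"
for left, key, right in kwic(sample, "الولد", width=1):
    print(left, "|", key, "|", right)
```

Returning tuples in logical order rather than padded display strings sidesteps column alignment, which is where right-to-left scripts tend to produce the untidy output mentioned above.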
|