How do I get those from the Wikimedia dump? I also need a speech corpus,
since I have just started recording audio for my research.
On Fri, Jan 6, 2012 at 8:08 PM, JAGANADH G <jaganadhg at gmail.com> wrote:
Pardon my ignorance. What do you mean by language model ?
A language model is a statistical model built (populated) from a data set.
Here I think the OP is talking about creating a language model for speech
processing. An n-gram model is one kind of language model:
http://en.wikipedia.org/wiki/N-gram
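To make the n-gram idea concrete, here is a minimal sketch of a bigram model estimated by relative frequency from plain tokens. The sample sentence is a stand-in for real corpus data, and the function name is my own illustration, not from any specific toolkit.

```python
# A minimal bigram language model: count word pairs and estimate
# P(next_word | word) by relative frequency.
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Build {word: {next_word: probability}} from a token list."""
    counts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    model = {}
    for w1, followers in counts.items():
        total = sum(followers.values())
        model[w1] = {w2: n / total for w2, n in followers.items()}
    return model

# Toy data; a real model would be trained on a large Tamil corpus.
tokens = "the cat sat on the mat".split()
model = train_bigram_model(tokens)
print(model["the"])  # {'cat': 0.5, 'mat': 0.5}
```

In practice one would train on millions of tokens and add smoothing for unseen n-grams, but the counting step is the same.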
And by
Tamil-corpus do you mean a large collection of tamil text ?
A corpus, in the context of Natural Language Processing, is a large
collection of text. There are different types of corpora, such as text
corpora, speech corpora, image corpora, etc.
Here the OP requires a text corpus. I think he can use the Tamil Wikipedia
dump as a corpus for his research. He could also populate a corpus from
newspaper RSS feeds and Tamil blog feeds.
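As a starting point for the Wikipedia-dump route, here is a hedged sketch of pulling article wikitext out of a MediaWiki XML export using only the standard library. The inline `SAMPLE_DUMP` string stands in for a real dump file (real dumps carry a versioned XML namespace, so tags are matched by local name), and `extract_texts` is my own illustrative helper.

```python
# Sketch: extract the <text> content of each page from a MediaWiki
# XML dump. Real dumps namespace-qualify every tag, so we compare
# only the local tag name after the '}' separator.
import xml.etree.ElementTree as ET

SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Example</title>
    <revision><text>Some Tamil article text here.</text></revision>
  </page>
</mediawiki>"""

def extract_texts(xml_string):
    """Yield the raw wikitext of each <text> element, ignoring namespaces."""
    root = ET.fromstring(xml_string)
    for elem in root.iter():
        if elem.tag.rsplit('}', 1)[-1] == 'text' and elem.text:
            yield elem.text

corpus = list(extract_texts(SAMPLE_DUMP))
print(corpus[0])  # Some Tamil article text here.
```

For a full-size dump one would stream with `ET.iterparse` instead of loading the whole file, and strip wiki markup before using the text as a corpus.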
--
**********************************
JAGANADH G
http://jaganadhg.in
*ILUGCBE*
http://ilugcbe.org.in
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc