Oracle recently released its chatbots application, and I was curious about the technology behind chatbots in general. So I explored the open source natural language processing framework OpenNLP and built a simple sentence detector application using my favourite IDE, Oracle JDeveloper 12c.
Input/Output:
Here is my first OpenNLP project, a sentence detector. For example, given the string "How are you ? This is Mike", the detector should find two sentences: "How are you ?" and "This is Mike".
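To make the expected input and output concrete, here is a minimal rule-based sketch that uses only the Java standard library. It is not how OpenNLP works: SentenceDetectorME relies on a trained maximum entropy model, while this hypothetical NaiveSentenceSplitter class (my own name, not part of any library) simply splits wherever '.', '?' or '!' is followed by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical baseline, not part of OpenNLP: splits after '.', '?' or '!'
// whenever that punctuation mark is followed by whitespace.
class NaiveSentenceSplitter {

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        // The lookbehind keeps the punctuation attached to the preceding sentence.
        for (String part : text.trim().split("(?<=[.?!])\\s+")) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Prints "How are you ?" and "This is Mike" on separate lines.
        for (String sentence : split("How are you ? This is Mike")) {
            System.out.println(sentence);
        }
    }
}
```

A rule like this breaks on abbreviations: "Dr. Smith is here." would be cut after "Dr.", which is the kind of case a trained model is meant to handle better.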
By default, OpenNLP does not support Bahasa (Indonesian), so I tried creating a sample corpus to train a model for it.
For example, the input "Apa kabar ? saya Elva" should produce the output "Apa kabar ?" and "saya Elva".
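The OpenNLP sentence detector trainer expects a plain text corpus with one sentence per line, where an empty line marks a document boundary. Below is a minimal sketch of writing such a file with the Java standard library. The TrainingCorpusWriter class name and the sample sentences (beyond the two used in this post) are my own illustrative assumptions, and a usable model needs a far larger corpus than this.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: writes a sentence-per-line corpus.
class TrainingCorpusWriter {

    // Writes each sentence on its own line, encoded as UTF-8.
    static void write(Path target, List<String> sentences) throws IOException {
        Files.write(target, sentences, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Illustrative sentences only; the last two are invented for this sketch.
        List<String> sentences = Arrays.asList(
                "Apa kabar?",
                "Saya Elva.",
                "Nama saya Elva.",
                "Saya tinggal di Jakarta.");
        // Assumed location: adjust the path to wherever your training data lives.
        write(Paths.get("id-train.sent"), sentences);
    }
}
```

Writing the file as UTF-8 matches the -encoding UTF-8 option passed to the trainer.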
Train model for Bahasa:
OpenNLP provides a command line tool for training new models. To train a new Bahasa sentence detector, use the following command:
$ ./opennlp SentenceDetectorTrainer -model ../id-opennlp-models/id-sent.bin -lang id -data ../id-opennlp-models/id-train.sent -encoding UTF-8
This command generates id-sent.bin, our model file, which can then be used in our project:
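import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

// InvalidFormatException is a subclass of IOException, so one throws clause covers both.
public static void sentenceDetectIndo() throws IOException {
    String paragraph = "Apa Kabar? Saya Elva.";
    // Always start with a model; a model is learned from training data.
    // try-with-resources closes the stream even if loading the model fails.
    try (InputStream is = new FileInputStream("id-sent.bin")) {
        SentenceModel model = new SentenceModel(is);
        SentenceDetectorME sdetector = new SentenceDetectorME(model);
        String[] sentences = sdetector.sentDetect(paragraph);
        // Print every detected sentence instead of assuming there are exactly two.
        for (String sentence : sentences) {
            System.out.println(sentence);
        }
    }
}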
Output:
Apa Kabar?
Saya Elva.
Conclusion:
OpenNLP is a powerful open source natural language processing framework that is easy to use and lets us train our own models with a simple command line tool. In this project, I successfully created a simple sentence detector application, trained my own Bahasa language model, and used it to detect sentences in Bahasa.
Sample project file download here (sorry about the big file size, 57 MB; the project includes all the required libraries and the English models)