प्रयोगकर्ता:RajeshPandey/Nepali Wikipedia Translator
This project is also available in SVN; you can check out the source from there and build it yourself if you want to see how it works.
Web Interface Enabled
You can also translate text here.
- The web interface was updated on 27th June 2011 at aspspider.info/Nepaliwikipedia, but the website operator says it might delete the website since the 90-day trial is over. If anyone is interested in hosting this project, please let me know, or you may build it directly from the svn, publish the web project, host it and then let us know the link. --Rajesh (Talk) 19:32, 26 June 2011 (UTC)
किसे अपना कहें
Latest release : 7th October 2012
- Make sure you are running the latest release. New releases are more accurate than past versions, since I keep adding rules and updating them constantly.
- You can uninstall the older versions and install the latest version to get better accuracy.
- Some new features are:
- 7th October 2012
- Fixed English Translation
- Added feature: English to Nepali direct translation. [Thanks to Microsoft Translation APIs]
- However, Microsoft sets a limit for those using free translation (that would be us). This limit can be overcome by paying Microsoft/Google for data usage.
- Hindi to Nepali Translation will work normally as before.
- 19th April 2012
- Added more translations for Hindi to Nepali Translation
- Removed feature: English to Nepali direct translation. Hindi to Nepali Translation will work.
- Template Copier Bot will now provide the source from where the template was copied.
Release : 30th October 2011
- 30th October 2011
- New build: contains features to copy templates from English Wikipedia to Nepali Wikipedia and from Hindi Wikipedia to Nepali Wikipedia.
- While copying templates, it automatically checks whether the /doc page exists and tries to save it as well.
- Dotnetwikibot has been removed to avoid redundancy; Wikifunctions has been kept.
- A few rules have also been updated in translation.
- If you find any bug, please report it on the talk page. Thanks.
- September 17 2011
- Special Build to mark the Software Freedom Day.
- Added a feature that makes it easier to add words to the dictionaries: whenever a word is highlighted and the "edit rules" menu is clicked, the word is automatically appended to the end of the dictionary.
- Added some more rules.
- September 1 2011
- Manual is added
- Features added
- Save file as .mediawiki format
--Rajesh (Talk) 10:16, 1 September 2011 (UTC)
- June 9th 2011
- A major performance update: the application now loads fewer nouns and verbs instead of all of them, which took more resources and more time. It is now significantly faster.
- Moved the progress bar to the bottom as a status bar.
- Added a diff feature (thanks to Autowikibrowser; I had to include the wikifunctions dll to use the diff functionality).
- Accuracy: some more rules have been added, though I think we need to work more on accuracy. A couple of post-processing rules were added to increase accuracy. However, because I changed the post-processor heavily to make it faster, I expect some bugs to come up, though luckily I haven't found any so far, and the accuracy is the same as before (meaning not bad, even though I am loading fewer rules).
- Fetches Wikipedia article content. Thanks to dotnetwikibot for this. --Rajesh 19:47, 8 June 2011 (UTC)
- 8th March 2011
- Users can have their own rules.
- A progress bar is added.
- Print functionality is added.
- The .mediawiki and .nwt file extensions are registered in the registry for this application.
- When I downloaded translated text as a file from Google Translate (translate.google.com/toolkit), it was saved in .mediawiki format, so I chose that format so that I could translate a file by double-clicking it. The progress bar and the threading took most of my development time. Some more rules are added in this version, though this time I could not work much on the rules. It's better though. I like it, don't know why... maybe because I made it :P.
Download
SVN and homepage
Alternatively, you can browse the "Google projects" page here. You can also check out the source from svn and modify the application if you are familiar with C# .NET. This application was developed on a Dell Vostro 3400 running Windows 7 with Visual Studio 2010 and C#, 3 GB of RAM, and an Intel Core i3 processor (four CPUs, 2.24 GHz).
SVN address:
# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout http://nepaliwikipediatranslator.googlecode.com/svn/trunk/ nepaliwikipediatranslator-read-only
Introduction
This project was developed while I was editing Nepali Wikipedia, especially while replacing text in articles that had been imported from other languages.
I was using Word's find and replace functionality, but I wanted to run the same replacements repeatedly, and I thought it would be nice to save what I replaced and use it again later.
I started to save those "find and replace texts". This program was later developed as a result.
Happy Birthday Wikipedia
A new version has been released on the occasion of Wikipedia's birthday. I have updated words and added some more rules, and accuracy has increased. New fixes have been added, and the application now has an icon and a better-looking user interface.
Overall, the performance is better. I don't use Google transliteration much because it is time-consuming there and I prefer my own editor. Google transliteration is still a good place, because they might use the data for research and might one day add Nepali language support, so for now I have been using both. The unavailability of Google Translate for Nepali led to the development of this application, and I am pretty happy with what it can do now.
How does this work?
Hindi | Nepali |
तुम बहुत लाल है । | तिमी धेरै रातो छ । |
(पुराने औरत) | (पुराना आइमाइ) |
नहीं खाता है | खादैन |
नहीं पढने पर | नपढए पछि |
नहीं निभाना पडा | निभान परेन |
(I don't know whether "corpus" is the right word to use, but I am using it to mean a collection of words that makes up a dictionary.) I created a Hindi to Nepali corpus so that I could easily convert Hindi text to Nepali and thus translate it. There are various transliteration tools available on the internet, but they are focused on English. There are very few translation tools available for the Nepali language. I chose Nepali because:
- Nepali is my native language (primary reason).
- There is a lack of transliteration tools for Nepali.
- Nepali closely resembles Hindi in various forms, which is a benefit as well as an interesting challenge when producing a Nepali translation of text from any language.
Author’s Native language
The author was born in Nepal and has spoken Nepali his entire life, so the choice of Nepali is not a surprise.

Discovery of Wikipedia: Though it was not exactly a discovery, it happened while the author was playing with certain articles in English Wikipedia. Administrators of English Wikipedia started to delete the author’s articles, which were written about “KnightOnline”, a massively multiplayer online role-playing game (MMORPG). The author was then struck by the idea of swapping the characters in the address: en.wikipedia.org would become ne.wikipedia.org, with “e” and “n” becoming “n” and “e”. Entering ne.wikipedia.org in the web browser opened a new world of Wikipedia with a familiar interface in the Nepali language. This was overwhelming, since he had been unaware of such a place on the web. He had not expected anybody to have built something like a Nepali Wikipedia similar to the English one.
The author went on exploring Nepali Wikipedia. It was interesting that nobody objected to whatever he did there. After initially playing in the sandbox of that Wikipedia, he found it interesting to translate some articles and add some of the things he knew. This is the usual way a Wikipedian interacts with any Wikipedia.
Why a separate project
The author was soon limited by the tools and technologies available at that time for translating an article from another language. Translating text seemed cumbersome, so he would do a find and replace in a word processor, which was Microsoft Word 2000, the latest word processor he could use at that time. Soon he found some long articles in Hindi Wikipedia which were of common interest and held historical value. A find and replace could quickly turn a whole article into a meaningful article in Nepali, which was easy and took no more than three hours. The article was posted to Nepali Wikipedia. All of this seemed quite interesting.

Soon there were translation requests from other languages, which weren’t as interesting. Yet he tried to translate them using any kind of automated software or find-and-replace approach. Find and replace was cumbersome, and he had to type in each pair of find and replace keywords every time. He also had to remember the keywords he had already used, which meant redoing what he had already done for previous articles. There was no software to record the valuable find and replace keywords. Being a student of software engineering, the author could imagine such software and could think of saving his work and keywords for future use; however, it remained a brainchild rather than actual software. Years after that first thought, he was able to write the software he had wished for: a find-and-replace program.
Okay then what ?
Soon after implementing basic find and replace, he wanted to identify the parts of speech in sentences and work with those instead. Identification was done by exact match against words stored in different files according to part of speech. For example, nouns were placed in a noun file, verbs in a verb file, and adjectives in a separate file. The main file consisted of any other unclassified words, along with the rules describing how the nouns, verbs and adjectives would be handled.
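The exact-match lookup described here can be sketched roughly as follows. This is only an illustration in Python, not the application's actual C# code: the `WORD_LISTS` dictionary stands in for the separate noun, verb and adjective files, and its contents and the function names are assumptions made up for the example.

```python
# Hypothetical in-memory stand-ins for the noun, verb and adjective files.
WORD_LISTS = {
    "noun": {"apple", "house"},
    "verb": {"eat", "read"},
    "adjective": {"red", "old"},
}


def part_of_speech(word):
    """Return the list a word appears in by exact match, else 'unclassified'."""
    for pos, words in WORD_LISTS.items():
        if word in words:
            return pos
    return "unclassified"


def tag(sentence):
    """Tag every whitespace-separated word of a sentence."""
    return [(word, part_of_speech(word)) for word in sentence.split()]


print(tag("old house very red"))
```

Words that match none of the lists fall through as unclassified, which corresponds to the role of the main file in the description above.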
Working of the software
Combinations of rules could be produced programmatically. Words were replaced on a first-come, first-served basis: rules formed from words near the top of the rule file were processed before those near the bottom. For example, suppose “apple” is to be translated into “banana” and “cherry apple” into “strawberry”. If “apple” occurs first, any occurrence of “cherry apple” is converted into “cherry banana”. In such cases “cherry apple” should be placed above “apple” in the rules file.
Processing was done by storing the rule text in flat files and reading from them. Input and output are provided in the application interface.
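The effect of rule order can be reproduced with a few lines of Python. This is a minimal sketch, assuming rules are simple (find, replace) pairs applied one after another in file order; the function name `apply_rules` is invented for the example and the real application is written in C#.

```python
def apply_rules(text, rules):
    """Apply (find, replace) pairs in order; earlier rules run first."""
    for find, replace in rules:
        text = text.replace(find, replace)
    return text


# "apple" fires first, so "cherry apple" never gets a chance to match.
wrong_order = [("apple", "banana"), ("cherry apple", "strawberry")]
# The longer phrase sits above the shorter word.
right_order = [("cherry apple", "strawberry"), ("apple", "banana")]

text = "cherry apple and apple"
print(apply_rules(text, wrong_order))  # cherry banana and banana
print(apply_rules(text, right_order))  # strawberry and banana
```

Placing longer phrases above the words they contain is the ordering convention the rules file relies on.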
Some of the articles that were edited by this application
[सम्पादन गर्नुहोस्]How does this work then? What are the rules?
These rules are processed at runtime: the nouns, adjectives and verbs are fed in along with the rules, and a list of rules is built and processed in order. The lexicon/corpus, or whatever it should be called, is in effect a developed form of the combination of the verbs, nouns and adjectives. These make up sentences when combined, so we need to take care of them and translate accordingly.
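One way to picture how word lists and rules combine into a single rule list at runtime is sketched below in Python. The `{adj}` template syntax, the variable names and the functions are all assumptions for illustration; the actual rule file format is not described here. The word pairs are taken from the example table (बहुत लाल becomes धेरै रातो).

```python
# Hypothetical Hindi -> Nepali adjective pairs, taken from the example table.
ADJECTIVES = [("लाल", "रातो"), ("पुराने", "पुराना")]

# Hypothetical rule template: {adj} is filled in for every adjective pair.
TEMPLATES = [("बहुत {adj}", "धेरै {adj}")]


def expand(templates, pairs):
    """Build concrete (find, replace) rules from templates and word pairs."""
    rules = []
    for find_t, replace_t in templates:
        for hindi, nepali in pairs:
            rules.append((find_t.format(adj=hindi), replace_t.format(adj=nepali)))
    return rules


def translate(text, rules):
    """Apply the expanded rules in order."""
    for find, replace in rules:
        text = text.replace(find, replace)
    return text


rules = expand(TEMPLATES, ADJECTIVES)
print(translate("बहुत लाल", rules))
```

With two adjective pairs and one template, `expand` produces two concrete rules; a larger word list multiplies the rule count in the same way.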
Requirements
- .NET Framework 4
- Microsoft Windows operating system [Windows XP, Vista, Windows 7, all versions]
- The application is memory hungry and needs a lot of memory while you translate, because the rules it loads come to almost 10 MB. I run it on a Windows 7 laptop with an Intel Core i3 (4 CPUs) and 3 GB of memory, so obviously I never run low on resources while using it. However, I expect this application to work on a system with at least 256 MB of RAM and a 1000 MHz processor. I could have made this program much more memory efficient, but that is not my goal. I wish people could translate text from Hindi to Nepali easily. For the time being I am happy with this application, and I find it more accurate than Google for now. I know the Google guys will work it out and will outsmart this application in the long run, because that's what we want. Once the Google guys figure out how to do this, I can leave, because millions of users use Google every day and I think I am the only person who uses this translation tool :).
- Linux: The application might work in Mono, but I have never tried it on systems other than Windows.
I recently converted the desktop application to a web application, searched for a free hosting website, found aspspider and uploaded it there in February 2011. That simplifies a lot: the user can use the software from a web browser, which is much simpler.
Natural Language Processing
There are a lot of tools and places that offer natural language processing. Madan Puraskar Pustakalaya and Natural Language Processing at Kathmandu University are some I can remember.
I was also looking at the Google Translator Toolkit Data Reference Guide and the Google Translator Toolkit Data HTTP/Atom Guide for developers. I believe there is some research going on at Microsoft Research.
Comparison with Dobhase
Obviously there should not be a comparison with the Dobhase system, which has been developed far more scientifically and skillfully. It might have been designed in such a way that once there are enough words in the system, it could be the ideal machine translator. However, I found the performance of Nepali Wikipedia Translator to be better than that of Dobhase, because Nepali Wikipedia Translator relies on Google Translate for the actual translation from English to Hindi, which is already in Devanagari script. Comparing these systems side by side, I found Nepali Wikipedia Translator to be far more accurate, but it depends on Google for Hindi, whereas Dobhase is independent of Google, and that is actually good. A machine translation system should be independent, and the "Dobhase machine translation system" is better in that sense.
However, in the end it is up to us to decide what we want. For now we want a complete machine translation. What I want is to spend less time on translation instead of scratching my head over meanings in English and Nepali for the whole day, which is what happens when I try to translate anything from English.
So if we want, we can have a workaround and get what we want more easily using Nepali Wikipedia Translator rather than Dobhase, which is the only Nepali machine translation system (as far as I know) for the time being. I know it is not fair to compare them, because Nepali Wikipedia Translator is merely a rule-based "find-and-replace tool", but I find it more useful at this time. Dobhase needs more words, a larger corpus, more lexicons and so on, whereas Nepali Wikipedia Translator uses Hindi, which is similar to Nepali, so it gives better results and does not have to worry about creating an entire Nepali corpus and lexicon.
Here is a sentence that I translated recently
he made notable contributions to analytic geometry probability, and optics.
I was not biased, and it was a random test. I was not very excited when Nepali Wikipedia Translator scored better; I was actually hoping that Dobhase would serve me better, because I know Dobhase will keep evolving and will last longer. In that case I could skip going to Google Translate, translating into Hindi, and re-translating into Nepali. In the end there should be Google Translate in Nepali, Dobhase the Nepali machine translator, or Microsoft Translate for Nepali.
The result from the Dobhase system was :
"ऊसले बनायो उल्लेखनीय योगदानहरू अनल्य्तिच् ज्यामिती सम्भावना , र ओप्तिच्स् |"
The expected result was luckily obtained by Nepali Wikipedia Translator this time :
"उनले विश्लेषणात्मक ज्यामिति, प्रायिकता, र प्रकाशिकी को लागि उल्लेखनीय योगदान दिए।"
The Google translated text in Hindi was :
"उन्होंने विश्लेषणात्मक ज्यामिति, प्रायिकता, और प्रकाशिकी के लिए उल्लेखनीय योगदान दिया."
Common mistakes made by the translator
- "Sword" is तलवार in Hindi and तरवार in Nepali.
Recently the translator translated the proper noun "समरजित सिंह तलवार" as "समरजित सिंह तरवार".
Bollywood movies
- "खून पसीना" (Blood and Sweat) is the name of a Bollywood movie.
Blood is रगत in Nepali and खून in Hindi. The translator translated the title into "रगत पसीना".
Similarly, it translated "शतरंजका खिलाडी" to "शतरंजका खेलाडी".
- धुप in Nepali is घाम and देह is शरिर.
Sometimes these Hindi words are also used in Nepali in certain cases, such as in literature, because both Hindi and Nepali tend toward their common origin, Sanskrit, and merge there. Both languages use these words, but each also has its own alternatives. So when a person speaks in a literary way, they use these words, and then it is hard to translate. Translation is therefore also a place where people have to judge the scenario and the background. But how would a rule-based machine translator know about these scenarios?
It depends on context: when people talk about "धुप", it can be a kind of incense stick that is offered to the gods, which is how धुप appeared in an article related to a festival. But धुप is also the general Hindi term for sunshine.
Example:
एडी (Hindi) is कुर्कुच्चा (Nepali).
एडी डास्लर (proper noun) became कुर्कुच्चा डास्लर; it should not be translated, because it is the proper noun Adi Dassler.
चलिरहेको विवाद: this is the exact Nepali translation. चल (Hindi) is हिड (Nepali) in general, but चलिरहेको means "current". For example:
"Current debate" should be translated to चलिरहेको विवाद.
However, when we translate it, we get the following:
चल रहा विवाद : Hindi phrase
हिड रहा विवाद : Translated phrase (incorrect)
हिड रहे विवाद : Translated phrase (incorrect)
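
The चल रहा case above can be reproduced in a few lines of Python. This is a hedged sketch, not the translator's actual code: the rule pairs and the function `translate_once` are assumptions for illustration. It uses a single regular-expression pass over the input rather than repeated replacement, so that text produced by a phrase rule (चलिरहेको, which itself begins with चल) is not rewritten again by the shorter word rule.

```python
import re

# Hypothetical rules: the phrase rule is listed above the general word rule.
PHRASE_FIRST = [("चल रहा", "चलिरहेको"), ("चल", "हिड")]
WORD_ONLY = [("चल", "हिड")]


def translate_once(text, rules):
    """Replace in one pass over the input; earlier rules win at each position."""
    table = dict(rules)
    pattern = re.compile("|".join(re.escape(find) for find, _ in rules))
    return pattern.sub(lambda match: table[match.group(0)], text)


print(translate_once("चल रहा विवाद", WORD_ONLY))     # the incorrect result shown above
print(translate_once("चल रहा विवाद", PHRASE_FIRST))  # the expected Nepali phrase
```

A phrase rule of this kind only helps for phrases someone has already added to the rules; it does not solve the general problem of judging context.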