Enabling Tesseract For Ghostscript 9.53 and later

Starting with release 9.53, Ghostscript gained preliminary support for OCR devices, using the open-source Tesseract and Leptonica libraries.

As from Version 9.54, the Tesseract and/or Leptonica sources are contained within the Ghostscript release archive. The supplied release binaries contain the OCR devices, but no traineddata files.

Version 9.53 was shipped without Tesseract and/or Leptonica in the release. If you wish to enable OCR support in 9.53, you will need to build your own version of Ghostscript with this support included. We encourage people to use the 9.54 release rather than the 9.53 release unless they have a very good reason.

This page gives you step-by-step instructions, both for building Ghostscript with the OCR devices enabled and for actually using them.

Building on any platform

Step 0 - (Version 9.53 only) Update the Ghostscript source

The code as shipped in 9.53.3 was found to have minor problems on some systems. As we identified and fixed such problems, we kept an updated branch in git, called ghostpdl-9.53.x-ocr-fixes.

Corresponding tesseract and leptonica archives can be found here and here respectively.

Step 1 – Ensure you have the Tesseract Source

If you are using the 9.54 release archives you already have the Tesseract source. If you are building 9.53 or from a git checkout of Ghostscript, then you will need to import a copy of Tesseract into your source tree.

In order to make Ghostscript work as efficiently as possible with Tesseract, we have made some modifications to Tesseract. By and large, these modifications have been passed back upstream, so Ghostscript should now work with a standard unmodified version of Tesseract. As this has not always been the case (and because it takes time to pass new changes back upstream), we suggest that people start off by using our own version of Tesseract, which is guaranteed to include all our changes and not to have been broken by upstream changes.

Our version of the Tesseract source is kept on the 'artifex' branch in the following git repository:

The artifex branch is updated over time to track improvements. For the 9.53.3 release, you want to use the artifex-9.53.3 tag. For the 9.54 release, you want to use the artifex-9.54 tag.

For the Ghostscript 9.53 release, you can download a snapshot of this source here. If our server is overloaded, downloads from that location will fail. Use the versions linked to above instead.

Download that, and unpack it into a directory called 'tesseract' within the ghostpdl sources.

If you are using the 9.54 release archives you already have the Leptonica source. If you are building 9.53, or from a git checkout of Ghostscript, then you will need to import a copy of Leptonica into your source tree.

In order to make Ghostscript work as efficiently as possible with Leptonica, we have made some modifications to Leptonica. By and large, these modifications have been passed back upstream, so Ghostscript should now work with a standard unmodified version of Leptonica. As this has not always been the case (and because it takes time to pass new changes back upstream), we suggest that people start off by using our own version of Leptonica, which is guaranteed to include all our changes and not to have been broken by upstream changes.

Our version of the Leptonica source is kept on the 'artifex' branch in the following git repository:

Download that and unpack it into a directory called 'leptonica' within the ghostpdl sources.

Tesseract relies on encapsulated knowledge so it can recognise particular languages and/or scripts. This knowledge comes in the form of 'traineddata' files. In order for Tesseract to work, it must have access to the appropriate 'traineddata' file for the selected language(s).

To complicate matters further, Tesseract can be built with different engines. These engines work in different ways, and hence need different information in the 'traineddata' file. It is therefore important to match the traineddata file you have with the build of Tesseract that you are using.

Currently, by default, Ghostscript uses the "LSTM" engine (aka the 'modern' engine). You can switch which engine is used with the -dOCREngine= flag when you call Ghostscript.
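As a rough illustration of the workflow this page describes, the sketch below checks that the 'tesseract' and 'leptonica' directories are present in a ghostpdl source tree, and then assembles a Ghostscript command line for an OCR run. Only the two directory names and the -dOCREngine= flag come from this page. The helper names are hypothetical, and the gs executable name, the pdfocr8 device, the -sOCRLanguage option and the TESSDATA_PREFIX environment variable are assumptions taken from general Ghostscript and Tesseract usage, so check them against the documentation for your build. No particular -dOCREngine value is assumed; the caller supplies one only if needed.

```python
import os
from pathlib import Path


def missing_ocr_sources(ghostpdl_root):
    """Return which of the expected source directories are absent.

    The names 'tesseract' and 'leptonica' are the directories this page
    says to unpack into the ghostpdl sources.
    """
    root = Path(ghostpdl_root)
    return [name for name in ("tesseract", "leptonica")
            if not (root / name).is_dir()]


def build_ocr_command(input_pdf, output_pdf, language="eng",
                      engine=None, gs="gs"):
    """Assemble (but do not run) a Ghostscript OCR command line.

    Assumptions: the executable is called 'gs', the 'pdfocr8' device is
    available in the build, and -sOCRLanguage selects the traineddata.
    """
    cmd = [gs, "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfocr8",
           f"-sOCRLanguage={language}"]
    if engine is not None:
        # Only override the default engine when the caller asks for it.
        cmd.append(f"-dOCREngine={engine}")
    cmd += [f"-sOutputFile={output_pdf}", input_pdf]
    return cmd


def ocr_environment(tessdata_dir):
    """Copy the current environment and point TESSDATA_PREFIX
    (an assumed variable name) at the traineddata directory."""
    env = dict(os.environ)
    env["TESSDATA_PREFIX"] = str(tessdata_dir)
    return env
```

The command list and environment could then be handed to `subprocess.run(cmd, env=env, check=True)`. Keeping command construction separate from execution makes it easy to inspect the exact flags before Ghostscript is invoked.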