Re: "Cleaning" MARC files for use with java importer

Barnett, Jeffrey
We use marcexport as UTF-8 at Yale without difficulty, but so far we have not loaded all 8 million records or most of our non-Latin scripts (we are still in the design and customization stage, and re-indexing too often for a full load to be worth the time).  We did find one vendor whose records had non-Roman characters encoded as "&<charname>;" tags, designed to be rendered through stylesheets; those had to be cleaned up.
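
A minimal sketch of that kind of entity cleanup, assuming (and this is only an assumption, since the vendor's actual names are not given here) that the "&<charname>;" names match the standard HTML entity names shipped with Python. The function name is hypothetical. Names that are not recognized are left untouched so they can be reviewed by hand.

```python
import re
from html.entities import html5  # maps names such as "eacute;" to characters

# Match "&name;" only when the trailing semicolon is present, so that a
# stray ampersand in a title is never rewritten by accident.
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def replace_named_entities(text):
    """Replace known "&name;" tags with the character they stand for."""
    def _substitute(match):
        # Unknown names fall through unchanged.
        return html5.get(match.group(1), match.group(0))

    return _ENTITY.sub(_substitute, text)
```

Run over the field data before indexing, this turns "Caf&eacute;" into "Café" and leaves anything it does not know about as-is.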

-----------original message -------------------------------------------
Date: Thu, 12 Jun 2008 17:17:55 -0500
From: Chris Delis <[hidden email]>
Subject: [VuFind-General] "Cleaning" MARC files for use with java
        importer (was Re: diacritic display -- font problem?)
To: [hidden email]
Message-ID: <[hidden email]>
Content-Type: text/plain; charset=iso-8859-1

Hello all,

Are there any Voyager customers out there using Voyager's marcexport
tool along with the java importer?  If so, are you exporting as MARC21
MARC-8?  And how are you "cleaning" your MARC records, if at all?  I
am having trouble getting the ISOLatin1Filter to work properly in Solr
and am guessing the problem may have to do with a bad encoding
somewhere.  Are there any good tools (which can run in a batch on a
*nix system) someone can recommend?  Or is it just better to translate
(via yaz-marcdump or whatever) to MARCXML and modify the java importer
to read MARCXML?

Thanks!
Chris
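
One quick check for a "bad encoding somewhere" is to see what each exported record claims about itself. In MARC 21, leader position 09 is "a" for UCS/Unicode and blank for MARC-8. The sketch below tallies that flag across a binary MARC file. The function name is hypothetical, and it only reads the declared encoding; it does not verify that the field data actually matches it.

```python
import io
from collections import Counter

LEADER_LENGTH = 24  # every MARC 21 record starts with a 24-byte leader


def count_declared_encodings(source):
    """Tally leader/09 for each record in a binary MARC file or bytes."""
    if isinstance(source, (bytes, bytearray)):
        stream = io.BytesIO(source)
    else:
        stream = source  # assumed to be a file opened in binary mode

    counts = Counter()
    while True:
        leader = stream.read(LEADER_LENGTH)
        if len(leader) < LEADER_LENGTH:
            break  # end of file (or a truncated trailing record)

        # Leader bytes 0-4 hold the total record length as ASCII digits.
        record_length = int(leader[:5].decode("ascii"))
        if record_length < LEADER_LENGTH:
            raise ValueError("implausible record length: %d" % record_length)
        stream.read(record_length - LEADER_LENGTH)  # skip directory and fields

        flag = leader[9:10]
        if flag == b"a":
            counts["utf-8"] += 1
        elif flag == b" ":
            counts["marc-8"] += 1
        else:
            counts["other"] += 1
    return counts
```

A mixed tally, or a MARC-8 tally on a file that was supposedly exported as UTF-8, would suggest looking at the export settings before blaming the filter.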

On Wed, May 21, 2008 at 02:29:36PM -0700, Naomi Dushay wrote:

> There is a C "utf8conditioner" program available at the OAI-PMH web
> site (look under "tools").  It changes bad UTF-8 characters to a
> benign (but unmeaningful) character.  The program comes with test
> files with bad UTF-8 characters.
>
> When I worked for the National Science Digital Library, we harvested
> OAI data that had bad UTF-8 chars.  It was fairly common.
>
> The multi-byte UTF-8 characters tend to be particularly thorny, as I
> recall.
>
> - Naomi
>
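
The idea described above (swap each bad UTF-8 byte sequence for a harmless character) can be sketched in a few lines. This is not the utf8conditioner program itself, only an illustration of the same technique; the handler name "benign_char", the function name, and the choice of "?" as the replacement are all assumptions.

```python
import codecs

REPLACEMENT = "?"  # benign stand-in for bytes that are not valid UTF-8


def _benign_char(error):
    """Error handler: emit the stand-in and resume after the bad bytes."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    return (REPLACEMENT, error.end)


# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("benign_char", _benign_char)


def condition_utf8(raw):
    """Decode bytes as UTF-8, replacing invalid sequences with '?'."""
    return raw.decode("utf-8", errors="benign_char")
```

Valid multi-byte characters pass through unchanged; only sequences the decoder rejects are replaced, so the output is always safe to hand to an indexer that insists on well-formed UTF-8.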
