"Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)


Delis, Christopher
Hello all,

Are there any Voyager customers out there using Voyager's marcexport
tool along with the Java importer?  If so, are you exporting as MARC21
MARC-8?  And how are you "cleaning" your MARC records, if at all?  I
am having trouble getting the ISOLatin1Filter to work properly in Solr
and am guessing the problem may be a bad encoding
somewhere.  Can someone recommend any good tools that run in batch on a
*nix system?  Or is it better to translate the records to MARCXML
(via yaz-marcdump or similar) and modify the Java importer
to read MARCXML?

Thanks!
Chris
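
For the yaz-marcdump route mentioned above, the conversion can be scripted for batch use. The following is only a sketch: it builds the command line rather than running it, the file name export.mrc is hypothetical, and the flags shown (-f, -t, -o, -l) are the ones yaz-marcdump is generally documented to accept, so check them against your installed version of YAZ.

```python
import shlex


def yaz_marcdump_command(src, output_format="marc"):
    """Build a yaz-marcdump argument list that converts MARC-8 to UTF-8."""
    return [
        "yaz-marcdump",
        "-f", "MARC-8",       # character encoding of the input records
        "-t", "UTF-8",        # character encoding to convert to
        "-o", output_format,  # "marc" for binary MARC, "marcxml" for MARCXML
        "-l", "9=97",         # set leader byte 9 to 'a' (ASCII 97) to flag Unicode
        src,                  # hypothetical input file name
    ]


cmd = yaz_marcdump_command("export.mrc")
print(shlex.join(cmd))
```

yaz-marcdump writes to standard output, so in a batch job the list would typically be passed to subprocess.run with stdout redirected to the converted file. Passing "marcxml" as the second argument covers the MARCXML variant of the question.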

On Wed, May 21, 2008 at 02:29:36PM -0700, Naomi Dushay wrote:

> There is a C "utf8conditioner" program available at the OAI-PMH web  
> site (look under "tools").  It changes bad UTF-8 characters to a  
> benign (but unmeaningful) character.  The program comes with test  
> files with bad UTF-8 characters.
>
> When I worked for the National Science Digital Library, we harvested  
> OAI data that had bad UTF-8 chars.  It was fairly common.
>
> The multi-byte UTF-8 characters tend to be particularly thorny, as I  
> recall.
>
> - Naomi
>
> On May 21, 2008, at 1:13 PM, Andrew Nagy wrote:
>
> > Well I slightly agree - I like putting the burden on the programmer
> > and making things very easy for the implementer.  Especially when the
> > programmer is Wayne :)
> >
> > While we are on this topic - I have talked with some folks here as
> > well as at other libraries, and there seems to be a common issue of
> > records that are nominally in UTF-8 but were not fully converted and
> > are riddled with bad UTF-8 characters.
> >
> > Wayne - do you know of any Java toolkits that can help clean up UTF-8
> > data during the import?
> >
> > Andrew
> >
> >> -----Original Message-----
> >> From: [hidden email] [mailto:vufind-
> >> [hidden email]] On Behalf Of James Farrugia
> >> Sent: Wednesday, May 21, 2008 2:56 PM
> >> To: Wayne Graham
> >> Cc: [hidden email]
> >> Subject: Re: [VuFind-General] diacritic display -- font problem?
> >>
> >> Hi Wayne,
> >>
> >> Thanks. I think the easiest way all around is to put the "burden" of
> >> getting records into UTF-8 (which is what VuFind uses/requires, yes?)
> >> on users rather than developers.
> >>
> >> The simple one-line yaz command with -o marc (thanks, Doug) is
> >> all that's needed it seems.
> >>
> >> This seems the best way to deal with it (or some other conversion
> >> to UTF-8 before loading into VuFind).
> >>
> >> Jim
> >>
> >>>>> On 5/21/2008 at 2:01 PM, Wayne Graham <[hidden email]> wrote:
> >>> Not sure if this will answer your question, but here goes.
> >>>
> >>> The Java that does the indexing has several converters for different
> >>> formats. These include Ansel, ISO 5426 (Latin), and ISO 6937 (ASCII).
> >>> The Ansel converter will convert to and from the MARC-8 format. Right
> >>> now the code that does the indexing doesn't do any conversion... is this
> >>> something you need? If so, we can do an enhancement request.
> >>>
> >>> If you're asking about UTF-8, this is a slightly different answer. By
> >>> virtue of being Java, String objects are stored in UTF-16. I can't
> >>> really think of a reason to do the extra programming to make it UTF-8...
> >>>
> >>> Wayne
> >>>
> >>> James Farrugia wrote:
> >>>> Andrew,
> >>>>
> >>>> Does VuFind offer a MARC to UTF-8 converter?
> >>>>
> >>>> Jim
> >>>>
> >>>>
> >>>>>>> On 5/21/2008 at 1:39 PM, Andrew Nagy <[hidden email]> wrote:
> >>>>> I just changed the CSS for VuFind to no longer use Lucida Grande as the
> >>>>> default font due to the diacritics issues; the default is now Arial Unicode
> >>>>> MS, Arial, Sans-Serif.
> >>>>>
> >>>>> Arial Unicode MS is one of the most Unicode-compliant fonts and is installed
> >>>>> with Windows and OS X 10.5 or later.
> >>>>>
> >>>>> Andrew
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [hidden email] [mailto:vufind-[hidden email]] On Behalf Of Corinna Baksik
> >>>>>> Sent: Wednesday, May 21, 2008 1:16 PM
> >>>>>> To: [hidden email]
> >>>>>> Subject: [VuFind-General] diacritic display -- font problem?
> >>>>>>
> >>>>>> Hi - It seems that diacritical marks are not displaying properly. The
> >>>>>> accent displays over the letter to the right of where it should. I think
> >>>>>> this is a font problem, as I can save an HTML page and use a different font
> >>>>>> and it displays correctly. For example, in this record the accent over the
> >>>>>> first e in Bibliothèque is displaying over the q:
> >>>>>> http://vufind.org/demo/Record/243957
> >>>>>>
> >>>>>> This happens consistently for different types of accents and different
> >>>>>> letters. I suspect that the source record is in decomposed Unicode;
> >>>>>> otherwise it might display properly. We use Arial Unicode MS in our catalog
> >>>>>> because it displays the greatest number of diacritics and non-Latin
> >>>>>> characters properly (though it is not without bugs).
> >>>>>>
> >>>>>> corinna
> >>>>>>
> >>>>>>
> >>>>>> Corinna Baksik
> >>>>>> Harvard University Library
> >>>>>> Office for Information Systems
> >>>>>> 90 Mt. Auburn St
> >>>>>> Cambridge, MA 02138
> >>>>>>
> >>>>>> Phone: 617-495-3724
> >>>>>> Fax: 617-496-5600
> >>>>>> Email: [hidden email]
> >>>>>>
> >>>>>> -------------------------------------------------------------------------
> >>>>>> This SF.net email is sponsored by: Microsoft
> >>>>>> Defy all challenges. Microsoft(R) Visual Studio 2008.
> >>>>>> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> >>>>>> _______________________________________________
> >>>>>> VuFind-General mailing list
> >>>>>> [hidden email]
> >>>>>> https://lists.sourceforge.net/lists/listinfo/vufind-general
> >>>>>>
>
> Naomi Dushay
> [hidden email]

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
VuFind-General mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-general
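
The utf8conditioner suggestion quoted above amounts to replacing byte sequences that are not valid UTF-8 with a harmless placeholder. As a rough sketch of the same idea (not the utf8conditioner program itself), Python's built-in decoder can do this with errors="replace"; the function name condition_utf8 is made up for illustration.

```python
def condition_utf8(raw: bytes):
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD.

    Returns the cleaned text and the number of U+FFFD characters in it.
    The count also includes any U+FFFD that was already present in the
    input, so treat it as an upper bound on the number of repairs.
    """
    text = raw.decode("utf-8", errors="replace")
    return text, text.count("\ufffd")


# A Latin-1 encoded "è" (single byte 0xE8) is not valid UTF-8 on its own.
cleaned, replaced = condition_utf8(b"Biblioth\xe8que")
print(replaced)  # 1
```

This works on field text, not on a whole binary MARC file: replacing bytes changes field lengths, so the record directory would no longer match if the result were written back without rebuilding the record.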

Re: "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)

Wayne Graham
Hi Chris,

How pressed are you for this? I ask because the solrmarc project has added a few patches to the marc4j library that do a much better job of guessing what encoding a record is actually written in, rather than what the record reports itself as (and hopefully produce better results). The code is committed in the solrmarc project; I just haven't had time (yet) to pull it into the VuFind trunk. Looking at my schedule, it probably won't be pulled into VuFind until July, but you may want to grab that code on your own and test it (and if you do, please let me know how it goes).

http://code.google.com/p/solrmarc/

Wayne

--
PJ O'Rourke  - "You can't get rid of poverty by giving people money."
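
To make the idea of "what the record reports itself as" versus "what it is actually written in" concrete, here is a crude heuristic sketch. It is not the marc4j or solrmarc code, and the function names are invented for illustration. It relies on two things: MARC 21 leader position 9 is 'a' for Unicode and blank for MARC-8, and MARC-8 text with diacritics rarely happens to be valid UTF-8.

```python
def declared_encoding(record: bytes) -> str:
    """Read leader position 9: 'a' means Unicode, blank means MARC-8."""
    return "UTF-8" if record[9:10] == b"a" else "MARC-8"


def guess_encoding(record: bytes) -> str:
    """Guess the real encoding from the bytes instead of trusting the leader."""
    if all(b < 0x80 for b in record):
        # Pure ASCII: nothing in the data contradicts the leader.
        return declared_encoding(record)
    try:
        record.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as MARC-8,
        # although it could equally be Latin-1 or something else.
        return "MARC-8"


# Leader says MARC-8 (blank at position 9), but the data is UTF-8.
mislabeled = b"00000nam  2200000 a 4500" + "Biblioth\u00e8que".encode("utf-8")
print(declared_encoding(mislabeled), guess_encoding(mislabeled))  # MARC-8 UTF-8
```

A real detector would also need to handle MARC-8 escape sequences and records that mix encodings, which this sketch ignores.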

Re: "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)

Corinna Baksik
In reply to this post by Delis, Christopher
If it's possible to get your Voyager data out in UTF8, I would highly
recommend that. It's my understanding that Voyager stores data in UTF8, so
to convert it to MARC8 during export, then back to UTF8 for VuFind import,
would likely create unnecessary problems. (Apologies if I misunderstood the
question).

In case it's helpful to anyone, I would point out that MARC21, a record
structure, is distinct from MARC8, a character repertoire. The MARC8
character repertoire is not synonymous with ISO 8859-1 (Latin 1). MARC8
actually encompasses non-Roman scripts, such as CJK. Even if you don't have
non-Roman scripts in your records, MARC8 includes the extended Latin character
set, so if you have records with the musical sharp or the copyright sign, for
example, I don't think that would be covered by tools relying on ISO 8859-1.

corinna
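
One quick way to see the limits of ISO 8859-1 is to list the characters in a string that it cannot encode. This is only an illustrative sketch; the function name and the sample string are invented, and the sample assumes the text has already been converted to Unicode.

```python
def missing_from_latin1(text: str) -> list:
    """Return the characters in text that ISO 8859-1 cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# Hypothetical title: "Sonata in F♯ minor, Łódź"
sample = "Sonata in F\u266f minor, \u0141\u00f3d\u017a"
print(missing_from_latin1(sample))  # ['♯', 'Ł', 'ź']
```

The musical sharp, Ł, and ź all fall outside Latin-1, while ó is inside it, so a filter that only knows Latin-1 would handle some accented letters in a record and silently miss others.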


Re: "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)

Delis, Christopher
On Thu, Jun 12, 2008 at 07:49:07PM -0400, Corinna Baksik wrote:
> If it's possible to get your Voyager data out in UTF8, I would highly
> recommend that. It's my understanding that Voyager stores data in UTF8, so
> to convert it to MARC8 during export, then back to UTF8 for VuFind import,
> would likely create unnecessary problems. (Apologies if I misunderstood the
> question).

Hi, Corinna,

Actually, I have been exporting using the default (which I suppose is
UTF8?).  I wasn't sure if that was the best way to go about it; looks
like it is!

I *think* the problem may have to do with some multi-byte characters
not being properly handled by marc4j.  We shall soon see...

Thanks,
Chris
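
One way to test the "bad encoding somewhere" suspicion before blaming
marc4j is to check whether the exported file is well-formed UTF-8 at
all.  The following is a minimal sketch, not part of VuFind or marc4j:
the class name Utf8Check and the method firstMalformedOffset are made
up for illustration, and it relies only on the JDK's strict UTF-8
decoder.  It treats the file as a raw byte stream and ignores MARC
record structure entirely.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of VuFind or marc4j.
final class Utf8Check {

    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if none. */
    static int firstMalformedOffset(byte[] data) {
        // REPORT makes the decoder stop at bad input instead of silently replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024); // scratch space; decoded text is discarded
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // decoder stops at the start of the bad sequence
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // overflow: scratch buffer is full, empty it and continue
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check.java <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstMalformedOffset(data);
        System.out.println(offset < 0
                ? "valid UTF-8"
                : "malformed UTF-8 at byte offset " + offset);
    }
}
```

If this reports an offset, the bytes around it can be inspected with a
hex dump to see whether the export itself is at fault.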


>
> In case it's helpful to anyone, I would point out that MARC21, a record
> structure, is distinct from MARC8, a character repertoire. The MARC8
> character repertoire is not synonymous with ISO 8859-1 (Latin 1). MARC8
> actually encompasses non-Roman, like CJK. Even if you don't have non-Roman
> in your records, MARC8 includes the extended Latin character set, so if you
> have records with the musical sharp or the copyright sign, for example, I
> don't think that would be covered by tools relying on ISO 8859-1.
>
> corinna

Re: "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)

Delis, Christopher
In reply to this post by Wayne Graham
Thanks, Wayne,

I may have to just give the solrmarc project a try.  I'm not sure
whether it would be easier to use the patched marc4j libraries in place
of the originals, or to go all-out with solrmarc even though it's
still really "young" (I made quite a few changes to the original Java
importer's MARC -> Solr field mapping).  We'll see :) I'm just glad
that this project exists!  I'm sure that once it matures, it will make
all of this easy as pie.

Chris


On Thu, Jun 12, 2008 at 07:24:53PM -0400, Wayne Graham wrote:

> Hi Chris,
>
> How pressed are you for this? I ask because the solrmarc project has added
> a few patches to the marc4j library that do a much better job of guessing
> which encoding a record is actually written in, rather than trusting what
> the record reports itself as (and hopefully produce better results). The
> code is committed in the solrmarc project; I just haven't had time (yet)
> to pull it into the VuFind trunk. Looking at my schedule, it probably won't
> be pulled into VuFind until July, but you may want to grab it on your own
> and test (and if you do, please let me know how it goes).
>
> http://code.google.com/p/solrmarc/
>
> Wayne
>
> --
> PJ O'Rourke  - "You can't get rid of poverty by giving people money."
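
The idea in Wayne's quoted message, checking what a record is actually
written in rather than what it reports, can be illustrated roughly as
follows.  This is only a sketch under stated assumptions and is not the
solrmarc or marc4j code: the class EncodingSniffer and its methods are
hypothetical, and the heuristic (bytes above 0x7F that form valid UTF-8
are taken to be UTF-8) is a simplification.  The one fixed point is the
MARC 21 leader, where position 09 is 'a' for UCS/Unicode and blank for
MARC-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the solrmarc/marc4j implementation.
final class EncodingSniffer {

    enum Verdict { ASCII_ONLY, UTF8, NOT_UTF8 }

    /** MARC 21 leader position 09: 'a' = UCS/Unicode, blank = MARC-8. */
    static boolean leaderDeclaresUnicode(byte[] record) {
        return record.length > 9 && record[9] == 'a';
    }

    /** Classifies the raw bytes of one record without trusting the leader. */
    static Verdict sniff(byte[] record) {
        boolean hasHighBytes = false;
        for (byte b : record) {
            if (b < 0) { // signed byte < 0 means value >= 0x80
                hasHighBytes = true;
                break;
            }
        }
        if (!hasHighBytes) {
            return Verdict.ASCII_ONLY; // identical in MARC-8 and UTF-8
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(record));
            return Verdict.UTF8;
        } catch (CharacterCodingException e) {
            return Verdict.NOT_UTF8; // possibly MARC-8, possibly damaged UTF-8
        }
    }

    /** True when the bytes contradict what the leader claims. */
    static boolean leaderDisagrees(byte[] record) {
        Verdict verdict = sniff(record);
        boolean declared = leaderDeclaresUnicode(record);
        return (verdict == Verdict.UTF8 && !declared)
                || (verdict == Verdict.NOT_UTF8 && declared);
    }

    public static void main(String[] args) {
        // Leader claims MARC-8 (blank at position 09) but the data is UTF-8.
        byte[] sample = ("00000nam  2200000 a 4500" + "Biblioth\u00E8que")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("verdict: " + sniff(sample)
                + ", leader disagrees: " + leaderDisagrees(sample));
    }
}
```

A record flagged by leaderDisagrees is a candidate for conversion or
manual inspection; a real importer would need a more careful test than
this one.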


Re: "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)

Steve Thomas-8
In reply to this post by Corinna Baksik
Here's the command that I use to export MARC for our ebooks project,
which exports as MARC21 in UTF-8 format:

Pmarcexport -rB -mM -t/var/tmp/ebook.rids -q -o/m1/voyager/adelaidedb/local/ebooks.bib

The default output format is UTF-8, so no need to specify a character
format.

Of course, I have not yet tried importing this to VuFind, because the
importer seems broken and my request for help received no answer. :-)
Once I get more time to look at it ....


Cheers,
Steve


Corinna Baksik wrote:

> If it's possible to get your Voyager data out in UTF8, I would highly
> recommend that. It's my understanding that Voyager stores data in UTF8, so
> to convert it to MARC8 during export, then back to UTF8 for VuFind import,
> would likely create unnecessary problems. (Apologies if I misunderstood the
> question).
>
> In case it's helpful to anyone, I would point out that MARC21, a record
> structure, is distinct from MARC8, a character repertoire. The MARC8
> character repertoire is not synonymous with ISO 8859-1 (Latin 1). MARC8
> actually encompasses non-Roman, like CJK. Even if you don't have non-Roman
> in your records, MARC8 includes the extended Latin character set, so if you
> have records with the musical sharp or the copyright sign, for example, I
> don't think that would be covered by tools relying on ISO 8859-1.
>
> corinna
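
Corinna's point about ISO 8859-1 can be checked directly.  The sketch
below is only an illustration (the class name Latin1Coverage and the
method fitsLatin1 are invented here); it asks the JDK's ISO-8859-1
encoder whether a string fits in Latin 1.  As far as I can tell, the
copyright sign (U+00A9) does happen to be in Latin 1, but the musical
sharp (U+266F), the Polish L with stroke (U+0141) and any decomposed
accent (a base letter followed by a combining mark such as U+0300) are
not, so the general concern holds.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Latin1Coverage {

    /** True if every character of the text can be represented in ISO 8859-1. */
    static boolean fitsLatin1(String text) {
        // A fresh encoder per call keeps the method free of shared state.
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u00A9",              // copyright sign
            "\u266F",              // musical sharp
            "\u0141",              // L with stroke
            "Biblioth\u00E8que",   // precomposed e-grave
            "Bibliothe\u0300que"   // decomposed: e followed by combining grave
        };
        for (String sample : samples) {
            System.out.println(sample + " fits Latin 1: " + fitsLatin1(sample));
        }
    }
}
```

Anything that fails this check would be lost or mangled by a tool that
assumes ISO 8859-1 throughout.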
>
> At 05:17 PM 6/12/2008 -0500, Chris Delis wrote:
>> Hello all,
>>
>> Are there any Voyager customers out there using Voyager's marcexport
>> tool along with the java importer?  If so, are you exporting as MARC21
>> MARC-8?  And how are you "cleaning" your marc records, if at all?  I
>> am having trouble getting the ISOLatin1Filter to work properly in SOLR
>> and am guessing the problem may have to do with a bad encoding
>> somewhere.  Are there any good tools (which can run in a batch on a
>> *nix system) someone can recommend?  Or is it just better to translate
>> (via yaz-marcdump or whatever) to MARCXML and modify the java importer
>> to read MARCXML?
>>
>> Thanks!
>> Chris
>>
>> On Wed, May 21, 2008 at 02:29:36PM -0700, Naomi Dushay wrote:
>>> There is a C "utf8conditioner" program available at the OAI-PMH web
>>> site (look under "tools").  It changes bad UTF-8 characters to a
>>> benign (but unmeaningful) character.  The program comes with test
>>> files with bad UTF-8 characters.
>>>
>>> When I worked for the National Science Digital Library, we harvested
>>> OAI data that had bad UTF-8 chars.  It was fairly common.
>>>
>>> The multi-byte UTF-8 characters tend to be particularly thorny, as I
>>> recall.
>>>
>>> - Naomi
>>>
>>> On May 21, 2008, at 1:13 PM, Andrew Nagy wrote:
>>>
>>>> Well I slightly agree - I like putting the burden on the programmer
>>>> and make things very easy for the implementer.  Especially when the
>>>> programmer is Wayne :)
>>>>
>>>> While we are on this topic - I have talked with some folks here as
>>>> well as other libraries and there seems to be a common issue of
>>>> records that are in utf-8 format but we not fully converted and have
>>>> records that are ridden with bad utf-8 characters.
>>>>
>>>> Wayne - do you know of any java toolkits that can help cleanup utf-8
>>>> data during the import?
>>>>
>>>> Andrew
>>>>
>>>>> -----Original Message-----
>>>>> From: [hidden email] [mailto:vufind-
>>>>> [hidden email]] On Behalf Of James Farrugia
>>>>> Sent: Wednesday, May 21, 2008 2:56 PM
>>>>> To: Wayne Graham
>>>>> Cc: [hidden email]
>>>>> Subject: Re: [VuFind-General] diacritic display -- font problem?
>>>>>
>>>>> Hi Wayne,
>>>>>
>>>>> Thanks. I think the easiest way all around is to put the "burden" of
>>>>> getting records into UTF-8 (which is what VuFind uses/requires, yes?)
>>>>> on users rather than developers.
>>>>>
>>>>> The simple one-line yaz command with -o marc (thanks, Doug) is
>>>>> all that's needed it seems.
>>>>>
>>>>> This seems the best way to deal with it (or some other conversion
>>>>> to UTF-8 before loading into VuFind).
>>>>>
>>>>> Jim
>>>>>
>>>>>>>> On 5/21/2008 at 2:01 PM, Wayne Graham <[hidden email]> wrote:
>>>>>> Not sure if this will answer you question, but here it goes.
>>>>>>
>>>>>> The Java that does the indexing has several converters for different
>>>>>> formats . These include Ansel, ISO5426 (Latin), and ISO 6937
>>>>>> (ASCII).
>>>>>> The Ansel converter will convert to- and from- the MARC-8 format.
>>>>> Right
>>>>>> now the code to do the indexing doesn't do any conversion... is this
>>>>>> something you need? If so, we can do an enhancement request.
>>>>>>
>>>>>> If you're asking about UTF-8, this is a slightly different answer.
>>>>>> By
>>>>>> virtue that it's Java, String objects are stored in UTF-16. I can't
>>>>>> really think of a reason to do the extra programming to make it
>>>>> UTF-8...
>>>>>> Wayne
>>>>>>
>>>>>> James Farrugia wrote:
>>>>>>> Andrew,
>>>>>>>
>>>>>>> Does VuFind offer a MARC to UTF-8 converter?
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>>>> On 5/21/2008 at 1:39 PM, Andrew Nagy <[hidden email]>
>>>>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I just changed the CSS for vufind to no longer use Lucida Grande
>>>>> as
>>>>>>> the
>>>>>>>
>>>>>>>> default font due to the diacritics issues, the default is now
>>>>> Arial
>>>>>>> Unicode
>>>>>>>
>>>>>>>> MS, Arial, Sans-Serif.
>>>>>>>>
>>>>>>>> Arial Unicode MS is one of the most unicode compliant fonts and is
>>>>>>>>
>>>>>>> installed
>>>>>>>
>>>>>>>> with windows and OSX 10.5 or later.
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: [hidden email]
>>>>> [mailto:vufind-
>>>>>>>>> [hidden email]] On Behalf Of Corinna
>>>>> Baksik
>>>>>>>>> Sent: Wednesday, May 21, 2008 1:16 PM
>>>>>>>>> To: [hidden email]
>>>>>>>>> Subject: [VuFind-General] diacritic display -- font problem?
>>>>>>>>>
>>>>>>>>> Hi - It seems that diacritical marks are not displaying properly.
>>>>>>>>>
>>>>>>> The
>>>>>>>
>>>>>>>>> accent displays over the letter to the right of where it should.
>>>>> I
>>>>>>>>> think
>>>>>>>>> this is a font problem as I can save an html page and use a
>>>>>>>>>
>>>>>>> different
>>>>>>>
>>>>>>>>> font
>>>>>>>>> and it displays correctly. For example, in this record the accent
>>>>>>>>>
>>>>>>> over
>>>>>>>
>>>>>>>>> the
>>>>>>>>> first e in Bibliothèque is displaying over the q:
>>>>>>>>> http://vufind.org/demo/Record/243957
>>>>>>>>>
>>>>>>>>> This happens consistently for different types of accents and
>>>>>>>>>
>>>>>>> different
>>>>>>>
>>>>>>>>> letters. I suspect that the source record is in decomposed
>>>>> Unicode,
>>>>>>>>> otherwise it might display properly. We use Arial Unicode MS in
>>>>> our
>>>>>>>>> catalog
>>>>>>>>> because it displays the most number of diacritics and non-Latin
>>>>>>>>> characters
>>>>>>>>> properly (though it is not without bugs).
>>>>>>>>>
>>>>>>>>> corinna
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Corinna Baksik
>>>>>>>>> Harvard University Library
>>>>>>>>> Office for Information Systems
>>>>>>>>> 90 Mt. Auburn St
>>>>>>>>> Cambridge, MA 02138
>>>>>>>>>
>>>>>>>>> Phone: 617-495-3724
>>>>>>>>> Fax: 617-496-5600
>>>>>>>>> Email: [hidden email]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>> Naomi Dushay
>>> [hidden email]

--

Stephen Thomas,
Senior Systems Analyst,
University of Adelaide Library
UNIVERSITY OF ADELAIDE SA 5005 AUSTRALIA
Phone: +61 8 830 35190
Fax: +61 8 830 34369
Email: [hidden email]
URL: http://www.adelaide.edu.au/directory/stephen.thomas
CRICOS Provider Number 00123M


Re: "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)

wsgrah
Administrator
In reply to this post by Delis, Christopher
Well, the code isn't really new. It's what I wrote for this project, with
all the hard-coded stuff made configurable via a properties file (which I
should have done in the first place). The really nice thing about this
new code base is that it just uses a .properties file to map MARC
elements to Solr elements (along with a lot of other mappings)...so
changes are much easier! If you want to add your own methods to call,
there is a facility to do this (though that requires a recompile). As
long as you remember your changes, you should be good to go ;)
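
A minimal sketch of the idea, not the actual solrmarc code: the class name, the property keys, and the "tag plus subfield" value syntax (title = 245a) are all assumptions made up for illustration. It only shows how a mapping kept in a .properties file can be read with java.util.Properties, so that changing a field means editing text rather than recompiling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.TreeSet;

/** Hypothetical sketch: read a Solr-field to MARC-spec mapping from .properties text. */
final class FieldMappingSketch {

    /** Parses lines such as "title = 245a" into a map of Solr field name to MARC spec. */
    static Map<String, String> loadMapping(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        Map<String, String> mapping = new LinkedHashMap<>();
        // Properties has no defined iteration order, so sort the keys for stable output.
        for (String solrField : new TreeSet<>(props.stringPropertyNames())) {
            mapping.put(solrField, props.getProperty(solrField).trim());
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Illustrative values only: Solr field name = MARC tag + subfield code.
        String text = String.join("\n",
            "# solr field = MARC tag + subfield code",
            "title = 245a",
            "author = 100a",
            "isbn = 020a");
        loadMapping(text).forEach((field, spec) ->
            System.out.println(field + " <- " + spec));
    }
}
```

In a real importer the text would come from a file on disk rather than a string literal, and each value would be handed to whatever code extracts that tag and subfield from the MARC record.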

Wayne

Chris Delis wrote:

> Thanks, Wayne,
>
> I may have to just give the solrmarc project a try.  I'm not sure if
> it'd be easier using the overridden marc4j libraries in lieu of the
> original, or if it'd be easier just going all-out with solrmarc even
> though it's really "young" (I made quite a few changes to the
> original Java importer vis-a-vis MARC -> Solr field mapping).  We'll
> see :) I'm just glad that this project exists!  I'm sure once this
> project matures, it will make all of this easy as pie.
>
> Chris
>
>
> On Thu, Jun 12, 2008 at 07:24:53PM -0400, Wayne Graham wrote:
>  
>> Hi Chris,
>>
>> How pressed are you for this? The reason I mention this is that with the
>> solrmarc project there are a few patches added into the marc4j library that
>> do a lot better job of guessing what the actual record is written in, rather
>> than what the record reports itself as (and hopefully produce better
>> results). There is some committed code in the solrmarc project, I just
>> haven't had time (yet) to pull them into the Vufind trunk. Looking at my
>> schedule, the code probably won't be pulled into Vufind until July, but you
>> may want to grab that code on your own and test (and if you do, please let
>> me know how it goes).
>>
>> http://code.google.com/p/solrmarc/
>>
>> Wayne
>>
>> On Thu, Jun 12, 2008 at 6:17 PM, Chris Delis <[hidden email]> wrote:
>>
>>    
>>> Hello all,
>>>
>>> Are there any Voyager customers out there using Voyager's marcexport
>>> tool along with the java importer?  If so, are you exporting as MARC21
>>> MARC-8?  And how are you "cleaning" your marc records, if at all?  I
>>> am having trouble getting the ISOLatin1Filter to work properly in SOLR
>>> and am guessing the problem may have to do with a bad encoding
>>> somewhere.  Are there any good tools (which can run in a batch on a
>>> *nix system) someone can recommend?  Or is it just better to translate
>>> (via yaz-marcdump or whatever) to MARCXML and modify the java importer
>>> to read MARCXML?
>>>
>>> Thanks!
>>> Chris
>>>
>>> On Wed, May 21, 2008 at 02:29:36PM -0700, Naomi Dushay wrote:
>>>      
>>>> There is a C "utf8conditioner" program available at the OAI-PMH web
>>>> site (look under "tools").  It changes bad UTF-8 characters to a
>>>> benign (but unmeaningful) character.  The program comes with test
>>>> files with bad UTF-8 characters.
>>>>
>>>> When I worked for the National Science Digital Library, we harvested
>>>> OAI data that had bad UTF-8 chars.  It was fairly common.
>>>>
>>>> The multi-byte UTF-8 characters tend to be particularly thorny, as I
>>>> recall.
>>>>
>>>> - Naomi
>>>>
>>>> On May 21, 2008, at 1:13 PM, Andrew Nagy wrote:
>>>>
>>>>        
>>>>> Well I slightly agree - I like putting the burden on the programmer
>>>>> and making things very easy for the implementer.  Especially when the
>>>>> programmer is Wayne :)
>>>>>
>>>>> While we are on this topic - I have talked with some folks here as
>>>>> well as other libraries and there seems to be a common issue of
>>>>> records that are in utf-8 format but were not fully converted and
>>>>> are riddled with bad utf-8 characters.
>>>>>
>>>>> Wayne - do you know of any java toolkits that can help cleanup utf-8
>>>>> data during the import?
>>>>>
>>>>> Andrew
>>>>>
>>>>>          
>>>>>> -----Original Message-----
>>>>>> From: [hidden email] [mailto:vufind-
>>>>>> [hidden email]] On Behalf Of James Farrugia
>>>>>> Sent: Wednesday, May 21, 2008 2:56 PM
>>>>>> To: Wayne Graham
>>>>>> Cc: [hidden email]
>>>>>> Subject: Re: [VuFind-General] diacritic display -- font problem?
>>>>>>
>>>>>> Hi Wayne,
>>>>>>
>>>>>> Thanks. I think the easiest way all around is to put the "burden" of
>>>>>> getting records into UTF-8 (which is what VuFind uses/requires, yes?)
>>>>>> on users rather than developers.
>>>>>>
>>>>>> The simple one-line yaz command with -o marc (thanks, Doug) is
>>>>>> all that's needed it seems.
>>>>>>
>>>>>> This seems the best way to deal with it (or some other conversion
>>>>>> to UTF-8 before loading into VuFind).
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>            
>>>>>>>>> On 5/21/2008 at 2:01 PM, Wayne Graham <[hidden email]> wrote:
>>>>>>>>>                  
>>>>>>> Not sure if this will answer your question, but here it goes.
>>>>>>>
>>>>>>> The Java that does the indexing has several converters for different
>>>>>>> formats. These include Ansel, ISO5426 (Latin), and ISO 6937 (ASCII).
>>>>>>> The Ansel converter will convert to and from the MARC-8 format. Right
>>>>>>> now the code to do the indexing doesn't do any conversion... is this
>>>>>>> something you need? If so, we can do an enhancement request.
>>>>>>>
>>>>>>> If you're asking about UTF-8, this is a slightly different answer. By
>>>>>>> virtue that it's Java, String objects are stored in UTF-16. I can't
>>>>>>> really think of a reason to do the extra programming to make it
>>>>>>> UTF-8...
>>>>>>>
>>>>>>> Wayne
>>>>>>>
>>>>>>> James Farrugia wrote:
>>>>>>>              
>>>>>>>> Andrew,
>>>>>>>>
>>>>>>>> Does VuFind offer a MARC to UTF-8 converter?
>>>>>>>>
>>>>>>>> Jim
>>>>>>>>
>>>>>>>>
>>>>>>>>                
>>>>>>>>>>> On 5/21/2008 at 1:39 PM, Andrew Nagy <[hidden email]> wrote:
>>>>>>>>> I just changed the CSS for vufind to no longer use Lucida Grande as the
>>>>>>>>> default font due to the diacritics issues, the default is now Arial
>>>>>>>>> Unicode MS, Arial, Sans-Serif.
>>>>>>>>>
>>>>>>>>> Arial Unicode MS is one of the most unicode compliant fonts and is
>>>>>>>>> installed with windows and OSX 10.5 or later.
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                  
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: [hidden email] [mailto:vufind-
>>>>>>>>>> [hidden email]] On Behalf Of Corinna Baksik
>>>>>>>>>> Sent: Wednesday, May 21, 2008 1:16 PM
>>>>>>>>>> To: [hidden email]
>>>>>>>>>> Subject: [VuFind-General] diacritic display -- font problem?
>>>>>>>>>>
>>>>>>>>>> Hi - It seems that diacritical marks are not displaying properly. The
>>>>>>>>>> accent displays over the letter to the right of where it should. I
>>>>>>>>>> think this is a font problem as I can save an html page and use a
>>>>>>>>>> different font and it displays correctly. For example, in this record
>>>>>>>>>> the accent over the first e in Bibliothèque is displaying over the q:
>>>>>>>>>> http://vufind.org/demo/Record/243957
>>>>>>>>>>
>>>>>>>>>> This happens consistently for different types of accents and different
>>>>>>>>>> letters. I suspect that the source record is in decomposed Unicode,
>>>>>>>>>> otherwise it might display properly. We use Arial Unicode MS in our
>>>>>>>>>> catalog because it displays the most number of diacritics and non-Latin
>>>>>>>>>> characters properly (though it is not without bugs).
>>>>>>>>>>
>>>>>>>>>> corinna
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Corinna Baksik
>>>>>>>>>> Harvard University Library
>>>>>>>>>> Office for Information Systems
>>>>>>>>>> 90 Mt. Auburn St
>>>>>>>>>> Cambridge, MA 02138
>>>>>>>>>>
>>>>>>>>>> Phone: 617-495-3724
>>>>>>>>>> Fax: 617-496-5600
>>>>>>>>>> Email: [hidden email]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                    
>>>> Naomi Dushay
>>>> [hidden email]
>>
>> --
>> PJ O'Rourke  - "You can't get rid of poverty by giving people money."
>>    
>
>  


Re: "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)

Naomi Dushay
Wayne,

I do love your importer and I can't wait to use it.  Do your upcoming mods to the VuFind codebase mean you're already taking on the task of renaming fields:

-----Original Message-----
From: [hidden email] [[hidden email]
[hidden email]] On Behalf Of Naomi Dushay
Sent: Tuesday, June 03, 2008 12:44 PM
To: [hidden email]
Subject: Re: [VuFind-Tech] thinking about fieldnames and import ...

I can live with that!

But:  how about using the convention of field names with suffixes
like

facet
display   or disp
search or srch or index or ix
sort

"text" is SOLR jargon and not obvious at first as to purpose;  Str
is ambiguous, diff b/t author and author2 is not clear etc.


I was also thinking
author_main_(suffix)
author_addl   or author_other  _(suffix)

and 

title_main_(suffix)
title_addl  or title_other  _(suffix)

Basically, anything that makes the functions of the fields clearer.  If you're already doing this or can add it into your mods with little effort, fab-u-lous.  

Otherwise, it may wait until I get back from my 2 weeks off (and no, it's not all vacation.)
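
To make the suffix convention concrete, here is a small hypothetical sketch. The base names and suffixes are the ones proposed above; the class, enum, and method names are invented for illustration and are not part of VuFind or solrmarc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: build purpose-suffixed field names such as author_main_facet. */
final class FieldNameSketch {

    /** The four purposes named in the proposal; each becomes a lowercase suffix. */
    enum Purpose { FACET, DISPLAY, SEARCH, SORT }

    /** Joins a base name and a purpose, e.g. ("title_main", SORT) gives "title_main_sort". */
    static String fieldName(String base, Purpose purpose) {
        return base + "_" + purpose.name().toLowerCase(Locale.ROOT);
    }

    /** Returns the name for every purpose, in enum declaration order. */
    static List<String> allFieldNames(String base) {
        List<String> names = new ArrayList<>();
        for (Purpose p : Purpose.values()) {
            names.add(fieldName(base, p));
        }
        return names;
    }

    public static void main(String[] args) {
        String[] bases = {"author_main", "author_addl", "title_main", "title_addl"};
        for (String base : bases) {
            System.out.println(allFieldNames(base));
        }
    }
}
```

Keeping the purpose in the name means a reader of the schema can tell at a glance which field feeds faceting, display, searching, or sorting.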

- Naomi


On Jun 13, 2008, at 5:48 AM, Wayne Graham wrote:

Well, the code isn't really new. It's what I wrote for this project with
all the hard coded stuff configurable via a properties file (which I
should have done in the first place). The really nice thing about this
new code base is that it just uses a .properties file to map marc
elements to solr elements (along with a lot of other mappings)...so
changes are much easier! If you want to add your own methods to call,
there is a facility to do this (though that requires a recompile). As
long as you remember your changes, you should be good to go ;)

Wayne



Re: [VuFind-Tech] "Cleaning" MARC files for use with java importer (was Re: diacritic display -- font problem?)

wsgrah
Administrator
Not yet...but the change to the code base will make this a very trivial
change.

Naomi Dushay wrote:

> Wayne,
>
> I do love your importer and I can't wait to use it.  Do your upcoming
> mods to the VuFind codebase mean you're already taking on the task of
> renaming fields:
>
>>>> -----Original Message-----
>>>> From: [hidden email] [mailto:vufind-tech-
>>>> [hidden email]] On Behalf Of Naomi Dushay
>>>> Sent: Tuesday, June 03, 2008 12:44 PM
>>>> To: [hidden email]
>>>> Subject: Re: [VuFind-Tech] thinking about fieldnames and import ...
>>>>
>>>> I can live with that!
>>>>
>>>> But:  how about using the convention of field names with suffixes
>>>> like
>>>>
>>>> facet
>>>> display   or disp
>>>> search or srch or index or ix
>>>> sort
>>>>
>>>> "text" is SOLR jargon and not obvious at first as to purpose;  Str
>>>> is ambiguous, diff b/t author and author2 is not clear etc.
>
>
> I was also thinking
> author_main_(suffix)
> author_addl   or author_other  _(suffix)
>
> and
>
> title_main_(suffix)
> title_addl  or title_other  _(suffix)
>
> Basically, anything that makes the functions of the fields clearer.
>  If you're already doing this or can add it into your mods with little
> effort, fab-u-lous.  
>
> Otherwise, it may wait until I get back from my 2 weeks off (and no,
> it's not all vacation.)
>
> - Naomi
>
>
> On Jun 13, 2008, at 5:48 AM, Wayne Graham wrote:
>
>> Well, the code isn't really new. It's what I wrote for this project with
>> all the hard coded stuff configurable via a properties file (which I
>> should have done in the first place). The really nice thing about this
>> new code base is that it just uses a .properties file to map marc
>> elements to solr elements (along with a lot of other mappings)...so
>> changes are much easier! If you want to add your own methods to call,
>> there is a facility to do this (though that requires a recompile). As
>> long as you remember your changes, you should be good to go ;)
>>
>> Wayne
>>
> _______________________________________________
> Vufind-tech mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/vufind-tech
>  
