solrmarc - tweaks to standard number indexing and extraction

Naomi Dushay
Folks,

Using code from the solrmarc project, I've done some (test-driven!) local coding for standard numbers: ISBN, ISSN, OCLC and LCCN.  I thought I would share the information, FWIW; I am happy to share the code as well if folks are interested.  I also have the algorithms and a lot of relevant additional information in a Stanford-only wiki, but I can presumably get a PDF version or something that I could pass around (or possibly cut and paste the wiki text into another wiki somewhere).

I am of the belief that the indexing should take care of the massaging of data as necessary, not the UI code.  So stripping following text, prefixes and the like is done in my indexing code.

For ISBN and ISSN, our cataloging expert pointed out that we want to be as *inclusive* as possible for our users: when they are looking in *our* index, we should allow a match in as many cases as possible (maximizing "recall"!).  On the other hand, when we are using these numbers to retrieve external resources (e.g. Google Book Search), we want the numbers that are most likely to get us a correct answer.  These are two different needs, and they require two different fields:

<!-- isbn is for code to do external lookups by ISBN (e.g. Google Book Search) -->
<!-- TODO:  change isbn to isbn_store -->
<field name="isbn" type="string" indexed="false" stored="true" multiValued="true"/>
<!-- isbnUser_search is for end users to search our index via an ISBN -->
<field name="isbnUser_search" type="string" indexed="true" stored="false" multiValued="true"/>
<!-- issn is for code to do external lookups by ISSN -->
<!-- TODO:  change issn to issn_store -->
<field name="issn" type="string" indexed="false" stored="true" multiValued="true"/>
<!-- issnUser_search is for end users to search our index via an ISSN -->
<field name="issnUser_search" type="string" indexed="true" stored="false" multiValued="true"/>

ISBN
------
a. multiple ISBN in a single marc bib record are allowed.
b. 10 or 13 digit number (last digit may also be "X").
c. Strip any following text.

isbnUser_search field (for end users to search our index):
----
1.  all 020 subfields a starting with an ISBN string - strip following text
2.  AND  all 020 subfields z starting with an ISBN string - strip following text

isbn (for external lookups)
----
1.  all 020 subfields a starting with an ISBN string - strip following text
2.  if none,  all 020 subfields z starting with an ISBN string - strip following text
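
As an illustration only (this is not the solrmarc code, which is Java), the ISBN rules above could be sketched in Python roughly as follows. The record is modeled as a plain list of (tag, subfield code, value) tuples, and the function names are made up for this example. The sketch assumes ISBNs are stored without hyphens and that "X" can only be the last character of a 10-character ISBN.

```python
import re

# An ISBN at the start of the subfield: 13 digits, or 9 digits plus a digit or "X".
# The lookahead rejects longer digit runs such as an 11-digit number.
ISBN_START = re.compile(r"^\s*(\d{13}|\d{9}[\dXx])(?!\d)")


def isbn_values(subfields, code):
    """Return ISBNs from 020 subfields with the given code, trailing text stripped."""
    found = []
    for tag, sub, value in subfields:
        if tag == "020" and sub == code:
            match = ISBN_START.match(value)
            if match:
                found.append(match.group(1).upper())
    return found


def isbn_user_search(subfields):
    """Inclusive list for end-user searching: subfields a AND subfields z."""
    return isbn_values(subfields, "a") + isbn_values(subfields, "z")


def isbn_external(subfields):
    """List for external lookups: subfields a, or subfields z only if there is no a."""
    return isbn_values(subfields, "a") or isbn_values(subfields, "z")


record = [
    ("020", "a", "080442957X (alk. paper)"),
    ("020", "z", "9780123456786 : $25.00"),
]
print(isbn_user_search(record))
print(isbn_external(record))
```

The `or` in `isbn_external` is what implements the "if none" fallback: an empty list from subfield a is falsy, so the subfield z values are used instead.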

ISSN
-----
a. multiple ISSN in a single marc bib record are allowed.
b. 4 digit number followed by hyphen followed by 4 digit number (last digit may also be "X").

issnUser_search field (for end users to search our index):
   I was able to implement this using a pattern map in our vufind.properties file.
----
1.  all 022 subfields a with ISSN
2.  AND  all 022 subfields "l" (letter "L") with ISSN
3.  AND  all 022 subfields m with ISSN
4.  AND  all 022 subfields y with ISSN
5.  AND  all 022 subfields z with ISSN

issn (for external lookups)
----
1.  all 022 subfields a with ISSN
2.  if none,  all 022 subfields z with ISSN
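
A rough Python sketch of the ISSN rules, for illustration only. The implementation described above used a pattern map in vufind.properties, not this code. It assumes the ISSNs live in MARC field 022 (as clarified later in the thread), that the ISSN sits at the start of the subfield, and it models the record as a list of (tag, subfield code, value) tuples; all names here are hypothetical.

```python
import re

# Four digits, a hyphen, three digits, then a digit or "X".
ISSN_START = re.compile(r"^\s*(\d{4}-\d{3}[\dXx])(?![\dXx])")

# Subfields accepted for end-user searching: good, linking, canceled linking,
# incorrect and canceled ISSNs.
USER_SEARCH_CODES = {"a", "l", "m", "y", "z"}


def issn_values(subfields, codes):
    """Return ISSNs from 022 subfields whose code is in `codes`, in record order."""
    found = []
    for tag, sub, value in subfields:
        if tag == "022" and sub in codes:
            match = ISSN_START.match(value)
            if match:
                found.append(match.group(1).upper())
    return found


def issn_user_search(subfields):
    """Inclusive list for end-user searching."""
    return issn_values(subfields, USER_SEARCH_CODES)


def issn_external(subfields):
    """List for external lookups: subfields a, or subfields z only if there is no a."""
    return issn_values(subfields, {"a"}) or issn_values(subfields, {"z"})


record = [
    ("022", "a", "0378-5955"),
    ("022", "y", "2434-561x (incorrect)"),
]
print(issn_user_search(record))
print(issn_external(record))
```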


OCLC and LCCN are not exposed to end users, so we want to use the data that is most likely to get us correct retrieval ("precision"!) in external resources, such as OCLC WorldCat or Google Book Search.  Moreover, since our users do not need to search this data in our catalog, it is not imperative to index these fields, though we must store them.  Choosing to index these fields would enable staff searches on these numbers, if that is desired.

solr/conf/schema.xml:
<!-- lccn number for code to do external lookups -->
<field name="lccn_store" type="string" indexed="false" stored="true"/>
<!-- oclc number for google book search links and for oclc worldcat links -->
<field name="oclc_store" type="string" indexed="false" stored="true" multiValued="true"/>

OCLC:
------
a. multiple OCLC numbers in a single marc bib record are allowed.

1.  all 035 subfields a with *our local prefix* "(OCoLC-M)"
2.  if none, all 079 subfields a prefixed "ocm" or "ocn"
3.  if none of the above, all 035 subfields a prefixed "(OCoLC)"
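
The three-step OCLC fallback can be sketched in Python as below. This is illustrative only and not the solrmarc code; the record is modeled as a list of (tag, subfield code, value) tuples, and the function names are invented for the example. Stripping the prefix from each number follows the stated approach of stripping prefixes at index time.

```python
def strip_prefixed(subfields, tag, prefixes):
    """Return subfield a values of `tag` that start with one of `prefixes`,
    with that prefix removed."""
    found = []
    for field_tag, code, value in subfields:
        if field_tag != tag or code != "a":
            continue
        text = value.strip()
        for prefix in prefixes:
            if text.startswith(prefix):
                found.append(text[len(prefix):].strip())
                break
    return found


def oclc_numbers(subfields):
    """Try each source in priority order; the first non-empty list wins."""
    return (
        strip_prefixed(subfields, "035", ["(OCoLC-M)"])
        or strip_prefixed(subfields, "079", ["ocm", "ocn"])
        or strip_prefixed(subfields, "035", ["(OCoLC)"])
    )


record = [
    ("035", "a", "(OCoLC)111"),
    ("079", "a", "ocm222"),
]
print(oclc_numbers(record))
```

Note that "(OCoLC-M)" values are not picked up by the "(OCoLC)" check, because the character after "OCoLC" differs ("-" versus ")").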

LCCN:
------
a. at most one per marc bib record.
b. Strip following text, but not prefixes.  (Not sure this is correct, but that's what I did.)
c. I was able to implement this using a pattern map in our vufind.properties file.

1. 010 subfield a.
2.  if none, 010 subfield z.
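
For illustration, here is one way the LCCN rule could look in Python. The implementation described above used a pattern map in vufind.properties, so this is only a sketch under assumptions: "following text" is taken to mean anything after the run of digits, a prefix is taken to be up to three letters before the digits, and the record is a list of (tag, subfield code, value) tuples. The function name is hypothetical.

```python
import re

# Optional alphabetic prefix (kept), optional spaces, then the digits.
# Anything after the digits is treated as following text and dropped.
LCCN_START = re.compile(r"^\s*([A-Za-z]{0,3}\s*\d+)")


def lccn(subfields):
    """Return the single LCCN: 010 subfield a if present, otherwise 010 subfield z."""
    for wanted in ("a", "z"):
        for tag, code, value in subfields:
            if tag == "010" and code == wanted:
                match = LCCN_START.match(value)
                if match:
                    return match.group(1)
    return None


print(lccn([("010", "a", "n  79021425 /AC/r86")]))
```

Returning a single value rather than a list matches the "at most one per record" rule and the single-valued lccn_store field.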


Naomi Dushay


-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech
Re: solrmarc - tweaks to standard number indexing and extraction

Andrew Nagy-2

Naomi - this seems quite valuable.  Why is there a need for two isbn and two issn fields?  Couldn't the isbn/issn fields be stripped down to just the numbers, removing any extraneous information such as whether the book is a paperback?

 

Since the ISBN and ISSN numbers don't get displayed to the end user in the search results, I don't see the reason to have the "unmassaged" field in the index.

 

And yes, we'd be interested in including your code in the solrmarc project.

 

Andrew

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Naomi Dushay
Sent: Monday, July 28, 2008 12:22 PM
To: [hidden email]
Subject: [VuFind-Tech] solrmarc - tweaks to standard number indexing and extraction

 


Re: solrmarc - tweaks to standard number indexing and extraction

James Farrugia
Maybe because the ISBNs in subfield z are invalid ones, so you wouldn't want to use those externally, but you might want to use them internally, to be more inclusive in your search.


>>> On 7/28/2008 at 12:30 PM, Andrew Nagy <[hidden email]> wrote:
> Naomi - this seems quite valuable.  I ask why the need for 2 isbn and issn
> fields.  Couldn't the isbn/issn fields be stripped down to just the number
> codes and remove any erroneous information - such as whether the book is
> paperback, etc.?
>
> Since the ISBN and ISSN numbers don't get displayed to the end user in the
> search results, I don't see the reason to have the "unmassaged" field in the
> index.
>
> And Yes - we'd be interested in including your code in the solrmarc project.
>
> Andrew

Re: solrmarc - tweaks to standard number indexing and extraction

Naomi Dushay
Yes, Jim nailed it.  Apparently, if you have a book in hand and type in the ISBN from that book, it is sometimes an invalid ISBN (even though it is printed on the book), so it is recorded in 020 subfield z.  The user should still get a match for that ISBN.  However, when we search Google Books, we don't want to use invalid numbers.

Similarly for ISSN, in the 022 subfields:

a - good
l - good (a linking ISSN?)
m - canceled linking ISSN
y - incorrect ISSN
z - canceled ISSN

I also forgot to mention that (of course) we strip the OCLC prefix off those numbers as well.


On Jul 28, 2008, at 9:32 AM, James Farrugia wrote:

> Maybe because the subfield z for ISBN are invalid ones, so you  
> wouldn't want to use those externally, but maybe you would  
> internally, to be more inclusive in your search.

Naomi Dushay
[hidden email]





Re: [VuFind-General] solrmarc - tweaks to standard number indexing and extraction

Andrew Nagy-2
Let's keep this on the tech mailing list.

So maybe we need an isbn and an isbn2 field, much like what we do now with title, series and author.  That would allow us to weight the valid ISBNs higher than the invalid ones.

Andrew

> -----Original Message-----
> From: [hidden email] [mailto:vufind-
> [hidden email]] On Behalf Of Naomi Dushay
> Sent: Monday, July 28, 2008 1:14 PM
> To: [hidden email]; vufind-
> [hidden email]
> Subject: Re: [VuFind-General] [VuFind-Tech] solrmarc - tweaks to
> standard number indexing and extraction
>
> Yes, Jim nailed it.   Apparently, if you have a book in hand and type
> in the ISBN from that book, it is sometimes an invalid ISBN (even
> though it is printed on the book), so it is in 020 subfield z.  So the
> user should get a match for that ISBN.  However, when we search Google
> Books, we don't want to use invalid numbers.
>
> Similarly for ISSN:  022 subfields
>
> a - good
> l - good (a linking issn(?)
> m - cancelled linking issn
> y - incorrect issn
> z - canceled issn
>
> I also forgot to mention that (of course) we strip the OCLC prefix off
> those numbers as well.
>
>
> On Jul 28, 2008, at 9:32 AM, James Farrugia wrote:
>
> > Maybe because the subfield z for ISBN are invalid ones, so you
> > wouldn't want to use those externally, but maybe you would
> > internally, to be more inclusive in your search.
> >
> >
> >>>> On 7/28/2008 at 12:30 PM, Andrew Nagy <[hidden email]>
> >>>> wrote:
> >> Naomi - this seems quite valuable.  I ask why the need for 2 isbn
> >> and issn
> >> fields.  Couldn't the isbn/issn fields be stripped down to just the
> >> number
> >> codes and remove any erroneous information - such as whether the
> >> book is
> >> paperback, etc.?
> >>
> >> Since the ISBN and ISSN numbers don't get displayed to the end user
> >> in the
> >> search results, I don't see the reason to have the "unmassaged"
> >> field in the
> >> index.
> >>
> >> And Yes - we'd be interested in including your code in the solrmarc
> >> project.
> >>
> >> Andrew
> >>
> >> From: [hidden email]
> >> [mailto:[hidden email]] On Behalf Of
> Naomi
> >> Dushay
> >> Sent: Monday, July 28, 2008 12:22 PM
> >> To: [hidden email]
> >> Subject: [VuFind-Tech] solrmarc - tweaks to standard number
> >> indexing and
> >> extraction
> >>
> >> Folks,
> >>
> >> Using code from the solrmarc project, I've done some (test-driven!)
> >> local
> >> coding for standard numbers: ISBN, ISSN, OCLC and LCCN.  I thought
> >> I would
> >> share the information, FWIW; I am happy to share the code as well
> >> if folks
> >> are interested.  I also have the algorithms and a lot of relevant
> >> additional
> >> information in a Stanford-only wiki, but I can presumably get a PDF
> >> version or
> >> something that I could pass around (or possibly cut and paste the
> >> wiki text
> >> into another wiki somewhere).
> >>
> >> I am of the belief that the indexing should take care of the
> >> massaging of
> >> data as necessary, not the UI code.  So stripping following text,
> >> prefixes
> >> and the like is done in my indexing code.
> >>
> >> For ISBN and ISSN, our cataloging expert pointed out that we want
> >> to be as
> >> *inclusive* as possible for our users: when they are looking in
> >> *our* index,
> >> we should enable matching occurring in as many cases as possible
> >> (maximizing
> >> "recall"!).   On the other hand, when we are using these numbers for
> >> retrieving external resources (e.g. Google Book Search), we want
> >> the numbers
> >> that are most likely to get us a correct answer.  These are two
> >> different
> >> needs, and they require two different fields:
> >>
> >>              <!-- isbn is for code to do external lookups by ISBN
> >> (e.g. Google
> >> Book Search) -->
> >>              <!-- TODO:  change isbn to isbn_store -->
> >>              <field name="isbn" type="string" indexed="false"
> >> stored="true"
> >> multiValued="true"/>
> >>              <!-- isbnUser_search is for end users to search our
> >> index via an
> >> ISBN -->
> >>              <field name="isbnUser_search" type="string"
> >> indexed="true"
> >> stored="false" multiValued="true"/>
> >>              <!-- issn is for code to do external lookups by ISSN --
> >
> >>              <!-- TODO:  change isbn to issn_store -->
> >>              <field name="issn" type="string" indexed="false"
> >> stored="true"
> >> multiValued="true"/>
> >>              <!-- issnUser_search is for end users to search our
> >> index via an
> >> ISSN -->
> >>              <field name="issnUser_search" type="string"
> >> indexed="true"
> >> stored="false" multiValued="true"/>
> >>
> >> ISBN
> >> ------
> >> a. multiple ISBN in a single marc bib record are allowed.
> >> b. 10 or 13 digit number (last digit may also be "X").
> >> c. Strip any following text.
> >>
> >> isbnUser_search field (for end users to search our index):
> >> ----
> >> 1.  all 020 subfields a starting with an ISBN string - strip
> >> following text
> >> 2.  AND  all 020 subfields z starting with an ISBN string - strip
> >> following
> >> text
> >>
> >> isbn (for external lookups)
> >> ----
> >> 1.  all 020 subfields a starting with an ISBN string - strip
> >> following text
> >> 2.  if none,  all 020 subfields z starting with an ISBN string -
> >> strip
> >> following text
> >>
> >> ISSN
> >> -----
> >> a. multiple ISSN in a single marc bib record are allowed.
> >> b. 4 digit number followed by hyphen followed by 4 digit number
> >> (last digit
> >> may also be "X").
> >>
> >> issnUser_search field (for end users to search our index):
> >>   I was able to implement this using a pattern map in our
> >> vufind.properties
> >> file.
> >> ----
> >> 1.  all 020 subfields a with ISSN
> >> 2.  AND  all 020 subfields "l" (letter "L") with ISSN
> >> 3.  AND  all 020 subfields m with ISSN
> >> 4.  AND  all 020 subfields y with ISSN
> >> 5.  AND  all 020 subfields z with ISSN
> >>
> >> issn (for external lookups)
> >> ----
> >> 1.  all 020 subfields a with ISSN
> >> 5.  if none,  all 020 subfields z with ISSN
> >>
> >>
> >> OCLC and LCCN are not exposed to end users, so we want to use the data
> >> that is most likely to get us correct retrieval ("precision"!) in
> >> external resources, such as OCLC WorldCat or Google Book Search.
> >> Moreover, since this data does not need to be searched in our catalog
> >> by our users, it is not imperative to index these fields, though we
> >> must store them.  Choosing to index these fields would enable staff
> >> searches on these numbers, if that is desired.
> >>
> >> solr/conf/schema.xml:
> >>              <!-- lccn number for code to do external lookups -->
> >>              <field name="lccn_store" type="string" indexed="false" stored="true"/>
> >>              <!-- oclc number for google book search links and for oclc worldcat links -->
> >>              <field name="oclc_store" type="string" indexed="false" stored="true" multiValued="true"/>
> >>
> >> OCLC:
> >> ------
> >> a. multiple OCLC numbers in a single marc bib record are allowed.
> >>
> >> 1.  all 035 subfields a with *our local prefix* "(OCoLC-M)"
> >> 2.  if none, all 079 subfields a prefixed "ocm" or "ocn"
> >> 3.  if none of the above, all 035 subfields a prefixed "(OCoLC)"
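
A possible reading of the three-step OCLC fallback, again as a Python sketch with hypothetical names. The prefix is stripped, as mentioned later in the thread; keeping any leading zeros in the number is an assumption.

```python
def strip_prefix(values, prefixes):
    """Keep only the values that start with one of the prefixes and
    return them with that prefix removed."""
    kept = []
    for raw in values:
        value = raw.strip()
        for prefix in prefixes:
            if value.startswith(prefix):
                kept.append(value[len(prefix):].strip())
                break
    return kept


def oclc_store(f035a, f079a):
    """f035a / f079a: raw strings from 035 $a and 079 $a.
    Each step is tried only if the previous one produced nothing."""
    return (
        strip_prefix(f035a, ["(OCoLC-M)"])      # 1. local prefix
        or strip_prefix(f079a, ["ocm", "ocn"])  # 2. 079 $a
        or strip_prefix(f035a, ["(OCoLC)"])     # 3. plain OCLC prefix
    )
```

Because the steps are chained with "or", numbers from a later step are never mixed in with numbers from an earlier one.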
> >>
> >> LCCN :
> >> -------
> >> a. at most one per marc bib record.
> >> b. Strip following text, but not prefixes.  (Not sure this is
> >> correct, but
> >> that's what I did.)
> >> c. I was able to implement this using a pattern map in our
> >> vufind.properties
> >> file.
> >>
> >> 1. 010 subfield a.
> >> 2.  if none, 010 subfield z.
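
For LCCN, a sketch under fairly loose assumptions: an LCCN is taken to be up to three letters of prefix, optional spaces, then 8 or 10 digits, and everything after that counts as "following text". The pattern and function name are hypothetical; the original was a pattern map in vufind.properties that is not shown in the post.

```python
import re

# Assumed shape: optional alphabetic prefix (kept), optional spaces,
# then 8 or 10 digits. Anything after the digits is dropped.
LCCN_RE = re.compile(r"^([a-z]{0,3} *\d{8}(?:\d{2})?)(?!\d)", re.IGNORECASE)


def lccn_store(subfields_a, subfields_z):
    """Return at most one LCCN: the first usable 010 $a, else the
    first usable 010 $z, else None."""
    for raw in list(subfields_a) + list(subfields_z):
        match = LCCN_RE.match(raw.strip())
        if match:
            return match.group(1)
    return None
```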
> >>
> >>
> >> Naomi Dushay
> >> [hidden email]
>
> Naomi Dushay
> [hidden email]
>
>
>
>
> -----------------------------------------------------------------------
> --
> This SF.Net email is sponsored by the Moblin Your Move Developer's
> challenge
> Build the coolest Linux based applications with Moblin SDK & win great
> prizes
> Grand prize is a trip for two to an Open Source event anywhere in the
> world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> VuFind-General mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/vufind-general

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Vufind-tech mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/vufind-tech

Re: [VuFind-General] solrmarc - tweaks to standard number indexing and extraction

Naomi Dushay
My $.02:

I chose to separate the two fields according to their functions.  
Since only one of them is stored, and only one of them is indexed,  
this shouldn't bloat the index.

Also, I believe the code is clearer with

isbn_store
isbnUser_search

and comments in the schema.xml file, as well as the code, to indicate  
the intended uses of these fields.  There may be better names than the  
ones I chose, but I'm personally not fond of xxx2 as a fieldname,  
because I have to comb through the code to see how it's used, and its  
relation to the xxx field is not clear just from the name.  (Recall my  
previous postings about field names:  xxx_search  xxx_display  
xxx_facet and the like.)

The only reason I didn't use "isbn_store" right away was to avoid
lots of changes to the UI code.  Eventually I intend to
refactor.

- Naomi

On Jul 28, 2008, at 11:02 AM, Andrew Nagy wrote:

> Let's keep this on the tech mailing list.
>
> So maybe we need to have an isbn and an isbn2 field much like what  
> we do now with title, series and author.  This will allow us to  
> weight the valid isbns higher than the invalid isbns?
>
> Andrew
>
>> -----Original Message-----
>> From: [hidden email] [mailto:vufind-
>> [hidden email]] On Behalf Of Naomi Dushay
>> Sent: Monday, July 28, 2008 1:14 PM
>> To: [hidden email]; vufind-
>> [hidden email]
>> Subject: Re: [VuFind-General] [VuFind-Tech] solrmarc - tweaks to
>> standard number indexing and extraction
>>
>> Yes, Jim nailed it.   Apparently, if you have a book in hand and type
>> in the ISBN from that book, it is sometimes an invalid ISBN (even
>> though it is printed on the book), so it is in 020 subfield z.  So the
>> user should get a match for that ISBN.  However, when we search Google
>> Books, we don't want to use invalid numbers.
>>
>> Similarly for ISSN:  022 subfields
>>
>> a - good
>> l - good (a linking issn?)
>> m - canceled linking issn
>> y - incorrect issn
>> z - canceled issn
>>
>> I also forgot to mention that (of course) we strip the OCLC prefix off
>> those numbers as well.
>>
>>
>> On Jul 28, 2008, at 9:32 AM, James Farrugia wrote:
>>
>>> Maybe because the subfield z for ISBN are invalid ones, so you
>>> wouldn't want to use those externally, but maybe you would
>>> internally, to be more inclusive in your search.
>>>
>>>
>>>>>> On 7/28/2008 at 12:30 PM, Andrew Nagy <[hidden email]>
>>>>>> wrote:
>>>> Naomi - this seems quite valuable.  I ask why the need for 2 isbn
>>>> and issn
>>>> fields.  Couldn't the isbn/issn fields be stripped down to just the
>>>> number
>>>> codes and remove any erroneous information - such as whether the
>>>> book is
>>>> paperback, etc.?
>>>>
>>>> Since the ISBN and ISSN numbers don't get displayed to the end user
>>>> in the
>>>> search results, I don't see the reason to have the "unmassaged"
>>>> field in the
>>>> index.
>>>>
>>>> And Yes - we'd be interested in including your code in the solrmarc
>>>> project.
>>>>
>>>> Andrew
>>>>
>>>> From: [hidden email]
>>>> [mailto:[hidden email]] On Behalf Of Naomi Dushay
>>>> Sent: Monday, July 28, 2008 12:22 PM
>>>> To: [hidden email]
>>>> Subject: [VuFind-Tech] solrmarc - tweaks to standard number
>>>> indexing and
>>>> extraction
>>>>
>>>> Folks,
>>>>
>>>> Using code from the solrmarc project, I've done some (test-driven!)
>>>> local
>>>> coding for standard numbers: ISBN, ISSN, OCLC and LCCN.  I thought
>>>> I would
>>>> share the information, FWIW; I am happy to share the code as well
>>>> if folks
>>>> are interested.  I also have the algorithms and a lot of relevant
>>>> additional
>>>> information in a Stanford-only wiki, but I can presumably get a PDF
>>>> version or
>>>> something that I could pass around (or possibly cut and paste the
>>>> wiki text
>>>> into another wiki somewhere).
>>>>
>>>> I am of the belief that the indexing should take care of the
>>>> massaging of
>>>> data as necessary, not the UI code.  So stripping following text,
>>>> prefixes
>>>> and the like is done in my indexing code.
>>>>
>>>> For ISBN and ISSN, our cataloging expert pointed out that we want
>>>> to be as
>>>> *inclusive* as possible for our users: when they are looking in
>>>> *our* index,
>>>> we should enable matching occurring in as many cases as possible
>>>> (maximizing
>>>> "recall"!).   On the other hand, when we are using these numbers  
>>>> for
>>>> retrieving external resources (e.g. Google Book Search), we want
>>>> the numbers
>>>> that are most likely to get us a correct answer.  These are two
>>>> different
>>>> needs, and they require two different fields:
>>>>
>>>>             <!-- isbn is for code to do external lookups by ISBN (e.g. Google Book Search) -->
>>>>             <!-- TODO:  change isbn to isbn_store -->
>>>>             <field name="isbn" type="string" indexed="false" stored="true" multiValued="true"/>
>>>>             <!-- isbnUser_search is for end users to search our index via an ISBN -->
>>>>             <field name="isbnUser_search" type="string" indexed="true" stored="false" multiValued="true"/>
>>>>             <!-- issn is for code to do external lookups by ISSN -->
>>>>             <!-- TODO:  change issn to issn_store -->
>>>>             <field name="issn" type="string" indexed="false" stored="true" multiValued="true"/>
>>>>             <!-- issnUser_search is for end users to search our index via an ISSN -->
>>>>             <field name="issnUser_search" type="string" indexed="true" stored="false" multiValued="true"/>
>>>>
>>>> ISBN
>>>> ------
>>>> a. multiple ISBN in a single marc bib record are allowed.
>>>> b. 10 or 13 digit number (last digit may also be "X").
>>>> c. Strip any following text.
>>>>
>>>> isbnUser_search field (for end users to search our index):
>>>> ----
>>>> 1.  all 020 subfields a starting with an ISBN string - strip following text
>>>> 2.  AND  all 020 subfields z starting with an ISBN string - strip following text
>>>>
>>>> isbn (for external lookups)
>>>> ----
>>>> 1.  all 020 subfields a starting with an ISBN string - strip following text
>>>> 2.  if none,  all 020 subfields z starting with an ISBN string - strip following text
>>>>
>>>> ISSN
>>>> -----
>>>> a. multiple ISSN in a single marc bib record are allowed.
>>>> b. 4 digit number followed by hyphen followed by 4 digit number (last digit may also be "X").
>>>>
>>>> issnUser_search field (for end users to search our index):
>>>>  I was able to implement this using a pattern map in our
>>>> vufind.properties
>>>> file.
>>>> ----
>>>> 1.  all 022 subfields a with ISSN
>>>> 2.  AND  all 022 subfields "l" (letter "L") with ISSN
>>>> 3.  AND  all 022 subfields m with ISSN
>>>> 4.  AND  all 022 subfields y with ISSN
>>>> 5.  AND  all 022 subfields z with ISSN
>>>>
>>>> issn (for external lookups)
>>>> ----
>>>> 1.  all 022 subfields a with ISSN
>>>> 2.  if none,  all 022 subfields z with ISSN
>>>>
>>>>
>>>> OCLC and LCCN are not exposed to end users, so we want to use the data
>>>> that is most likely to get us correct retrieval ("precision"!) in
>>>> external resources, such as OCLC WorldCat or Google Book Search.
>>>> Moreover, since this data does not need to be searched in our catalog
>>>> by our users, it is not imperative to index these fields, though we
>>>> must store them.  Choosing to index these fields would enable staff
>>>> searches on these numbers, if that is desired.
>>>>
>>>> solr/conf/schema.xml:
>>>>             <!-- lccn number for code to do external lookups -->
>>>>             <field name="lccn_store" type="string" indexed="false" stored="true"/>
>>>>             <!-- oclc number for google book search links and for oclc worldcat links -->
>>>>             <field name="oclc_store" type="string" indexed="false" stored="true" multiValued="true"/>
>>>>
>>>> OCLC:
>>>> ------
>>>> a. multiple OCLC numbers in a single marc bib record are allowed.
>>>>
>>>> 1.  all 035 subfields a with *our local prefix* "(OCoLC-M)"
>>>> 2.  if none, all 079 subfields a prefixed "ocm" or "ocn"
>>>> 3.  if none of the above, all 035 subfields a prefixed "(OCoLC)"
>>>>
>>>> LCCN :
>>>> -------
>>>> a. at most one per marc bib record.
>>>> b. Strip following text, but not prefixes.  (Not sure this is
>>>> correct, but
>>>> that's what I did.)
>>>> c. I was able to implement this using a pattern map in our
>>>> vufind.properties
>>>> file.
>>>>
>>>> 1. 010 subfield a.
>>>> 2.  if none, 010 subfield z.
>>>>
>>>>
>>>> Naomi Dushay
>>>> [hidden email]
>>
>> Naomi Dushay
>> [hidden email]

Naomi Dushay
[hidden email]



