Mark Humphrys - Undergrad project ideas


Undergrad project ideas



The Internet


  1. Link fixer - DONE BEFORE, WOULD NEED SOME NEW IDEAS

    This would be more than just a program to tell you that links on your web page were broken. It would go out and, using a variety of heuristics, attempt to actually fix the links.
    Input would be a web page. First it would find broken links. Then it would search to see if the missing page had moved somewhere, e.g. by trimming components from the right-hand side of the URL and then reconstructing it.
    If it could not reconstruct the URL, it would turn to search engines using linkto: and other techniques. If it still could not find the URL, it might suggest an alternative based on keywords.
    Finally, it should actually output a copy of the web page with the links fixed, so that the user may simply copy it over the old one, or selectively cut and paste.
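
    The URL-trimming heuristic above could be sketched roughly as follows. This is only an illustration: the function name candidate_urls is made up here, and it only generates the shorter URLs to try. Checking whether each candidate exists, or contains a link to the moved page, is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(broken_url):
    """Yield progressively shorter URLs by trimming path segments from the right."""
    parts = urlsplit(broken_url)
    segments = [s for s in parts.path.split("/") if s]
    # Drop one trailing segment at a time, ending at the site root.
    for keep in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:keep])
        if keep:
            path += "/"
        # Query string and fragment are discarded (an assumption of this sketch).
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

    For "http://example.org/a/b/page.html" this yields "http://example.org/a/b/", then "http://example.org/a/", then "http://example.org/".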


    Possible enhancements:

    1. Take as input an entire website. Run overnight. Generate a report of the broken links, listed in order of the site with the most broken links to it. This ordering matters because often a change must be made globally through many of your pages, e.g. "http://akebono.stanford.edu/yahoo/" has moved to "http://www.yahoo.com/" - change needed throughout the following 53 of your web pages.
    2. Build a script that, when run, makes these changes.
    3. Nice user interface output, showing old page, new page, and selective buttons to press to "Fix link" on disk. User in control at all times.
    4. Use archive.org to find old page. Then search in Google to find where it has gone to.
    5. Suggest alternative links. e.g. Page on Shakespeare has gone. Suggest wiki/Shakespeare instead. "Click this to make change". User always in control.
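
    The site-wide report in enhancement 1 might be built along these lines. A minimal sketch, assuming the crawler has already produced a list of (page, broken URL) pairs. The name broken_link_report is hypothetical, and "most broken links" is interpreted here as the number of your pages affected.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def broken_link_report(broken_links):
    """Group (page, broken_url) pairs by target site, most-referenced site first."""
    pages_by_site = defaultdict(set)
    for page, broken_url in broken_links:
        pages_by_site[urlsplit(broken_url).netloc].add(page)
    # Sort by number of affected pages (descending), then by site name
    # so that the output order is stable.
    return sorted(
        ((site, sorted(pages)) for site, pages in pages_by_site.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )
```

    Each entry of the result pairs a site with the list of your pages that need changing, which is also the information a "fix it" script (enhancement 2) would need.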

    
    
    
    
  2. "Who links to me?" Web agent - DONE BEFORE, WOULD NEED SOME NEW IDEAS

    Search engines have a linkto: facility, so you can see who links to you, but it takes forever to browse the list so you don't bother.
    This would be a standalone program to find all pages that link to the user or reference them, download all these pages (will take a long time), perhaps sort them by the topic or page referenced, and present them all in a nice readable output, with all the references highlighted (as in Google's cache).


    How to highlight a phrase using the font tag.
    View Source to see how this is done (link works on Firefox).

    How to highlight a phrase using the bold tag.
    View Source to see how this is done (link works on Firefox).

    How to highlight a phrase using tables.
    View Source to see how this is done (link works on Firefox).
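
    As a rough illustration of the bold-tag approach, the downloaded page text could be post-processed like this. The function name highlight is made up for this sketch. Note that this naive version also matches inside tags and attributes; a real tool would need to restrict itself to the visible text.

```python
import re


def highlight(html_text, phrase):
    """Wrap each case-insensitive occurrence of phrase in <b>...</b>."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    # Keep the matched text as it appeared on the page, only adding the tags.
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", html_text)
```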


    
    
    
    
  3. "What is like this?"

    Extract keywords from page. (How? Need idea of dictionary frequency.)
    Use search engines to find similar pages on Web.

    Implement as CGI script so that I can automatically add a "What is like this?" link at the top of every URL.
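
    One possible answer to the "How?" above: score each word by how much more often it occurs on the page than in ordinary text. The sketch below assumes a background_freq table (word to relative frequency in general English) is available from somewhere. That table, the floor value for unknown words, and the name extract_keywords are all assumptions of this example.

```python
import re
from collections import Counter


def extract_keywords(text, background_freq, top_n=5):
    """Rank words by how much more frequent they are on the page than in general text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        # Words missing from the background table get a small floor
        # frequency (an assumption), so rare words score highly.
        expected = background_freq.get(word, 1e-6)
        scores[word] = (count / total) / expected
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

    The top few keywords could then be joined into a search engine query to find similar pages.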



  4. Web page enhancer - DONE BEFORE, WOULD NEED SOME NEW IDEAS

    Hyperwords demo.




Computers and Genealogy


  1. Royal Descent finder

    Background:

    The program:

    1. The user selects Hull or Australia.
    2. The user inputs the ID number of the person whose ancestry they want to check. Up to the user to find this ID number.
    3. The program searches the relevant database for lines from this person to any of n pre-defined English and British monarchs.
    4. It returns the shortest such line found (least number of generations).
    5. It displays the line like this.
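
    Steps 3 and 4 could be sketched as a depth-first search whose depth limit shrinks each time a Royal descent is found, as described in the notes. This is only an illustration: it assumes the ancestry has already been loaded into a dictionary mapping each ID to a (father, mother) pair, with None for an unknown parent, and that the data contains no loops. The name shortest_royal_line is hypothetical.

```python
def shortest_royal_line(person, parents, monarchs):
    """Return the shortest ancestor line from person to any monarch, or None."""
    best = None  # shortest line found so far, as a list of IDs

    def visit(current, line):
        nonlocal best
        # Once a descent is known, never explore a line that cannot beat it.
        if best is not None and len(line) >= len(best):
            return
        # The starting person does not count as their own Royal ancestor.
        if current in monarchs and len(line) > 1:
            best = list(line)
            return
        for parent in parents.get(current, ()):
            if parent is not None:
                line.append(parent)
                visit(parent, line)
                line.pop()

    visit(person, [person])
    return best
```

    The returned list runs from the person back to the monarch, so its length minus one is the number of generations.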

    Notes:

    1. To restrict the search (e.g. Victoria would have over 10,000 ancestors) we proceed as follows. First do an indefinite-length search on ancestors until one Royal descent is found. Let us say this is at 12 generations back. From now on, search no more than 12 generations back. Let us say we then find a further Royal descent at 6 generations back. From now on, search no more than 6 generations back. And so on. So with Victoria we would find a Royal descent at 2 generations and would then search no line longer than 2 generations, and the search would end quickly.
    2. HTTP requests should be spaced out. Say one per second.
    3. Australia database has "Ancestors" mode which downloads multiple generations in one HTTP request. This could be used to speed things up.
    4. Information should be cached, so that an HTTP request does not have to be repeated if it was made before.
    5. Need to parse the returned page in order to find link to father and mother.
    6. See Parsing XML / HTML
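
    Notes 2 and 4 (spacing out requests and caching) might be combined in a small wrapper like the one below. A minimal sketch only: the class name PoliteFetcher is made up, the cache is in memory (a cache on disk would survive between runs), and the fetch parameter exists so the wrapper can be tried out without touching the network.

```python
import time
import urllib.request


class PoliteFetcher:
    """Fetch each URL at most once, waiting between real HTTP requests."""

    def __init__(self, min_interval=1.0, fetch=None):
        self.min_interval = min_interval  # seconds between requests
        self.fetch = fetch or self._http_get
        self.cache = {}
        self.last_request = None

    @staticmethod
    def _http_get(url):
        with urllib.request.urlopen(url) as response:
            return response.read()

    def get(self, url):
        if url in self.cache:
            return self.cache[url]  # seen before: no HTTP request needed
        if self.last_request is not None:
            wait = self.min_interval - (time.monotonic() - self.last_request)
            if wait > 0:
                time.sleep(wait)
        body = self.fetch(url)
        self.last_request = time.monotonic()
        self.cache[url] = body
        return body
```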

    Enhancements:

    1. Enter name (not ID number). Searches remote site for name. Presents you with list of choices. Click on one. Program extracts ID no.
    2. Option to relate person to other lists of monarchs (e.g. French monarchs).
    3. Make it an online service.
    
    
    
    
  2. Tree matcher.

    Takes trees which are in a structured HTML format (e.g. GEDCOM 2 HTML), and tries to match up fragments of them with other trees in structured HTML format on the Web, looking for overlaps.

    Start with matching surname lists. Then look for overlaps round each individual.

    Similar to "What is like this?" above.
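
    The first step, matching surname lists, could start from something as simple as the following. This is a sketch under assumptions: the surnames have already been extracted from each tree, and the Jaccard measure (shared surnames divided by all distinct surnames) is just one plausible choice of overlap score, not something fixed by the project. The name surname_overlap is hypothetical.

```python
def surname_overlap(tree_a, tree_b):
    """Return (shared surnames, Jaccard similarity) for two surname lists."""
    # Normalise case and surrounding whitespace before comparing.
    a = {name.strip().lower() for name in tree_a if name.strip()}
    b = {name.strip().lower() for name in tree_b if name.strip()}
    shared = a & b
    union = a | b
    score = len(shared) / len(union) if union else 0.0
    return sorted(shared), score
```

    Trees with a high score would then be candidates for the more expensive second step of comparing the individuals themselves.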

    
    
    
    
  3. "GEDCOM 2 narrative" family tree converter - DONE BEFORE, BUT I HAVE SOME NEW IDEAS

    The standard format for computerised family trees is the GEDCOM format. Historically, the standard format for paper family trees has been the Burke's Peerage narrative format. The aim of the project is to write a converter between the two.
    The converter would take as input a family tree in GEDCOM format (there are many sample GEDCOM trees on the Web) and output the information in the Burke's narrative format in HTML (which is illustrated on my own Web pages).
    One of the main challenges for the software would be to automatically detect where to break the narrative and start a new narrative, something Burke's (and I) currently do by hand. The result would be a more flexible output than the databases provide.
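
    As a starting point for the converter, reading the input might look like the sketch below. It relies only on the basic GEDCOM line layout (a level number, an optional @...@ identifier, a tag, and an optional value). The function names are made up, and the hard part of the project, deciding where to break the narrative, is not attempted here.

```python
def parse_gedcom_line(line):
    """Split one GEDCOM line into (level, xref, tag, value)."""
    parts = line.strip().split(" ", 2)
    level = int(parts[0])
    xref = None
    if len(parts) > 1 and parts[1].startswith("@"):
        # Record lines look like "0 @I1@ INDI": the second field is an identifier.
        xref = parts[1]
        rest = parts[2].split(" ", 1) if len(parts) > 2 else [""]
    else:
        rest = parts[1:]
    tag = rest[0] if rest else ""
    value = rest[1] if len(rest) > 1 else ""
    return level, xref, tag, value


def individual_names(lines):
    """Map each individual's identifier to their first NAME value."""
    names = {}
    current = None
    for line in lines:
        if not line.strip():
            continue
        level, xref, tag, value = parse_gedcom_line(line)
        if level == 0:
            current = xref if tag == "INDI" else None
        elif level == 1 and tag == "NAME" and current and current not in names:
            # GEDCOM marks the surname with slashes, e.g. "John /Smith/".
            names[current] = value.replace("/", "").strip()
    return names
```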

    Perhaps to be done in cooperation with Tompsett at Hull. Debugged offline on separate data. End product is a script Tompsett could add to his site, so that we can see all of Tompsett's data in condensed Hypertext Narrative format.