Here are some examples of structured and unstructured data projects and services (which at times overlap). And remember that data is almost always wrong but sometimes it is useful!
Structured data (pre-defined and machine-readable, locatable, usually with a relational ‘data model’, and usually about real-world objects)
- What is meta-data? (Australian National Data Service) http://www.ands.org.au/
- Library Catalogues (date, author, place, subject, etc)
- Census records (birth, income, employment, place etc.)
- Federal and State Hansard http://www.openaustralia.org/
- Legal records: Old Bailey Online (1674-1913) http://www.oldbaileyonline.org/
- Economic data (GDP, PPI, ASX etc.)
- Facebook Like button (big-data collection!)
- Phone numbers (and the phone book)
- Databases (structuring fields)
- XML-TEI (bringing structure to the text by tagging particular elements, like versions of the word ‘canal’ in 17th C Dutch)
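To give a feel for what tagging buys you, here is a minimal sketch in Python. It assumes a heavily simplified, namespace-free TEI-like fragment in which each word is wrapped in a `w` element carrying a `lemma` attribute; the sample spellings and the helper name `variants_of` are invented for illustration and are not taken from any real edition.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified TEI-like fragment (no namespace). The spellings
# are invented placeholders, not attested 17th-century forms.
SAMPLE = """
<p>
  <w lemma="canal">canael</w> <w lemma="ship">schip</w>
  <w lemma="canal">kanael</w> <w lemma="canal">canael</w>
</p>
"""

def variants_of(xml_text, lemma):
    """Count the surface spellings tagged with the given lemma."""
    root = ET.fromstring(xml_text)
    return Counter(
        w.text.strip()
        for w in root.iter("w")
        if w.get("lemma") == lemma and w.text
    )

canal_variants = variants_of(SAMPLE, "canal")
print(canal_variants)
```

Because the variants share a tag, they can be found together even though a plain keyword search would treat each spelling as a different word.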
Un-structured data (no pre-defined data model, usually text, but there is always some structure)
The techniques for dealing with unstructured data usually involve text analysis (sometimes statistical) to look for patterns (semantic, linguistic, or historical: dates, numbers, facts, etc.) as an aid to search and discovery (not analysis; that involves critical humanities scholars). The patterns can be small (e.g. a single author) or large-scale (e.g. a newspaper corpus), but sometimes so large-scale that the results may lack meaning.
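As a rough illustration of pattern-finding in plain text, the sketch below counts words and pulls out things that look like years. The sample text and the function names `word_counts` and `find_years` are assumptions made up for this example; real tools do a great deal more (stemming, stop-word lists, named entities).

```python
import re
from collections import Counter

# Hypothetical sample text; any plain-text document could be used instead.
TEXT = (
    "The court sat in 1674 and again in 1913. "
    "The court heard the case, and the case was closed."
)

def word_counts(text):
    """Lower-case the text and count runs of letters as words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def find_years(text):
    """Return four-digit numbers from 1000 to 2099 that look like years."""
    return re.findall(r"\b(?:1\d{3}|20\d{2})\b", text)

counts = word_counts(TEXT)
years = find_years(TEXT)
print(counts.most_common(3))
print(years)
```

The counts and the extracted years only point to where something might be worth reading; deciding what it means is still the scholar's job.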
- The Web! (Google’s PageRank algorithm)
- Email (body), web page, word-processed document
- Voyant Tools http://docs.voyant-tools.org/tools/
- TROVE http://trove.nla.gov.au/
- Stylometry (e.g. the work of the Australian John Burrows); the Federalist Papers are a good example
- Factiva http://unimelb.libguides.com/content.php?pid=99524&sid=761325
- British Newspapers (first ‘newspaper’ 1702) http://www.britishnewspaperarchive.co.uk/
- Google NGrams http://books.google.com/ngrams
- Health records (and there is a move to e-records)
- Topsy (Social web-research) http://topsy.com/
- Spying with un-structured ‘big data’ (NSA). Perhaps good for real-time analysis or locating one specific individual.
- SPSS and NVivo (commercial products for un-structured data)
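The stylometry entry in the list above can be sketched very roughly in code. The idea is to compare how often texts use common function words. This is a much-simplified comparison, not Burrows's own method; the word list, the sample sentences, and the names `profile` and `distance` are all assumptions invented for illustration.

```python
import re
from collections import Counter

# A small, hand-picked set of English function words (an assumption for
# illustration; real studies use far longer lists drawn from the corpus).
FUNCTION_WORDS = ["the", "of", "and", "to", "upon", "by"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(text_a, text_b):
    """Mean absolute difference between two function-word profiles."""
    pa, pb = profile(text_a), profile(text_b)
    return sum(abs(a - b) for a, b in zip(pa, pb)) / len(FUNCTION_WORDS)

# Invented placeholder sentences, not quotations from any real author.
known_a = "upon the whole the plan of the union depends upon the people"
known_b = "by the consent of the states and by the laws to be made"
disputed = "upon the merits of the plan and upon the people"

print(distance(disputed, known_a), distance(disputed, known_b))
```

A smaller distance suggests a closer match in habits of word use, but with texts this short the numbers mean very little; they only show the shape of the technique.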