unstructured data [English]


InterPARES Definition

n. ~ 1. Data that is not organized according to a data model or pattern. – 2. Information, especially free-flowing text, that is not fielded, tagged, or otherwise classified for access by database queries.

General Notes

Unstructured data is exemplified by narrative or discursive text. Note that such text may sit in a container that provides some structure. For example, the body of an email message may be unstructured, but the email container itself has structure, including elements for sender, recipients, date, and subject.
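
The email example can be illustrated with a minimal sketch in Python, using the standard library `email` package. The sample message, its addresses, and the helper name `split_email` are hypothetical and chosen only for illustration. The sketch separates the fielded header elements, which can be looked up by name, from the free-text body, which comes back as a single undifferentiated string.

```python
from email import message_from_string

# Hypothetical sample message: four fielded headers, a blank line, then free text.
RAW_MESSAGE = """\
From: alice@example.org
To: bob@example.org
Date: Mon, 06 Jan 2014 09:30:00 +0000
Subject: Meeting notes

Hi Bob, here are a few thoughts from yesterday. We should probably revisit
the retention schedule before the end of the month.
"""


def split_email(raw):
    """Return (structured header fields, unstructured body text)."""
    msg = message_from_string(raw)
    # The container is structured: each element is addressable by field name.
    structured = {name: msg[name] for name in ("From", "To", "Date", "Subject")}
    # The body is unstructured: one string with no fields or tags to query.
    unstructured = msg.get_payload()
    return structured, unstructured


headers, body = split_email(RAW_MESSAGE)
print(headers["Subject"])
print(body)
```

A query such as "all messages from a given sender" can be answered from the header fields alone, whereas finding what the message says about the retention schedule requires reading or text-searching the body.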

Citations

  • Arthur 2013 (†454): Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it’s text-heavy. Metadata, Twitter tweets, and other social media posts are good examples of unstructured data. (†625)
  • Franks 2013 (†560 p.36): Unstructured data is anything not in a database. Images, word documents, and even tweets are examples of unstructured data. Unstructured data is more difficult to classify, maintain, archive, and dispose of than structured data. (†1849)
  • Gingrich & Morris 2006 (†358 p. 31): Unstructured electronic records consist of electronic information created or obtained by end users where the information is not stored in tables in a relational database system. Unstructured records in a typical organization would include emails, word processing documents, spreadsheets, presentations and graphics – documents mostly created by individual users from desktop applications. Unstructured records would also include Adobe PDF files and electronic captures of facsimiles as well as other image files. (†347)
  • NIST 2013 (†734 p. F-16): Unstructured data typically refers to digital information without a particular data structure or with a data structure that does not facilitate the development of rule sets to address the particular sensitivity of the information conveyed by the data or the associated flow enforcement decisions. Unstructured data consists of: (i) bitmap objects that are inherently non language-based (i.e., image, video, or audio files); and (ii) textual objects that are based on written or printed languages (e.g., commercial off-the-shelf word processing documents, spreadsheets, or emails). (†1810)
  • What is Unstructured Data 2014 (†455): Unstructured data can be broken down into different groups. A well known group is multimedia or rich media. Here there are types like digital image, audio, video and document (though there are more in this list). Some of these types are well defined and can contain embedded in them XML (or other) that conform to an agreed set of standards. The format of the binary data can also follow agreed rules. The digital image format JPEG is an open standard. For video, MPEG is also an open standard. Multimedia would be a category of unstructured data that is well defined. Its category is fluid and changing as technology changes and unlikely will ever be able to conform to the mathematically and well proven relational structure. (†629)
  • Wikipedia (†387 s.v. "data model"): A data model explicitly determines the structure of data or structured data. Typical applications of data models include database models, design of information systems, and enabling exchange of data. Usually data models are specified in a data modeling language. (†616)
  • Wikipedia (†387 s.v. "unstructured data"): Information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional computer programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents. (†617)
  • Wikipedia (†387 s.v. "unstructured data"): In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but nonetheless is accepted by some. ¶IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold growth from the beginning of 2010. Computer World states that unstructured information might account for more than 70%–80% of all data in organizations. [Notes omitted.] (†618)