Table of Contents
- 1. Data Import
- 2. Subsetting
- 3. Escape Hatch
- 4. Better Types
- 5. Appendix
This is a loose port of a dataframe tutorial Rosetta Stone to compare traditional dataframe tools built in R, Julia, Python, etc. with Frames. Performing data analysis in Haskell brings with it a few advantages:
- Interactive exploration is supported in GHCi
- GHC produces fast, memory-efficient code when you're ready to run a program that might take a bit of time
- You get to use Haskell to write your own functions when what you want isn't already defined in the library
- The code you write is statically typed so that mismatches between your code and your data are found by the type checker
The example data file used (specifically, the u.user file from the MovieLens 100k data set) does not include column headers, nor does it use commas to separate values, so it does not fall into the sweet spot of CSV parsing that Frames is aimed at. That said, this mismatch of test data and library support is a great opportunity to verify that Frames is flexible enough to meet a variety of needs.
We begin with rather a lot of imports to support a variety of test operations and parser customization. I encourage you to start with a smaller test program than this!
A few other imports will be used for highly customized parsing later.
1 Data Import
We usually package column names with the data to keep things a bit more self-documenting. In the common case where a data file has a header row providing column names, and columns are separated by commas, generating the types needed to import a data set is as simple as,
The data set this example considers is rather far from the sweet spot of CSV processing that Frames is aimed at: it does not include column headers, nor does it use commas to separate values! However, these mismatches do provide an opportunity to see that the Frames library is flexible enough to meet a variety of needs.
This Template Haskell splice explicitly specifies the name for the inferred record type, column names, a separator string, and the data file from which to infer the record type (i.e. what type should be used to represent each column). The result of this splice is included in an appendix below so you can flip between the generated code and how it is used.
Since this data is far from the ideal CSV file, we have to tell Frames how to interpret the data so that it can decide what data type to use for each column. Having the types depend upon the data in the given file is a useful exercise in this domain as the actual shape of the data is of paramount importance during the early import and exploration phases of data analysis.
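To make the idea of data-driven column types concrete, here is a minimal sketch in plain Haskell. It does not use the Frames API: the `ColType` universe, `splitOn`, `fieldType`, and `inferColumns` are hypothetical names, and the sample rows are made up in a pipe-separated, header-less layout like the one described above. It picks, for each column, the most specific type that every value in that column can be read as.

```haskell
import Data.List (transpose)
import Data.Maybe (isJust)
import Text.Read (readMaybe)

-- | Toy universe of column types, ordered from most to least specific.
--   (An assumption for illustration; Frames has its own, richer universe.)
data ColType = TInt | TDouble | TText
  deriving (Eq, Ord, Show)

-- | Split one line on a single-character separator (no quoting support).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOn sep rest

-- | The most specific type that can represent a single field.
fieldType :: String -> ColType
fieldType s
  | isJust (readMaybe s :: Maybe Int)    = TInt
  | isJust (readMaybe s :: Maybe Double) = TDouble
  | otherwise                            = TText

-- | A column's type is the least specific type needed by any of its fields.
inferColumns :: Char -> [String] -> [ColType]
inferColumns sep =
  map maximum . transpose . map (map fieldType . splitOn sep)

-- | Made-up rows: user id, age, gender, occupation, zip code.
sampleRows :: [String]
sampleRows =
  [ "1|24|M|technician|85711"
  , "2|53|F|other|94043"
  , "3|33|F|writer|T8H1N"
  ]
```

Loaded into GHCi, `inferColumns '|' sampleRows` evaluates to `[TInt,TInt,TText,TText,TText]`: a single non-numeric zip code is enough to push the last column down to text.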
We can load the module into cabal repl to see what we have so far.
This lets us perform a quick check that the types are basically what we expect them to be.
We now define a streaming representation of the full data set. If the data set is too large to keep in memory, we can process it as it streams through RAM.
Alternately, if we want to run multiple operations against a data set that can fit in RAM, we can do that. Here we define an in-core (in memory) array of structures (AoS) representation.
1.1 Streaming Cores?
A Frame is an in-memory representation of your data. The Frames library stores each column as compactly as it knows how, and lets you index your data as a structure of arrays (where each field of the structure is an array corresponding to a column of your data), or as an array of structures, also known as a Frame. These latter structures correspond to rows of your data. Alternatively, rows of data may be handled in a streaming fashion so that you are not limited to available RAM. In the streaming paradigm, you process each row individually as a single record.
A Frame provides O(1) indexing, as well as any other operations you are familiar with based on the Foldable class. If a data set is small, keeping it in RAM is usually the fastest way to perform multiple analyses on that data that you can't fuse into a single traversal.
A Producer of rows is a great way to whittle down a large data set before moving on to whatever you want to do next.
The upshot is that you can work with your data as a collection of rows with either a densely packed in-memory representation (a Frame) or a stream of rows provided by a Producer. The choice depends on whether you want to perform multiple queries against your data and, if so, whether you have enough RAM to hold the data. If the answer to both of those questions is 'Yes!', consider using a Frame as in the loadMovies example. If the answer to either question is 'Nope!', you will be better off with a Producer, as in the streaming representation defined above.
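The trade-off between the two representations can be sketched without the library. The block below is a toy model, not the Frames or pipes API: `MiniFrame`, `inCore`, `lastRows`, and `countWhere` are hypothetical names, a plain list stands in for a stream of rows, and `Data.Array` comes from the `array` package that ships with GHC.

```haskell
import Data.Array (listArray, (!))
import Data.List (foldl')

-- | Toy in-memory frame: a row count plus a constant-time row lookup.
data MiniFrame r = MiniFrame
  { frameLength :: Int
  , frameRow    :: Int -> r
  }

-- | Materialize rows into an array so every row can be indexed in O(1).
inCore :: [r] -> MiniFrame r
inCore rows = MiniFrame n (arr !)
  where
    n   = length rows
    arr = listArray (0, n - 1) rows

-- | Folding visits rows by index, so sum, maximum, length, etc. all work.
instance Foldable MiniFrame where
  foldr f z (MiniFrame n row) = foldr (f . row) z [0 .. n - 1]

-- | The last k rows, fetched by index rather than by walking every row.
lastRows :: Int -> MiniFrame r -> [r]
lastRows k (MiniFrame n row) = map row [max 0 (n - k) .. n - 1]

-- | Streaming alternative: one strict pass over rows as they arrive,
--   keeping only a running count instead of the rows themselves.
countWhere :: (r -> Bool) -> [r] -> Int
countWhere p = foldl' (\acc r -> if p r then acc + 1 else acc) 0
```

With `f = inCore [21, 35, 42, 58 :: Int]`, several independent queries can reuse the same materialized data (`maximum f` is `58`, `lastRows 3 f` is `[35,42,58]`), whereas `countWhere (> 40)` answers a single question in one pass and could just as well consume rows that never fit in memory together.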
1.2 Sanity Check
We can compute some easy statistics to see how things look.
When there are multiple properties we would like to compute, we can fuse multiple traversals into one pass using something like the foldl package,
Here we are projecting the age column out of each record, and computing the minimum and maximum age across all rows.
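For readers curious how two traversals can be fused into one, here is a small self-contained sketch of the idea behind an applicative fold. It is a hand-rolled stand-in written for this note, not the foldl package itself; `Fold`, `runFold`, `premap`, `minimumF`, `maximumF`, and `minMax` are names defined here, and rows are modelled as plain `(name, age)` pairs.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.List (foldl')

-- | A left fold with hidden state: step function, initial state, extractor.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

-- | Strict pair, so the combined state does not accumulate thunks.
data Pair a b = Pair !a !b

instance Functor (Fold a) where
  fmap f (Fold step begin done) = Fold step begin (f . done)

-- | Combining two folds runs both step functions on every input element,
--   which is what turns two traversals into a single pass.
instance Applicative (Fold a) where
  pure b = Fold (\() _ -> ()) () (\() -> b)
  Fold stepL beginL doneL <*> Fold stepR beginR doneR =
    Fold step (Pair beginL beginR) done
    where
      step (Pair xL xR) a = Pair (stepL xL a) (stepR xR a)
      done (Pair xL xR)   = doneL xL (doneR xR)

minimumF :: Ord a => Fold a (Maybe a)
minimumF = Fold step Nothing id
  where
    step Nothing  a = Just a
    step (Just m) a = Just $! min m a

maximumF :: Ord a => Fold a (Maybe a)
maximumF = Fold step Nothing id
  where
    step Nothing  a = Just a
    step (Just m) a = Just $! max m a

-- | Project a field out of each row before folding over it.
premap :: (r -> a) -> Fold a b -> Fold r b
premap f (Fold step begin done) = Fold (\x r -> step x (f r)) begin done

-- | Run a fold over a list in one strict traversal.
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin done) xs = done (foldl' step begin xs)

-- | Minimum and maximum computed together.
minMax :: Ord a => Fold a (Maybe a, Maybe a)
minMax = (,) <$> minimumF <*> maximumF
```

For example, `runFold (premap snd minMax) [("a", 24), ("b", 53), ("c", 23)]` evaluates to `(Just 23,Just 53)` while walking the list only once.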
2 Subsetting
2.1 Row Subset
Data may be inspected using either Haskell's traditional list API…
… or O(1) indexing of individual rows. Here we take the last three rows of the data set,
This lets us view a subset of rows,
2.2 Column Subset
We can consider a single column.
Or multiple columns,
If you'd rather not define a function like miniUser, you can fix the types in-line by using the
2.3 Query / Conditional Subset
Filtering our frame is rather nicely done using the pipes package. Here we pick out the users whose occupation is 'writer'.
If you're not too keen on all the pipes syntax in that example, you could also write it using a helper function provided by
This is a handy way to try out various maps and filters you may want to eventually apply to a large data set.
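The shape of such a query can be sketched with ordinary lists. This is not the pipes or Frames code from the tutorial: the `User` record and its field names are illustrative assumptions, and `withOccupation` and `preview` are hypothetical helpers.

```haskell
-- | A cut-down row; field names are illustrative, not generated by Frames.
data User = User
  { userId     :: Int
  , age        :: Int
  , occupation :: String
  } deriving (Eq, Show)

-- | Keep only the rows whose occupation column equals the given value.
withOccupation :: String -> [User] -> [User]
withOccupation job = filter ((== job) . occupation)

-- | Try a query out on the first n matching rows. Laziness means the
--   input is only consumed until n results have been found.
preview :: Int -> ([User] -> [User]) -> [User] -> [User]
preview n query = take n . query

-- | Made-up sample rows.
sampleUsers :: [User]
sampleUsers =
  [ User 1 24 "technician"
  , User 2 53 "other"
  , User 3 23 "writer"
  , User 4 36 "writer"
  ]
```

In GHCi, `map userId (preview 1 (withOccupation "writer") sampleUsers)` gives `[3]`; dropping the `preview 1` wrapper runs the same filter over every row.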
2.4 Column Subset Update
We can also apply a function to a subset of columns of each row! Here, we want to apply a function with type Int -> Int to two columns whose values are of type
Let's preview the effect of this function by applying it to the
Age columns of the first three rows of our data set.
This is a neat way of manipulating a few columns without having to worry about what other columns might exist. You might want to use this for normalizing the capitalization, or truncating the length of, various text fields, for example.
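The same "touch these columns, leave the rest alone" idea can be sketched with a hand-written traversal over a plain record. This is an illustration under assumptions, not the Frames mechanism: the `User` record, the choice of `userId` and `age` as the two `Int` columns, and the names `intColumns` and `overInts` are all hypothetical.

```haskell
import Data.Functor.Identity (Identity (..))

-- | A cut-down row with two Int columns and one text column.
data User = User
  { userId     :: Int
  , age        :: Int
  , occupation :: String
  } deriving (Eq, Show)

-- | Visit exactly the two Int columns, rebuilding the row around them.
--   Every other field is carried over untouched.
intColumns :: Applicative f => (Int -> f Int) -> User -> f User
intColumns f u =
  (\i a -> u { userId = i, age = a }) <$> f (userId u) <*> f (age u)

-- | Apply a pure Int -> Int function to both columns at once.
overInts :: (Int -> Int) -> User -> User
overInts g = runIdentity . intColumns (Identity . g)
```

For instance, `overInts (* 2) (User 1 24 "technician")` yields `User {userId = 2, age = 48, occupation = "technician"}`. Adding more fields to `User` would not require any change to `overInts`.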
2.5 Mostly-Uniform Data
(Warning: This section veers into types that are likely of more use to library authors than end users.)
Suppose we don't know much about our data, but we do know that it starts with an identifying column, and then some number of numeric columns. We can structurally peel off the first column, perform a constrained polymorphic operation on the other columns, then glue the first column back on to the result.
But what if we don't want to rely entirely on the ordering of the columns in our rows? Here, we know there is an identifying column, Occupation, and we want to shuffle it around to the head of the record while mapping a constrained polymorphic operation over the other columns.
It is a bit clumsy to delete and then add back a particular field, and the dependence on explicit structure is relying a bit more on coincidence than we might like. We could choose, instead, to work with row types that contain a distinguished column somewhere in their midst, but regarding precisely where it is, or how many other fields there are, we care not.
We can unpack this type a bit to understand what is happening. A Record is a record from the Vinyl library, except that each type has phantom column information. This metadata is available to the type checker, but is erased during compilation so that it does not impose any runtime overhead. What we are doing here is saying that we will operate on a Frames row type, Record rs, that has an element Occupation, and that deleting this element works properly (i.e. the leftover fields are a proper subset of the original row type). We further state, with the AsVinyl constraint, that we want to work on the unadorned field values, temporarily discarding their header information, with the mapMethod function that will treat our richly-typed row as a less informative
We then peer through a lens onto the set of all unadorned fields other than Occupation, apply a function with a Num constraint to each of those fields, then pull back out of the lens reattaching the column header information on our way. All of that manipulation and bookkeeping is managed by the type checker.
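To show what "map a Num-constrained function over every field but one" can look like at the type level, here is a deliberately small sketch. It is a hand-rolled heterogeneous record, not Vinyl or Frames: `Rec`, `All`, `mapAllNum`, and `onTail` are hypothetical names, there are no column labels, and the distinguished column is simply assumed to be the first field.

```haskell
{-# LANGUAGE ConstraintKinds, DataKinds, GADTs, KindSignatures,
             RankNTypes, TypeFamilies, TypeOperators #-}
import Data.Kind (Constraint, Type)

infixr 5 :&

-- | A record whose field types are tracked in a type-level list.
data Rec (ts :: [Type]) where
  RNil :: Rec '[]
  (:&) :: t -> Rec ts -> Rec (t ': ts)

-- | Require a class constraint of every field type in the list.
type family All (c :: Type -> Constraint) (ts :: [Type]) :: Constraint where
  All c '[]       = ()
  All c (t ': ts) = (c t, All c ts)

-- | Apply one Num-polymorphic function to every field of a record.
mapAllNum :: All Num ts => (forall a. Num a => a -> a) -> Rec ts -> Rec ts
mapAllNum _ RNil      = RNil
mapAllNum f (x :& xs) = f x :& mapAllNum f xs

-- | Peel off the first field, map over the rest, glue the first back on.
onTail :: All Num ts
       => (forall a. Num a => a -> a) -> Rec (s ': ts) -> Rec (s ': ts)
onTail f (h :& rest) = h :& mapAllNum f rest
```

Applying `onTail (+ 2)` to `"u1" :& (1 :: Int) :& (2.5 :: Double) :& RNil` leaves the identifier alone and produces `3` and `4.5` in the numeric fields. If any field after the first were a `String`, the call would be rejected at compile time because the `Num` constraint could not be satisfied for that field.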
Lest we forget we are working in a typed setting, what happens if the constraint on our polymorphic operation can't be satisfied by one of the columns?
This error message isn't ideal in that it doesn't tell us which column failed to satisfy the constraint. Hopefully this can be improved in the future!
3 Escape Hatch
When you're done with Frames and want to get back to more familiar monomorphic pastures, you can bundle your data up.
4 Better Types
A common disappointment of parsing general data sets is the reliance on text for data representation even after parsing. If you find that the default Columns spectrum of potential column types that Frames uses doesn't capture desired structure, you can go ahead and define your own universe of column types! The User row type we've been playing with here is rather boring: it only uses
Text column types. But Text is far too vague a type for a column like
All of the zip codes in this set are five characters, and most arestandard numeric US zip codes. Let's go ahead and define our ownuniverse of column types.
Note that these definitions must be imported from a separate module to satisfy GHC's stage restrictions related to Template Haskell. The full code for the custom type may be found in an appendix.
We name this record type U2, and give all the generated column types and lenses a prefix, 'u2', so they don't conflict with the definitions we generated earlier.
This new record type,
U2, has a more interesting
Let's take the occupations of the first 10 users from New England, New Jersey, and other places whose zip codes begin with a zero.
So there we go! We've done both row and column subset queries with a strongly typed query (namely, isNewEngland). Another situation in which one might want to define a custom universe of column types is when dealing with dates. This would let you both reject rows with badly formatted dates, for example, and efficiently query the data set with richly-typed queries.
Even better, did you notice the types of
neOccupations? They are polymorphic over the full row type! That's what the (Occupation ∈ rs) constraint signifies: such a function will work for record types with any set of fields, rs, so long as Occupation is an element of that set. This means that if your schema changes, or you switch to a related but different data set, these functions can still be used without even touching the code. Just recompile against the new data set, and you're good to go.
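The zip-code refinement described in this section can be sketched in plain Haskell. Everything below is an assumption made for illustration rather than the tutorial's own definitions: `Zip` stands in for the custom column type, its two constructors and `parseZip` are hypothetical, the body of `isNewEngland` is a guess based on the "begins with a zero" description, and rows are modelled as `(occupation, raw zip)` pairs.

```haskell
import Data.Char (digitToInt, isDigit)

-- | Hypothetical refined zip code: five digits, or five other characters.
data Zip
  = ZipUS Int Int Int Int Int
  | ZipWorld Char Char Char Char Char
  deriving (Eq, Show)

-- | Accept exactly five characters; all-digit codes get the numeric form.
parseZip :: String -> Maybe Zip
parseZip s@[a, b, c, d, e]
  | all isDigit s =
      Just (ZipUS (digitToInt a) (digitToInt b) (digitToInt c)
                  (digitToInt d) (digitToInt e))
  | otherwise = Just (ZipWorld a b c d e)
parseZip _ = Nothing

-- | Assumed definition: a numeric zip code whose first digit is zero.
isNewEngland :: Zip -> Bool
isNewEngland (ZipUS 0 _ _ _ _) = True
isNewEngland _                 = False

-- | Occupations of the first n rows with a leading-zero zip code.
--   Rows whose zip code fails to parse are skipped.
neOccupations :: Int -> [(String, String)] -> [String]
neOccupations n rows =
  take n [ occ | (occ, raw) <- rows
               , Just z <- [parseZip raw]
               , isNewEngland z ]
```

Because the query pattern-matches on a structured value instead of inspecting text, a malformed entry such as a four-character code never reaches `isNewEngland` at all; it is rejected at parse time.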
5 Appendix
5.1 User Types
Here are the definitions needed to define the MyColumns type with its more descriptive ZipT type. We have to define these things in a separate module from our main work due to GHC's stage restrictions regarding Template Haskell. Specifically, ZipT and its instances are used at compile time to infer the record type needed to represent the data file. Notice the extension point here is not too rough: you prepend new, more refined, type compatibility checks to the head of CommonColumns, or you can build up your own list of expected types.
This may not be something you'd want to do for every data set. However, the ability to refine the structure of parsed data is in keeping with the overall goal of Frames: it's easy to take off, and the sky's the limit.
5.2 Splice Dump
The Template Haskell splices we use produce quite a lot of code. The raw dumps of these splices can be hard to read, but I have included some elisp code for cleaning up that output in the design notes for Frames. Here is what we get from the tableTypes' splice shown above.
The overall structure is this:
- A record type, User, with all necessary columns
- A userParser value that overrides parsing defaults
- A type synonym for each column that pairs the column name with its type
- A lens to work with each column on any given row
Remember that for CSV files that include a header, the splice you write in your code need not include the column names or separator character.
Thanks to Greg Hale and Ben Gamari for reviewing early drafts of this document.
Since the StereoMorph digitizing application cannot read videos directly, the frames must be extracted from the videos and input to the digitizing application as images. Frames can be extracted in StereoMorph using the function extractFrames(). Before using extractFrames() be sure that you have completed the steps in installing ffmpeg so that R can read the video files. If you'd like to work through the example below, you can download an example video file here (10 MB). Note that in Safari you may have to right-click and select 'Download Video' rather than using File/Save As.
1. If you are unsure of how many frames the video has or which frames you would like to extract, you can call extractFrames() without any parameters.
The function will prompt you to enter a video file that you want to extract frames from. Either type the video file path or simply click and drag the video file into the R console and the file path will be copied over.
2. Next, the function will ask where you want to save the extracted frames. Either type a file path to a folder or simply click and drag a folder into the R console and the file path will be copied over.
3. The function will then tell you the number of total video frames and ask you to enter the frames that you want to extract. Note that the first frame is frame 0. The example video has 100 frames total, so you can enter any frames between 0 and 99. You can specify the frames you want to extract by entering a single number, using the ':' symbol or using the c() or seq() functions:
By default, if the number of frames you set to extract is greater than 100, the function will list all the frames to be extracted and issue a second prompt to ask if you are sure (to avoid extracting thousands of frames by mistake). This warning can be turned off by setting the warn.min parameter to any number larger than the total number of frames in the video (e.g. 1000000).
4. If you already know the input parameters in advance and want to use the function without any prompts, you can just set these parameters in the function call. Create a folder named 'Frames' in your current R working directory.
5. Call extractFrames() with all the input parameters.