OpenRefine is as useful for faceted browsing as it is for data cleaning. You can get to grips with the structure and character of a large and irregular data set without altering anything.
When joining columns it can make things easier to read if you put a space between them.
So if colA contains cat and colB contains hat,

cells.colA.value + cells.colB.value

gives you cathat, while

cells.colA.value + " " + cells.colB.value

gives you cat hat.
You will then want to make sure you remove extra spaces from the string at the end.
trim() removes any spaces from the start or end of the value and replace(/\s+/," ") replaces any run of whitespace with a single space. The two commands can be joined together:

value.trim().replace(/\s+/," ")
You can include this in the command when originally creating the new column, ending up with:
(cells.colA.value + " " + cells.colB.value).trim().replace(/\s+/," ")

(The brackets around the join matter – without them the trim() and replace() would only apply to colB.)
This is the equivalent of doing the join and then running both of the Edit Cells > Commonly Used Transformations options Remove leading or following white spaces and Remove multiple white spaces.
When sticking the values from different columns together Refine can sometimes balk when one of the values is null, returning null for the joined-up string even if there are values in some of the columns. This will replace any blank or null values with an empty string:

forNonBlank(value, v, v, "")

If you’re referring to another column, replace value with cells["Column name"].value.
So if you were working with a dataset where some cells contained Genus information, some Subgenus and some Species - but not all of them in every row - you end up with something along the lines of
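For instance – a sketch only, assuming the columns are named Genus, Subgenus and Species, and using forNonBlank() to skip the empty cells:

```
(
  forNonBlank(cells["Genus"].value, v, v + " ", "")
  + forNonBlank(cells["Subgenus"].value, v, "(" + v + ") ", "")
  + forNonBlank(cells["Species"].value, v, v, "")
).trim().replace(/\s+/," ")
```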
That’s formatted to hopefully make it a bit easier to read - spaces or new lines don’t matter to Refine, so I use them a lot to make what I’m doing clearer to myself.
It’s easy to miss the “permalink” link just to the right of your project name at the top left of the screen. It’s useful when you’re doing lots of different checks but always starting with the same few transformations: get the facets set up to your starting point and save the permalink, and next time you start work open that rather than the project.
I’m currently working with a bunch of datasets which I need to bring to a standardised format so they can be transferred into another management system. Before cleaning or transforming any of the data, lots of faceted browsing is handy for getting a grip on just what you’re working with.
There’s a few different transformations and facets I need to be doing frequently so I’m recording them here to make an easy reference for myself.
Facet on the following and then choose “true” (does contain the string) or “false” (doesn’t contain the string or is blank) as appropriate.
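The expression itself hasn’t survived in this copy; a custom text facet along these lines fits the description – for example, checking whether the value contains a space:

```
value.contains(" ")
```

Swap the " " for whatever string you’re interested in, then click true or false in the facet box.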
To check to see if there are entries with multiple words, and what kind of entries they are.
“value.split” doesn’t do what I want here, because it returns an array with as many entries as there are bits separated by spaces. This way if the entry happens to be, say, a whole sentence then you’ll see a bunch of words rather than just the first word after the first space. Then you can facet by Facet > Customized Facets > Text length facet and see if there’s anything which obviously doesn’t belong.
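The expression used isn’t shown here; GREL’s partition() gives the “everything after the first space” behaviour described – a guess at the sort of thing:

```
value.partition(" ")[2]
```

partition(" ") returns an array of three strings – everything before the first space, the space itself, and everything after it – so taking [2] keeps a trailing sentence together as one lump instead of splitting it word by word.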
A common convention, particularly with library cataloging, is that comments added by a compiler or cataloger get added in square brackets to distinguish them from the original data. To isolate out these comments:
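The expression has been lost here; a sketch that fits the description – facet on whether the value contains an opening square bracket:

```
value.contains("[")
```

To pull the comment itself into a new column, something along the lines of value.match(/.*(\[[^\]]*\]).*/)[0] would do it, though both of these are assumptions rather than the original recipe.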
… But not convert to Title Case - just capitalise the very first word of the sentence or bunch of words that make up the entry.
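A sketch of that – uppercase just the first character and leave the rest alone (assumes the value isn’t empty):

```
value.substring(0,1).toUppercase() + value.substring(1)
```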
To facet by whether the value is blank or another particular word – the example I’m working with is a column where a load of the values read “none”; equally you might want to ignore “n/a” or something like that. Facet on:

isBlank(value.replace("none",""))
Then choose “false”
What it’s doing is searching for blanks, but only after replacing any occurrences of “none” with a blank.
When tidying up data it can be useful to be able to see how many lumps of text are in a field. I work with a lot of taxonomic data and I often want to see if the information is consistently structured like genus-name species-name.
To split up the value at any spaces and then count the number of “lumps” generated by that split:

value.split(" ").length()
This gives results of 1, 2 and 3, meaning the value had either 1, 2 or 3 “words” in it.
So I can see that the second and third lumps of text should really be considered as a single unit. This helps avoid mistakes like assuming that if there’s a third word then it’ll be referring to a subspecies.
Sometimes when using Refine you’re not sure if the column you’re checking actually needs to be cleaned up – it looks like it’s the same as another one, but you want to be sure:
Take two columns COLUMN-ONE and COLUMN-TWO which look like they probably contain the same information, but it’s hard to tell for sure as there’s a few tens of thousands of entries in each of them.
In order to compare whether two columns which seem to be identical actually are, add a new column based on COLUMN-TWO using:

value == cells["COLUMN-ONE"].value
Or (if you like to type everything out):

cells["COLUMN-TWO"].value == cells["COLUMN-ONE"].value
and then facet on the new column and check whether there are any ‘false’ results.
This can give confusing answers if there are blank cells – “null” while typing out the expression, or “(blank)” once you’ve clicked OK and the results are in the facet box on the right-hand side. If that happens, use forNonBlank() as indicated above.
If you’ve named your columns with single words you can use the slightly clearer dot syntax (I always give my columns single-word names because of this):

cells.ColumnOne.value == cells.ColumnTwo.value
If you’re not getting useful results when comparing two columns make sure that the format of the data is the same – Refine doesn’t think that a date 1984 and a string “1984” are identical. To compare them add a toString() where appropriate, so you’d end up with something like:

value.toString() == cells["COLUMN-ONE"].value
I want to get one column of dates and compare it to the dates in another column called “Authority”.
The entries in Authority look like “Smith, 1985” or “(Smith, 1985)”; the entries in Year look like “1986”.
To get rid of the brackets use value.replace("(","") and value.replace(")",""), which combined makes:

value.replace("(","").replace(")","")
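Putting it together – a sketch only, and it assumes the year always sits at the end of the Authority entry:

```
cells["Authority"].value
  .replace("(","")
  .replace(")","")
  .endsWith(cells["Year"].value.toString())
```

Facet on that and pick “false” to find the rows where the two columns disagree.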