I was very excited yesterday when I found out how to save some time by using R instead of doing manual re-formatting in Excel. I had to prepare a groundwater chemistry dataset for upload to Jupiter database, which included both reformatting and adding some standard codes to the table.
The reformatting part was the one that bothered me most, because the lab results were reported with the “<” symbol in front of the concentration for all values below the limit of detection (LOD). Also, sometimes there would be additional text after the numbers, indicating if the specific concentration was indeed the LOD or a higher value. And all that in the same field, e.g. “<0,0020 (LOD)”.
It is a fairly simple thing to do in Excel, you just need to add a column with flags indicating that the value is “<” (as there were also samples with actual detections), to change the “,” to “.”, to remove the text and keep only the number, and it is done.
The problem was that the table was 51 x 280 (51 rows with different locations and 280 columns with different parameters). Some of the parameters had different LODs as well. So this very easy excel task became something very boring and time consuming. Not to mention risky, as making so many changes manually in excel is doomed to result in at least one error.
So, instead, I imported the file to R, pivoted it, so it is in a long format, and then used the “stringr” R package to clean-up the file.
Below I have embeded my tweet from yesterday (left) and have provided a link to an R Notebook with how 3 functions from the “stringr” R package work.
I just found out how useful are the str_extract, str_split_fixed, and str_replace_all (#stringr R package) with a very untidy lab-dataset (51x280) incl. different strings in the general format "<Decimal,number (characters)" (e.g. "<0,0020 (LOD)) mixed with "Decimal.number" #R
— Denitza Voutchkova (@DenitzaV) August 26, 2021
Leave a Reply