Last year, I posted an article on my experience ingesting a CSV file into Druid. Yesterday, I got an email from a reader who needed more detail to make it work. So, today, I recreated that work and am posting a step-by-step guide.
The first step is to install Druid. I did this through Homebrew.
At the time of writing, Druid 0.11.0 is the latest release.
Next, I went to Kaggle and downloaded a fresh copy of the FiveThirtyEight dataset, which is distributed as a ZIP file called data.zip.
I created a folder on my machine at ~/sandbox, unzipped the data, and got started.
In particular, I’m interested in the file us-weather-history/KSEA.csv.
The first thing Druid wants is for Apache ZooKeeper to be installed and running. I use Homebrew to make sure it’s installed, find out where it’s installed, and start it.
You’ll notice I had a false start due to permissions, since I’m running as a non-privileged user.
The next step is to start Druid. Again, I rely on Homebrew to tell me where Druid is installed.
Darn. Let’s try to install Druid using the quickstart guide instructions, to see if bin/init is included.
Success! Next, I ran the commands in the tutorial that start the Druid processes. I use iTerm2, so I used its Duplicate Tab command to open a new terminal tab per process. For example:
The tutorial continues with importing some Wikipedia data. Let’s continue to follow the tutorial to make sure Druid is set up correctly, then we’ll import our CSV weather data.
I cut and pasted the JSON file I created in my 2017 post into a file.
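A spec file like this is submitted to Druid by POSTing it to the overlord's task endpoint, which the quickstart does with curl. The following is a minimal Python sketch of the same request, not the exact command I ran. It assumes the overlord is listening on localhost:8090 (the quickstart default) and that the spec is saved as seattle-weather.json. The function names are hypothetical helpers of my own.

```python
import json
import urllib.request

# Assumption: the overlord listens on localhost:8090, as in the quickstart config.
OVERLORD_TASK_URL = "http://localhost:8090/druid/indexer/v1/task"


def build_task_request(spec_path, url=OVERLORD_TASK_URL):
    """Build (but do not send) a POST request carrying the ingestion spec."""
    with open(spec_path, "rb") as f:
        body = f.read()
    json.loads(body)  # fail early if the file is not valid JSON
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(spec_path, url=OVERLORD_TASK_URL):
    """Send the spec and return the overlord's decoded JSON response."""
    with urllib.request.urlopen(build_task_request(spec_path, url)) as resp:
        return json.loads(resp.read())
```

Calling `submit_task("seattle-weather.json")` from the folder that holds the spec should return the overlord's JSON response, which you can then look up in the console.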
As mentioned in my 2017 post, I also made one modification to the FiveThirtyEight data file. I loaded the file in Excel, changed the date column format to ‘yyyy-mm-dd’, and saved the result.
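If you would rather not round-trip the file through Excel, the same reformatting can be scripted. This is a rough sketch, not what I actually did. It assumes the column is named `date` and that the values look like `2014-7-1` (year, month, day without zero padding); adjust `column` and `in_format` if your copy differs. `normalize_dates` is a hypothetical helper name.

```python
import csv
from datetime import datetime


def normalize_dates(src_path, dst_path, column="date", in_format="%Y-%m-%d"):
    """Copy a CSV, rewriting one column as zero-padded yyyy-mm-dd."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # strptime accepts unpadded months and days; strftime pads them.
            parsed = datetime.strptime(row[column], in_format)
            row[column] = parsed.strftime("%Y-%m-%d")
            writer.writerow(row)
```

Writing to a separate output file keeps the original download intact in case the assumed input format turns out to be wrong.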
Now, let’s check the UI. First, the overlord console at port 8090, which is where indexing tasks and their logs show up.
Ah, it appears the job has failed. Clicking on the log button, it’s easy to see why:
Input path does not exist: file:/Users/me/sandbox/druid-0.11.0/ksea.csv
I moved up to the sandbox folder, then modified my seattle-weather.json file to match the location on disk. For the record, here’s the file in its entirety:
Keep in mind that if you use a relative path like I did, it is resolved against the folder in which you started the Druid processes; in my case, sandbox/druid-0.11.0.
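To check where a relative path will land before submitting a job, you can resolve it by hand against the start folder. The sketch below only illustrates the rule; `resolve_from` is a hypothetical helper, and the `../us-weather-history/KSEA.csv` path is an example I made up, not necessarily what belongs in your spec.

```python
import posixpath


def resolve_from(path, started_in):
    """Return the absolute path that `path` means to a process started in `started_in`."""
    # join() leaves absolute paths untouched; normpath() collapses any "..".
    return posixpath.normpath(posixpath.join(started_in, path))


start = "/Users/me/sandbox/druid-0.11.0"
print(resolve_from("ksea.csv", start))
# /Users/me/sandbox/druid-0.11.0/ksea.csv
print(resolve_from("../us-weather-history/KSEA.csv", start))
# /Users/me/sandbox/us-weather-history/KSEA.csv
```

The first result is exactly the path from the failed job's log, which is why a bare file name did not find a file that actually lived elsewhere under sandbox.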
The job ran for a while longer, but failed again. I saw JNDI and JMX exceptions again, as I did last year. However, toward the end of the log, I did find some useful exceptions: