Over the past month, I’ve gotten this question way more times than I can handle. I think it’s a fluke / temporary thing. Apache DataFusion is participating in the Google Summer of Code program this year, and I have volunteered to be one of its mentors. There has been a lot of interest in this program, and I’ve been getting asked over and over again what people can do to get started.
I have to throw a caveat out there on this post, though. I don’t think I’m extremely qualified to answer the question of how to get started in open source. My open source contributions really only started in earnest just under a year ago. Up until then, nearly all of my work had been in the private domain. But since I’m getting the question, I’ll try to collect my thoughts here.
About twice a month, I’ll get someone coming to me to ask for career advice. I’m starting to collect the more common questions and my thoughts on them. I’m putting these together as a series of blog posts under the tag CareerPlusPlus.
What’s Different
One thing you have to know about working in open source projects is that the majority of the people working on them are doing so in their free time. In the ~11 months I’ve been on DataFusion, only in the last few weeks have I been able to do any work on it that I get paid to do. Everything else has been voluntary.
Why does that make a difference? It means that community must be a first-class priority. If you want to build something great that is going to require a large developer base, you’re going to have to build an environment that people want to be a part of. To be honest, that’s what pulled me into DataFusion. I found a small problem that needed fixing, put together a solution, and submitted it. Then I enjoyed the experience of working with the community so much that I kept going at it. Almost 11 months later I feel like a fully integrated member of the community because they were warm, welcoming, and awesome.
I cannot make any demands of anyone on the project. They can simply walk away. If I want people to help with what I’m working on, I have to make them want to help me.
For me, that is a huge attraction for working on open source. As crusty as I may sometimes be (GenX and a veteran here), I do love building community. I believe community is the key to successful open source.
Because we never know how many people will be around to work on the code, we can’t do things like build good timelines or roadmaps. I can estimate my own effort to a pretty good degree, but I have no idea what’s going on in the lives of the other people. Will they have time? Will they want to contribute? It’s never certain.
This also means that there may not be any kind of project manager. If you’re looking for a well-groomed user story / epic / task list, you may be disappointed. People file issues as they come up. Some are well documented, and some are not. The upside is that when you work on open source, you’re often in direct contact with end users.
How to actually get started
Time to get to the actual advice instead of just pontificating.
- Go use it.
- Find a friction point.
- Fix it.
That is exactly what I did. PR #641 specifically is how I got started. I wanted to use DataFusion for a project (specifically, datafusion-python), so I was testing it out. I went looking for something that I thought must exist but couldn’t find it. I dug a little deeper, and then deeper still. As it turns out, the solution was pretty easy. I wrote up the code to fix it, and I submitted it.
In my mind, that really is the best way to get started. Use the product you wish to work on. For my new job, I did the same thing. Before I accepted the job offer, I downloaded their product. I used it. I wanted to know how well it worked, what the friction points were, and if it was something I would enjoy working on.
Also, to get back up on a soapbox for a minute: not enough software engineers use their own products. I believe the best products are built by the people who use them. I’ve found this to be the case at every one of my jobs. There are some good software engineers who focus only on doing their work assignments. But in general, I’ve found the best developers to be those who actively use and are curious about the products they are working on.
For applicants: How to get noticed
Since this blog post is prompted by students trying to get noticed so they can get into the Google Summer of Code program, here are a few things that I look for:
- A concrete idea of what you want to work on. Again, I would suggest going back up to my previous section. Go forth and use the product. Build data pipelines with it. Build analyses. Find what you like and what you don’t like. Use that to inform yourself about what you think is most worth working on.
- Contribute. I know people want to get involved in our project so they can get into the GSoC program and not just because they’re super excited about the project itself. Otherwise they probably would have come to us before GSoC added us as a project. As such, people are probably not interested in contributing unless they get accepted. The contradiction is that we’re going to be most interested in applicants who want to work on our project for the sake of working on it, regardless of whether they get into the GSoC program. The quality of your contributions will say more than anything else about how you would work and how well you understand the project itself.
- Quality. To me, it’s not just about the quantity of someone’s contributions. I am really looking at the quality of thought that went into them. From reviewing someone’s code it often becomes clear who really understands the things they’re working on. Thoughtfulness will pervade your code.
Closing Thoughts
Open Source is awesome.
The DataFusion community, specifically, is awesome.
The great challenge of working on these projects is that you never know who is going to be around, and what their strengths and weaknesses are. You have to build a community that people want to engage with.
