Ipynb To Markdown



A README is typically the front page of a project and should contain relevant information for current & prospective users. To keep documentation consistent across a project, imagine if we could include this README, the front page of our project, both on the repository and in the documentation. This post goes into how to set this workflow up. Find a live example of it implemented at: https://github.com/JackMcKew/pandas_alive.

A good starting structure for a project's README is:

  1. Intro - A short description & output (if applicable) of the project.

  2. Usage - A section on how the project is to be used (if applicable).

  3. Documentation - Link to documentation for the project.

  4. Contributing Guidelines - If this is an open source project, a note on whether contributions are welcome & instructions on how to get involved are well received.

  5. Changelog - Keeping a changelog of what is changing as the project evolves.

Other useful sections when applicable are requirements, future plans and inspiration.


Inspiration for This Post

  • One of my biggest mistakes with this blog was not finding a WordPress plugin that would allow me to write my posts with markdown; to this day I still need to write posts in “Visual” mode.

The inspiration for this post also comes from Pandas_Alive, wherein there are working examples with output hosted on the README. Initially, these lived in a generate_examples.py file, and as the package evolved, the code was copied over into code blocks in README.md to match the examples. If you can see where this is going: whenever new examples were made, copying over the code that generated them was often forgotten. This is very frustrating for new users of the package, as the examples simply don't work. Thus the workflow we go into in this post was adopted.

README.ipynb

In projects, it's typically best practice not to repeat yourself in multiple places (the DRY principle). In the README, it's nice to have working examples of how a user may use the project. If we could tie the original README to live code that generates the examples, that would be ideal; enter README.ipynb.

Jupyter supports markdown & code cells, thus all the current documentation in README.md can be copied into markdown cells. Similarly, the code used to generate examples or demonstrate usage can be placed in code cells. This allows the author to run the entire notebook, generating the new examples & verifying the examples are working code. Fantastic, this is exactly where we want to go.

Now if you only have the README.ipynb in the repository, GitHub will present the file in its raw form, JSON; that is, hundreds of lines like:
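For illustration, a minimal notebook's raw JSON looks roughly like this (a trimmed, illustrative sketch, not the actual Pandas_Alive README):

```json
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": ["# My Project\n", "A short description..."]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": ["print('hello')"]
    }
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 4
}
```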

This is not ideal whatsoever; it is nowhere near as attractive as the nicely rendered README.md. Enter nbconvert.

README.ipynb -> README.md with nbconvert

nbconvert is a package built to convert Jupyter notebooks to other formats and can be installed similarly to jupyter (e.g., pip install jupyter, pip install nbconvert). See the documentation at: https://nbconvert.readthedocs.io/en/latest/.

Now let's check the supported output types for nbconvert:

  • HTML,
  • LaTeX,
  • PDF,
  • Reveal.js HTML slideshow,
  • Markdown,
  • AsciiDoc,
  • reStructuredText,
  • executable script,
  • notebook.

nbconvert supports Markdown! Fantastic, we can add this step into our CI process (e.g., a GitHub Action). This will allow us to generate a new README.md whenever our README.ipynb changes.

In Pandas_Alive, we clear the output of the cells in README.ipynb with the flags: jupyter nbconvert --ClearMetadataPreprocessor.enabled=True --ClearOutput.enabled=True --to markdown README.ipynb.
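As a sketch, a GitHub Actions workflow running this conversion might look like the following (job names, action versions and the commit step are illustrative, not Pandas_Alive's actual workflow):

```yaml
name: convert-readme

on:
  push:
    paths:
      - README.ipynb

jobs:
  nbconvert:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install jupyter nbconvert
      - run: >
          jupyter nbconvert
          --ClearMetadataPreprocessor.enabled=True
          --ClearOutput.enabled=True
          --to markdown README.ipynb
      # Commit the regenerated README.md back to the repository
      - run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add README.md
          git commit -m "Regenerate README.md" || echo "No changes"
          git push
```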

Python Highlighting in Output

When first run, we noticed that nbconvert wasn't marking the code blocks with the language (python), which is required for language-specific highlighting of the code blocks in the README.md. The workaround for this was to use nbconvert's support for custom templates. See the docs at: https://nbconvert.readthedocs.io/en/latest/customizing.html#Custom-Templates.

The resulting template 'pythoncodeblocks.tpl' was:
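The template body did not survive in this copy of the post; a reconstruction, based on nbconvert's legacy Jinja .tpl mechanism (extend the built-in markdown template and override the input block so code cells get a python-tagged fence), would be:

````jinja
{% extends 'markdown.tpl' %}

{% block input %}
```python
{{ cell.source }}
```
{% endblock input %}
````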

Which could be used with nbconvert with:
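That is, pointing nbconvert at the custom template (a sketch using the legacy --template flag):

```shell
jupyter nbconvert --to markdown --template=pythoncodeblocks.tpl README.ipynb
```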

Integration into Documentation with Sphinx


If you haven't already, check out my previous post Automatically Generate Documentation with Sphinx. That post goes into detail on how to implement Sphinx to generate all of the documentation for a project from docstrings automatically.


Before going on, the live site of the documentation in reference can be reached at: https://jackmckew.github.io/pandas_alive/

Now, we've:

  1. Stored the working code & documentation for our project's front page in a Jupyter notebook, README.ipynb
  2. Converted README.ipynb into markdown format with nbconvert
  3. Inserted language-specific highlighting (python) into the code blocks within the markdown

The next step is to make the README content also live in the documentation.

Since Sphinx relies on the reStructuredText format, we'll need to convert README.md to README.rst. Enter m2r, a markdown to reStructuredText converter.

nbconvert could be used for this step instead of m2r; that said, this step was originally developed before README.ipynb existed, when only README.md was available. Please drop a comment if you try nbconvert instead of m2r for this step, along with your results!

Firstly, m2r can be installed with pip (pip install m2r) and we can convert README.md with the command m2r README.md which will generate README.rst in the same directory.

Now we need to include our README.rst in the documentation. After much tweaking, the documentation structure settled on for Pandas_Alive, using autosummary to generate documentation from docstrings automatically, was:
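As a sketch, the index.rst pulling the converted README into the toctree might look like this (the heading and any file names other than README.rst and developer.rst are illustrative):

```rst
Pandas_Alive documentation
==========================

.. toctree::
   :maxdepth: 2

   README
   developer
```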

Autosummary-generated documentation is included within a separate rst file (developer.rst) so that everything generated with autosummary nests under one heading with the ReadTheDocs theme.

Integration with GitHub Actions

All the steps above mentioned are currently being used to maintain the project Pandas_Alive.

Find the GitHub Action yml files at: https://github.com/JackMcKew/pandas_alive/tree/master/.github/workflows

Find the Sphinx configuration files at: https://github.com/JackMcKew/pandas_alive/tree/master/docs

Having migrated my site from Hugo Academic to {{distill}}, I wanted to see if I could fold in python-rendered notebooks with ease, and what the results would look like. Suffice it to say the process has been painless so far …

  1. create the usual Rmd for the post,
  2. export the ipynb notebook from jupyter as a markdown file (.md),
  3. copy-and-paste the md file’s content into the post’s Rmd file, and
  4. add some css.

I had some pretty long and wide tables in the notebook that did not fit within the page-width and so I had to use some css to create a scrollable div ([@braican's solution found here](https://stackoverflow.com/a/17451132)) and voila!
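The css in question can be as small as a scrollable wrapper around wide tables, something like (selector name illustrative):

```css
/* Let wide tables scroll horizontally instead of overflowing the page */
.scrollable-table {
  display: block;
  overflow-x: auto;
  width: 100%;
}
```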

We will start with importing some libraries we need and then play with some data to understand basic python commands. What data shall we work with? Well, let us pull down some data on criminal incidents that were reported.

First we install a particular library called pandas. In the command that follows, note that pd is just the alias that pandas assumes, so that we can type pd and have all the pandas commands at our disposal.

The crime incident reports data are available here and span multiple years so we may end up working only with 2019 data but for now we proceed by gathering everything.

In the command below, the key part is pd.read_csv(), and inside it is the URL for the comma-separated values file. Once the file is downloaded by pandas, we save it in python under the name df.
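The step above can be sketched as follows (the data portal's URL is omitted here; a tiny inline CSV stands in for the real download so the sketch is self-contained):

```python
import io

import pandas as pd

# The real post downloads the CSV straight from the data portal:
# df = pd.read_csv("<crime incident reports CSV URL>")
# A tiny inline CSV keeps this sketch self-contained.
csv_text = (
    "INCIDENT_NUMBER,OFFENSE_CODE,YEAR,MONTH\n"
    "I0001,3301,2019,10\n"
    "I0002,3115,2020,7\n"
)
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 4): two rows, four columns
```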

Note that the first five rows of the dataframe look like this:

[Output: a 5 × 17 table with columns INCIDENT_NUMBER, OFFENSE_CODE, OFFENSE_CODE_GROUP, OFFENSE_DESCRIPTION, DISTRICT, REPORTING_AREA, SHOOTING, OCCURRED_ON_DATE, YEAR, MONTH, DAY_OF_WEEK, HOUR, UCR_PART, STREET, Lat, Long and Location. The first rows include offenses such as ASSAULT - AGGRAVATED, VERBAL DISPUTE, THREATS TO DO BODILY HARM and INVESTIGATE PERSON, dated 2019–2020.]

What about the last 10 rows of the data?

[Output: the last ten rows of the dataframe (row labels 515072–515081), with the same columns as above. Offenses include INVESTIGATE PERSON, THREATS TO DO BODILY HARM, SICK/INJURED/MEDICAL - POLICE, BIOLOGICAL THREATS, INVESTIGATE PROPERTY, MISSING PERSON, WEAPON VIOLATION - CARRY/ POSSESSING/ SALE and BURGLARY - COMMERICAL, dated between November 2019 and September 2020.]

Let us look at the contents of the numeric columns with describe():

|       | OFFENSE_CODE  | YEAR          | MONTH         | HOUR          | Lat           | Long          |
| ----- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| count | 515082.000000 | 515082.000000 | 515082.000000 | 515082.000000 | 485909.000000 | 485909.000000 |
| mean  | 2333.275632   | 2017.542933   | 6.634194      | 13.079170     | 42.239043     | -70.949353    |
| std   | 1182.489822   | 1.543329      | 3.317964      | 6.347259      | 1.891645      | 3.060012      |
| min   | 111.000000    | 2015.000000   | 1.000000      | 0.000000      | -1.000000     | -71.203312    |
| 25%   | 1102.000000   | 2016.000000   | 4.000000      | 9.000000      | 42.296861     | -71.097465    |
| 50%   | 3005.000000   | 2018.000000   | 7.000000      | 14.000000     | 42.325029     | -71.077723    |
| 75%   | 3201.000000   | 2019.000000   | 9.000000      | 18.000000     | 42.348312     | -71.062562    |
| max   | 3831.000000   | 2020.000000   | 12.000000     | 23.000000     | 42.395042     | 0.000000      |

By default the command will report the values with decimals but we may not want that. Decimals can be rounded or removed altogether as shown below.

Rounded to two decimal places:

|       | OFFENSE_CODE | YEAR      | MONTH     | HOUR      | Lat       | Long      |
| ----- | ------------ | --------- | --------- | --------- | --------- | --------- |
| count | 515082.00    | 515082.00 | 515082.00 | 515082.00 | 485909.00 | 485909.00 |
| mean  | 2333.28      | 2017.54   | 6.63      | 13.08     | 42.24     | -70.95    |
| std   | 1182.49      | 1.54      | 3.32      | 6.35      | 1.89      | 3.06      |
| min   | 111.00       | 2015.00   | 1.00      | 0.00      | -1.00     | -71.20    |
| 25%   | 1102.00      | 2016.00   | 4.00      | 9.00      | 42.30     | -71.10    |
| 50%   | 3005.00      | 2018.00   | 7.00      | 14.00     | 42.33     | -71.08    |
| 75%   | 3201.00      | 2019.00   | 9.00      | 18.00     | 42.35     | -71.06    |
| max   | 3831.00      | 2020.00   | 12.00     | 23.00     | 42.40     | 0.00      |

Rounded to whole numbers:

|       | OFFENSE_CODE | YEAR     | MONTH    | HOUR     | Lat      | Long     |
| ----- | ------------ | -------- | -------- | -------- | -------- | -------- |
| count | 515082.0     | 515082.0 | 515082.0 | 515082.0 | 485909.0 | 485909.0 |
| mean  | 2333.0       | 2018.0   | 7.0      | 13.0     | 42.0     | -71.0    |
| std   | 1182.0       | 2.0      | 3.0      | 6.0      | 2.0      | 3.0      |
| min   | 111.0        | 2015.0   | 1.0      | 0.0      | -1.0     | -71.0    |
| 25%   | 1102.0       | 2016.0   | 4.0      | 9.0      | 42.0     | -71.0    |
| 50%   | 3005.0       | 2018.0   | 7.0      | 14.0     | 42.0     | -71.0    |
| 75%   | 3201.0       | 2019.0   | 9.0      | 18.0     | 42.0     | -71.0    |
| max   | 3831.0       | 2020.0   | 12.0     | 23.0     | 42.0     | 0.0      |
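The rounding shown above can be sketched like so (a toy HOUR column stands in for the full dataset):

```python
import pandas as pd

# describe() reports count/mean/std/min/quartiles/max for numeric columns.
df = pd.DataFrame({"HOUR": [0, 9, 14, 18, 23]})
summary = df.describe()

rounded_two = summary.round(2)   # two decimal places
rounded_zero = summary.round(0)  # whole numbers only
print(rounded_two.loc["mean", "HOUR"])   # 12.8
print(rounded_zero.loc["mean", "HOUR"])  # 13.0
```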

Note a few things here.

  • We have a total of 515082 incidents logged, but latitude and longitude are available for no more than 485909 incidents.

Say we want to restrict the dataframe just to 2020. How can we do that?

Notice the sequence here: dataframe[ dataframe['column-name'] == somevalue ], and pay attention to the double equals sign, which is a strict equality test rather than an assignment.

|       | OFFENSE_CODE | YEAR    | MONTH        | HOUR         | Lat          | Long         |
| ----- | ------------ | ------- | ------------ | ------------ | ------------ | ------------ |
| count | 63733.000000 | 63733.0 | 63733.000000 | 63733.000000 | 62200.000000 | 62200.000000 |
| mean  | 2353.137323  | 2020.0  | 4.900554     | 12.923525    | 42.319872    | -71.084193   |
| std   | 1182.670996  | 0.0     | 2.561463     | 6.566899     | 0.032339     | 0.030578     |
| min   | 111.000000   | 2020.0  | 1.000000     | 0.000000     | 42.181845    | -71.203312   |
| 25%   | 1001.000000  | 2020.0  | 3.000000     | 9.000000     | 42.295353    | -71.098579   |
| 50%   | 3005.000000  | 2020.0  | 5.000000     | 14.000000    | 42.321918    | -71.078444   |
| 75%   | 3207.000000  | 2020.0  | 7.000000     | 18.000000    | 42.344561    | -71.062000   |
| max   | 3831.000000  | 2020.0  | 9.000000     | 23.000000    | 42.395041    | -70.953726   |
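The 2020 restriction can be sketched as follows (toy data; column names as in the post):

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR": [2019, 2020, 2020],
    "OFFENSE_CODE": [3301, 3115, 801],
})
# Boolean mask: == tests equality; a single = would be an assignment.
df_2020 = df[df["YEAR"] == 2020]
print(len(df_2020))  # 2
```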

At this point we might be curious to know what types of offenses are most often reported. Before we do that, however, let us also see how many unique values of OFFENSE_CODE there are.

So code 3301 leads with 6234 reports in 2020, followed by code 3115, then 801, then 3005, and then 3831. Code 3005 is missing from their list so we have no idea what it is!! That is a crime in itself.

|     | OFFENSE_CODE | count |
| --- | ------------ | ----- |
| 109 | 3301         | 6234  |
| 95  | 3115         | 5494  |
| 23  | 801          | 3908  |
| 82  | 3005         | 3227  |
| 129 | 3831         | 2700  |
| …   | …            | …     |
| 86  | 3016         | 2     |
| 26  | 990          | 1     |
| 106 | 3203         | 1     |
| 76  | 2672         | 1     |
| 65  | 2628         | 1     |


130 rows × 2 columns
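The counting above can be sketched with nunique() and value_counts() (toy data; value_counts sorts by frequency, most common first):

```python
import pandas as pd

df = pd.DataFrame({"OFFENSE_CODE": [3301, 3301, 3115, 3301, 801]})
print(df["OFFENSE_CODE"].nunique())  # 3 distinct codes

counts = df["OFFENSE_CODE"].value_counts()
print(counts.index[0], counts.iloc[0])  # 3301 3 (the most frequent code)
```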


Not bad. I suppose that by forcing fixed headers and some other aesthetic tuning of the table rendering, this could be a pretty efficient solution for adding python notebooks. I am really coming to love distill!!